Classification and Clustering: Problems for the Future

This paper reviews various basic achievements in classification during the last fifteen years and points to a series of unsolved mathematical, statistical and applied problems. It suggests the investigation of new methodological aspects, a better adaptation between methods and applications, the extension of cluster and data analysis into fields like information processing, machine learning and artificial intelligence, and a formal investigation of information retrieval problems in the clustering and database framework. Furthermore, we comment on computational aspects and software tools required for future applications.
