Of maps bigger than the empire

In a passage by J.L. Borges on the "exactitude of Science," a fictitious author describes an Empire in which the art of Cartography "attained such perfection that the map of a single Province occupied the whole City, and the map of the Empire a whole Province." In time, even these huge maps no longer sufficed, and the Colleges of Cartographers erected a map of the Empire that equalled in width the Empire itself... This paper is concerned with the growing number of cases in pattern discovery and data mining in which synopses, indices and the relationships among them seem to grow faster and bigger than the phenomena they were meant to encapsulate. It then reviews specific examples of algorithmic and combinatorial constructs that proved capable of alleviating such paradoxes in the author's recent work.
