Text Mining Using Data Compression Models

The idea of using data compression algorithms for machine learning has been reinvented many times. Intuitively, compact representations of data are possible only if statistical regularities exist in the data. Compression algorithms identify such patterns and build statistical models to describe them. This ability to learn patterns from data makes compression methods attractive for machine learning purposes.

In this thesis, we propose several novel text mining applications of data compression algorithms. We introduce a compression-based method for instance selection, capable of extracting a diverse subset of documents that are representative of a larger document collection. The quality of the sample is measured by how well a compression model, trained on the subset, is able to predict held-out reference data. The method is useful for initializing k-means clustering, and as a pool-based active learning strategy for supervised training of text classifiers. When using compression models for classification, we propose that trained models should be sequentially adapted when evaluating the probability of the classified document. We justify this approach in terms of the minimum description length principle, and show that adaptation improves performance for online filtering of email spam.

Our research contributes to the state of the art in applied machine learning in two significant application domains. We propose the use of compression models for spam filtering, and show that compression-based filters are superior to traditional tokenization-based filters and competitive with the best known methods for this task. We also consider the use of compression models for lexical stress assignment, a problem in Slovenian speech synthesis, and demonstrate that compression models perform well on this task, while requiring fewer resources than competing methods.

The topic of this thesis is text mining. However, most of the proposed methods are more general and are designed for learning with arbitrary discrete sequences.
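
As a rough illustration of classification with compression models, the sketch below assigns a document to the class whose training text allows it to be encoded in the fewest additional bytes. It uses Python's general-purpose `zlib` compressor as a stand-in, which is an assumption made for brevity and not the compression models studied in the thesis; the function names and the toy corpora are likewise hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs to encode `data` at maximum compression."""
    return len(zlib.compress(data, 9))


def extra_cost(training: bytes, document: bytes) -> int:
    """Approximate cost of encoding `document` after having seen `training`.

    Assumption: the difference in compressed size is used as a crude proxy
    for the code length of the document under a model trained on `training`.
    """
    return compressed_size(training + document) - compressed_size(training)


def classify(document: str, corpora: Mapping[str, str]) -> str:
    """Return the label whose training text yields the smallest extra cost."""
    doc = document.encode("utf-8")
    return min(
        corpora,
        key=lambda label: extra_cost(corpora[label].encode("utf-8"), doc),
    )


# Toy training text per class (hypothetical data, for illustration only).
corpora = {
    "spam": "buy cheap pills now, limited offer, click here to win money. " * 3,
    "ham": "meeting tomorrow at noon, please review the attached project report. " * 3,
}

label = classify("limited offer, click here to win money", corpora)
print(label)
```

Because the compressor keeps updating its internal state while it encodes the appended document, this concatenation trick loosely mimics adapting the model during evaluation. It is only an approximation of that idea, since `zlib` does not expose explicit symbol probabilities.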
