Text Mining Using Markov Chains of Variable Length

When dealing with knowledge federation over text documents one has to figure out whether or not documents are related by context. A new approach is proposed to solve this problem. This leads to the design of a new search engine for literature research and related problems. The idea is that one has already some documents of interest. These documents are taken as input. Then all documents known to a classical search engine are ranked according to their relevance. For achieving this goal we use Markov chains of variable length. The algorithms developed have been implemented and testing over the Reuters-21578 data set has been performed.

[1]  Heikki Mannila,et al.  Principles of Data Mining , 2001, Undergraduate Topics in Computer Science.

[2]  Lutz Dümbgen Stochastik für Informatiker , 2003 .

[3]  Karen Spärck Jones,et al.  Natural language processing for information retrieval , 1996, CACM.

[4]  Dan Roth,et al.  Understanding Probabilistic Classifiers , 2001, ECML.

[5]  Simon Parsons,et al.  Principles of Data Mining by David J. Hand, Heikki Mannila and Padhraic Smyth, MIT Press, 546 pp., £34.50, ISBN 0-262-08290-X , 2004, The Knowledge Engineering Review.

[6]  S. Robertson The probability ranking principle in IR , 1997 .

[7]  Ted Dunning,et al.  Accurate Methods for the Statistics of Surprise and Coincidence , 1993, CL.

[8]  Feller William,et al.  An Introduction To Probability Theory And Its Applications , 1950 .

[9]  Johannes Fürnkranz,et al.  A Study Using $n$-gram Features for Text Categorization , 1998 .

[10]  Athanasios Papoulis,et al.  Probability, Random Variables and Stochastic Processes , 1965 .

[11]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[12]  David D. Lewis,et al.  Feature Selection and Feature Extraction for Text Categorization , 1992, HLT.

[13]  Norbert Fuhr,et al.  Probabilistic Models in Information Retrieval , 1992, Comput. J..

[14]  J. Doob Stochastic processes , 1953 .

[15]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[16]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[17]  Michael A. Harrison,et al.  Introduction to formal language theory , 1978 .

[18]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[19]  Dana Ron,et al.  The power of amnesia: Learning probabilistic automata with variable memory length , 1996, Machine Learning.

[20]  Naftali Tishby,et al.  Discriminative Feature Selection via Multiclass Variable Memory Markov Model , 2002, EURASIP J. Adv. Signal Process..