HMM-based passage models for document classification and ranking

We present an application of Hidden Markov Models to supervised document classification and ranking. We consider a family of models that account for the fact that relevant documents may contain irrelevant passages. The originality of the model is that it does not explicitly segment documents; instead, it considers all possible segmentations in its final score. The model generalizes the multinomial Naive Bayes model and is derived from a more general model for different access tasks. It is evaluated on the Reuters test collection and compared to the multinomial Naive Bayes model. It is shown to be more robust with respect to training set size and to improve performance for both ranking and classification, especially for classes with few training examples.
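The idea of scoring a document over all possible segmentations, rather than committing to one, can be illustrated with the standard HMM forward algorithm. The sketch below is only an illustration under stated assumptions, not the topology or parameterization used in the paper: it assumes two states (state 0 emits words from a class-specific "relevant" unigram distribution, state 1 from a background "irrelevant" distribution), and the function name `passage_hmm_loglik`, the toy vocabulary, and all probabilities are hypothetical.

```python
import math


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def passage_hmm_loglik(words, emit, init, trans):
    """Log-likelihood of a word sequence under an HMM (forward algorithm).

    The result sums over every state sequence, i.e. over every way of
    labelling the words as relevant or irrelevant, so no single
    segmentation is ever chosen.
    emit[s][w]  : probability that state s emits word w
    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    """
    if not words:
        return 0.0
    n_states = len(init)
    alpha = [safe_log(init[s]) + safe_log(emit[s][words[0]])
             for s in range(n_states)]
    for w in words[1:]:
        alpha = [
            logsumexp([alpha[r] + safe_log(trans[r][s])
                       for r in range(n_states)])
            + safe_log(emit[s][w])
            for s in range(n_states)
        ]
    return logsumexp(alpha)


# Hypothetical toy parameters: state 0 = relevant, state 1 = irrelevant.
emit = [
    {"wheat": 0.6, "price": 0.3, "the": 0.1},
    {"wheat": 0.1, "price": 0.2, "the": 0.7},
]
doc = ["the", "wheat", "price", "the"]

# Two-state model: the document may switch between relevant and
# irrelevant passages at any position.
hmm_score = passage_hmm_loglik(doc, emit, [0.5, 0.5],
                               [[0.8, 0.2], [0.3, 0.7]])

# Degenerate model: always in the relevant state. The score is then the
# sum of per-word log-probabilities, as in multinomial Naive Bayes
# (up to the multinomial coefficient, which does not depend on the class).
nb_score = passage_hmm_loglik(doc, emit, [1.0, 0.0],
                              [[1.0, 0.0], [0.0, 1.0]])
```

With the degenerate transition matrix the forward recursion collapses to a single path, which is one simple way to see how a model of this kind can contain multinomial Naive Bayes as a special case. In a classification or ranking setting, one such score would be computed per class and the scores compared; how the paper normalizes and combines them is not specified in the abstract.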
