Beyond the Bag of Words: A Text Representation for Sentence Selection

Sentence selection shares some, but not all, of the characteristics of automatic text categorization, so some, but not all, of the same techniques should be used. In this paper we study a syntactically and semantically enriched text representation for the sentence selection task in a genomics corpus. We show that using technical dictionaries and syntactic relations is beneficial for our problem when using state-of-the-art machine learning algorithms. Furthermore, the syntactic relations can be used by a first-order rule learner to obtain even better performance.
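
As a rough illustration of what going beyond the bag of words can mean, the sketch below builds a feature bag for one sentence from three sources: the words themselves, semantic classes looked up in a technical dictionary, and syntactic relations between words. It is not the representation used in the paper. The dictionary entries, the class name `BIO_ENTITY`, the relation labels, the example sentence, and the function name `enriched_features` are all assumptions made for illustration, and the relations are assumed to come from an external parser.

```python
from collections import Counter

# Hypothetical technical dictionary: surface term -> semantic class.
# The entries and the class name are illustrative assumptions.
TECH_DICT = {"gere": "BIO_ENTITY", "sigk": "BIO_ENTITY"}


def enriched_features(tokens, relations, dictionary=TECH_DICT):
    """Return a bag of word, semantic-class, and syntactic-relation features.

    tokens: list of word strings for one sentence.
    relations: list of (label, head_index, dependent_index) triples,
        assumed to be produced by an external syntactic parser.
    """
    features = Counter()
    lowered = [t.lower() for t in tokens]
    for word in lowered:
        # Plain bag-of-words feature.
        features["w=" + word] += 1
        # Semantic feature when the word is in the technical dictionary.
        if word in dictionary:
            features["sem=" + dictionary[word]] += 1
    for label, head, dep in relations:
        # Syntactic feature: relation label plus the two words it links.
        features["rel=%s(%s,%s)" % (label, lowered[head], lowered[dep])] += 1
    return features


# Hypothetical sentence with hand-written relations (no parser is run here).
tokens = ["GerE", "inhibits", "sigK"]
relations = [("subj", 1, 0), ("obj", 1, 2)]
feats = enriched_features(tokens, relations)
```

A flat feature bag like this can be passed to a propositional learner. A first-order rule learner would instead take the relations as logical facts rather than as flattened strings.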
