Automatic Extraction of Document Topics

A keyword or topic of a document is a word or multi-word (a sequence of two or more words) that by itself summarizes part of the document's content. In this paper we compare several statistics-based, language-independent methodologies for extracting keywords automatically. We rank words, multi-words, and word prefixes (of a fixed length of 5 characters) using several similarity measures, some widely known and some newly coined, and we evaluate both the results obtained and the agreement between evaluators. Experiments were carried out on Portuguese, English, and Czech.
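To make the ranking setup concrete, the following is a minimal sketch of how the three candidate types (words, multi-words, and 5-character prefixes) could be generated and ranked for one document. It uses plain tf-idf as a stand-in for a widely known measure; the abstract does not name the measures actually compared, so the scoring function, the restriction of multi-words to bigrams, the tokenizer, and all function names here are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

PREFIX_LEN = 5  # fixed prefix length stated in the abstract


def tokenize(text):
    """Lowercase and split on non-word characters (an assumed, naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def candidates(text, kind):
    """Return the candidate terms of a document for one candidate type."""
    tokens = tokenize(text)
    if kind == "word":
        return tokens
    if kind == "multiword":
        # Assumption: only contiguous two-word sequences are considered here.
        return [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    if kind == "prefix":
        # Words shorter than the prefix length are skipped in this sketch.
        return [t[:PREFIX_LEN] for t in tokens if len(t) >= PREFIX_LEN]
    raise ValueError(f"unknown candidate kind: {kind}")


def rank_by_tfidf(corpus, doc_index, kind, top=3):
    """Rank the candidates of corpus[doc_index] by tf-idf, highest first."""
    per_doc = [Counter(candidates(doc, kind)) for doc in corpus]
    target = per_doc[doc_index]
    total = sum(target.values())
    n_docs = len(corpus)
    scores = {}
    for term, count in target.items():
        doc_freq = sum(1 for counts in per_doc if term in counts)
        scores[term] = (count / total) * math.log(n_docs / doc_freq)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


corpus = [
    "keyword extraction ranks keyword candidates",
    "the parser ranks syntax trees",
    "the weather is sunny today",
]
for kind in ("word", "multiword", "prefix"):
    print(kind, rank_by_tfidf(corpus, 0, kind))
```

Because the sketch relies only on counts and character prefixes, it needs no language-specific resources such as stemmers or part-of-speech taggers, which is the property the abstract refers to as language independence.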
