Multilingual single document keyword extraction for information retrieval

Keywords play an important role in many aspects of information retrieval (IR). From Web search to text summarization, good keywords are a necessity. Typical IR systems rely on algorithms that require the entire document collection to be built beforehand. While some research has addressed keyword extraction from a single document, the quality of the resulting keywords has not been judged by how well they perform in IR tasks. Moreover, those methods were designed for only one language, and their applicability to other languages is unknown. This paper therefore proposes a new algorithm that is applicable to multiple languages and extracts effective keywords that, to a high degree, uniquely identify a document. It needs only a single document to extract keywords and does not rely on machine learning methods. The algorithm was tested on a Japanese-English bilingual corpus and a portion of the Reuters corpus using a keyword search algorithm. The results show that the extracted keywords do a good job of uniquely identifying the documents.
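To make the evaluation criterion concrete, the following is a minimal sketch of what it could mean for keywords to "uniquely identify" a document under a simple keyword search. It is not the algorithm or the search procedure used in the paper: the boolean AND matching, the regular-expression tokenizer, the function names, and the toy documents are all assumptions chosen for illustration. The tokenizer in particular suits only space-delimited languages such as English; Japanese text would first need a morphological analyzer to segment words.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens.

    Assumption: words are separated by non-word characters, which holds
    for English but not for unsegmented languages such as Japanese.
    """
    return set(re.findall(r"\w+", text.lower()))


def matching_docs(keywords, docs):
    """Return ids of documents that contain every keyword (boolean AND)."""
    wanted = {k.lower() for k in keywords}
    return [doc_id for doc_id, text in docs.items() if wanted <= tokenize(text)]


def uniquely_identifies(keywords, doc_id, docs):
    """True if the keyword search retrieves doc_id and nothing else."""
    return matching_docs(keywords, docs) == [doc_id]


# Hypothetical toy collection, not taken from either corpus in the paper.
docs = {
    "d1": "Oil prices rose sharply after the OPEC meeting.",
    "d2": "Wheat exports rose as prices fell.",
    "d3": "The central bank held interest rates steady.",
}

print(uniquely_identifies(["oil", "opec"], "d1", docs))     # True
print(uniquely_identifies(["prices", "rose"], "d1", docs))  # False
```

In this toy setting, "oil" and "opec" retrieve only d1, whereas "prices" and "rose" also match d2 and therefore fail to single out one document.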
