A Document Descriptor Extractor Based on Relevant Expressions

People are often asked to assign keywords to documents so that applications can access a summary of each document's core content. This motivated our work on an approach that helps replace this manual procedure with an automatic one. Since Relevant Expressions (REs), or multi-word term expressions, can be extracted automatically with the LocalMaxs algorithm, the most relevant of them can be used to describe the core content of each document. In this paper we present a language-independent approach for the automatic generation of document descriptors. We show results for three different European languages and compare different metrics for selecting the most informative REs of each document.
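
To make the selection step concrete, the following is a minimal sketch of how the most informative REs of each document might be chosen once candidate expressions are available. It assumes the candidates have already been extracted (for example by LocalMaxs, which is not implemented here), and it uses a plain tf-idf score as a stand-in metric; the abstract does not name the metrics that were compared, so the scoring formula, the function name `select_descriptors`, the `top_k` parameter, and the sample data are all illustrative assumptions rather than the method of the paper.

```python
import math
from collections import Counter


def select_descriptors(doc_expressions, top_k=3):
    """Pick the top_k highest-scoring expressions of each document.

    doc_expressions maps a document id to the list of candidate
    expressions found in it (repeats included). The tf-idf score is an
    assumed stand-in for the metrics compared in the paper.
    """
    n_docs = len(doc_expressions)

    # Document frequency: in how many documents each expression occurs.
    doc_freq = Counter()
    for expressions in doc_expressions.values():
        doc_freq.update(set(expressions))

    descriptors = {}
    for doc_id, expressions in doc_expressions.items():
        counts = Counter(expressions)
        total = sum(counts.values())
        # Relative frequency in the document times log inverse document frequency.
        scores = {
            expr: (count / total) * math.log(n_docs / doc_freq[expr])
            for expr, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda e: (-scores[e], e))
        descriptors[doc_id] = ranked[:top_k]
    return descriptors


if __name__ == "__main__":
    # Hypothetical candidate expressions for three short documents.
    docs = {
        "d1": ["common market", "common market", "european union", "fishing quota"],
        "d2": ["european union", "human rights", "human rights"],
        "d3": ["european union", "monetary policy"],
    }
    for doc_id, chosen in select_descriptors(docs, top_k=2).items():
        print(doc_id, chosen)
```

Because the score depends only on counts of expressions, a sketch like this needs no language-specific resources, which is in keeping with the language-independent aim stated above. An expression that occurs in every document receives a score of zero and is ranked last.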
