Domain-independent automatic keyphrase indexing with small training sets

Keyphrases are widely used in both physical and digital libraries as a brief but precise summary of documents. They help organize material based on content, provide thematic access, represent search results, and assist with navigation. Manual assignment is expensive because trained human indexers must reach an understanding of the document and select appropriate descriptors according to defined cataloguing rules. We propose a new method that enhances automatic keyphrase extraction by using semantic information about terms and phrases gleaned from a domain-specific thesaurus. The key advantage of the new approach is that it performs well with very little training data. We evaluate it on a large set of manually indexed documents in the domain of agriculture, compare its consistency with a group of six professional indexers, and explore its performance on smaller collections of documents in other domains and on documents in French and Spanish.
