Can back-of-the-book indexes be automatically created?

Creating back-of-the-book indexes remains one of the few manual tasks related to publishing. Inspired by how human indexers create back-of-the-book indexes, we present a new domain-independent, corpus-free, and training-free automation approach. Given a book, index terms are selected sequentially according to an indexability score derived from the structural information in the book as well as a novel context-aware term informativeness measure that draws on a web knowledge base such as Wikipedia. Through extensive experiments on books from various domains, we show our approach to be more effective and practical than previous approaches based on keyword extraction and supervised learning.
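To make the selection step concrete, the following is a minimal sketch of sequential term selection by score. It is an illustration only, not the scoring used in the paper: the `Candidate` fields, the weighted-sum `indexability` function, the `alpha` weight, and the example values are all hypothetical placeholders for the structure signal and the context-aware informativeness measure described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    term: str
    structure_score: float  # hypothetical: e.g. higher if the term occurs in a heading
    informativeness: float  # hypothetical: stand-in for a context-aware measure


def indexability(c: Candidate, alpha: float = 0.5) -> float:
    """Assumed combination: a weighted sum of the two signals."""
    return alpha * c.structure_score + (1.0 - alpha) * c.informativeness


def select_index_terms(candidates, budget, alpha=0.5):
    """Greedily pick up to `budget` terms, highest indexability first."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda c: indexability(c, alpha))
        selected.append(best.term)
        remaining.remove(best)
    return selected


# Made-up candidates and scores, for illustration only.
candidates = [
    Candidate("hash table", 0.9, 0.8),
    Candidate("example", 0.1, 0.05),
    Candidate("amortized analysis", 0.4, 0.9),
    Candidate("chapter", 0.7, 0.0),
]
print(select_index_terms(candidates, budget=2))
```

With these made-up scores, a term that is strong on only one signal (such as "chapter", which is structurally prominent but uninformative) ranks below terms that score well on both. Changing `alpha` shifts the balance between the two signals.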
