Tagging documents using neural networks based on local word features

Keywords and key-phrases that concisely represent text documents are integral to many knowledge management and text information retrieval systems, as well as to digital libraries in general. Not all text documents, however, are annotated with good keywords, and the quality of these keywords often depends on a tedious, sometimes manual, extraction and tagging process. To extract high-quality keywords automatically without a semantic analysis of the document, it is shown that artificial neural networks (ANNs) can be trained to consider only in-document word features such as word frequency, word distribution in the document, use of the word in special parts of the document, and use of word formatting features (i.e., bold face, italics, large font size). Results show that purely local features are adequate for determining whether a word in a document is a keyword. Classification performance yields a G-mean of at least 0.83 and a weighted F-measure of 0.96 over both keywords and non-keywords. Precision for keywords alone, however, is not as high. To understand the basis for classifying keywords, C4.5 is used to extract rules from the ANN. The extracted rules, in the form of a decision tree, show the relative importance of the different document features that were extracted.
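
The reported metrics can be illustrated with a small sketch. It assumes the common definitions for imbalanced data: the G-mean as the geometric mean of the true-positive rate and true-negative rate, and the weighted F-measure as the per-class F-measures averaged by class size. The abstract does not spell out its exact formulas, and the confusion-matrix counts below are hypothetical, chosen only to show how a high weighted F-measure can coexist with modest keyword precision when non-keywords dominate.

```python
from math import sqrt


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (keyword recall) and specificity."""
    sensitivity = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    return sqrt(sensitivity * specificity)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)


def weighted_f_measure(tp, fn, tn, fp):
    """Per-class F-measures averaged by the number of true instances."""
    f_kw = f_measure(safe_div(tp, tp + fp), safe_div(tp, tp + fn))
    f_non = f_measure(safe_div(tn, tn + fn), safe_div(tn, tn + fp))
    n_kw, n_non = tp + fn, tn + fp
    return safe_div(n_kw * f_kw + n_non * f_non, n_kw + n_non)


# Hypothetical counts (not from the paper): 50 keywords, 950 non-keywords.
tp, fn, tn, fp = 40, 10, 900, 50
keyword_precision = safe_div(tp, tp + fp)

print(f"G-mean:             {g_mean(tp, fn, tn, fp):.3f}")
print(f"Weighted F-measure: {weighted_f_measure(tp, fn, tn, fp):.3f}")
print(f"Keyword precision:  {keyword_precision:.3f}")
```

Because the non-keyword class is much larger, its F-measure dominates the weighted average, which is why keyword precision is worth reporting separately.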
