Discovery of Novel Term Associations in a Document Collection

We propose a method to mine novel, document-specific associations between terms in a collection of unstructured documents. We believe that documents are often best described by the relationships they establish. This is also evidenced by the popularity of conceptual maps, mind maps, and other similar methodologies to organize and summarize information. Our goal is to discover term relationships that can be used to construct conceptual maps or so-called BisoNets. The model we propose, tpf-idf-tpu, looks for pairs of terms that are associated in an individual document. It considers three aspects, two of which have been generalized from tf-idf to term pairs: term pair frequency (tpf; importance for the document), inverse document frequency (idf; uniqueness in the collection), and term pair uncorrelation (tpu; independence of the terms). The last component is needed to filter out statistically dependent pairs that are not likely to be considered novel or interesting by the user. We present experimental results on two collections of documents: one extracted from Wikipedia, and one containing text mining articles with manually assigned term associations. The results indicate that the tpf-idf-tpu method can discover novel associations, that these associations differ from simply pairing tf-idf keywords, and that they better match the subjective associations of a reader.
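
To make the three components concrete, the following is a minimal sketch of how a tpf-idf-tpu style score could be computed. The abstract does not give the formulas, so everything below is an assumption made for illustration: tpf is taken as the fraction of sentences in the document where both terms co-occur, idf as the logarithm of the collection size divided by the number of documents containing the pair, and tpu as the smaller of the two term document frequencies divided by the pair's document frequency. The function name `score_pairs` and the input layout are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def score_pairs(docs, doc_index):
    """Score term pairs of one document with an illustrative tpf * idf * tpu.

    docs: list of documents; each document is a list of sentences;
          each sentence is a list of terms (strings).
    All three component formulas are assumptions, not the paper's definitions.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents where the pair shares a sentence
    for doc in docs:
        terms, pairs = set(), set()
        for sentence in doc:
            unique = sorted(set(sentence))
            terms.update(unique)
            pairs.update(combinations(unique, 2))
        term_df.update(terms)
        pair_df.update(pairs)

    target = docs[doc_index]
    pair_sentences = Counter()  # sentences of the target containing each pair
    for sentence in target:
        pair_sentences.update(combinations(sorted(set(sentence)), 2))

    scores = {}
    for pair, count in pair_sentences.items():
        t1, t2 = pair
        tpf = count / len(target)                # importance for the document
        idf = math.log(n_docs / pair_df[pair])   # uniqueness in the collection
        tpu = min(term_df[t1], term_df[t2]) / pair_df[pair]  # assumed independence proxy
        scores[pair] = tpf * idf * tpu
    return scores


docs = [
    [["graph", "mining"], ["graph", "mining"], ["text", "mining"]],
    [["text", "mining"], ["text", "corpus"]],
    [["text", "mining"], ["corpus", "text"]],
]
scores = score_pairs(docs, 0)
best_pair = max(scores, key=scores.get)
```

In this toy collection the pair ("mining", "text") appears in every document, so its assumed idf is zero and it is suppressed, while ("graph", "mining") is specific to the first document and ranks highest. Under the assumed tpu, a pair whose terms nearly always appear together gets a value close to 1, while a pair of terms that mostly occur apart in the collection is weighted up.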