Learning Document Similarity Using Natural Language Processing

The recent considerable growth in the amount of easily available on-line text has brought to the foreground the need for large-scale natural language processing tools for text data mining. In this paper we address the problem of organizing documents into meaningful groups according to their content and of visualizing a text collection, providing an overview of the range of documents and of their relationships so that they can be browsed more easily. We use Self-Organizing Maps (SOMs) (Kohonen 1984). Creating these maps poses considerable efficiency challenges. We study linguistically motivated ways of reducing the representation of a document to increase efficiency, as well as ways to disambiguate the words in the documents.
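
As a rough illustration of the general technique named in the abstract, the sketch below trains a very small Self-Organizing Map on normalized bag-of-words vectors and assigns each document to its best-matching map cell. It is not the authors' implementation: the toy corpus, the plain term-frequency representation, the grid size, the decay schedule, and all function names (`bag_of_words`, `best_matching_unit`, `train_som`) are assumptions made only for this example.

```python
import math
import random
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map a text to a unit-length term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def best_matching_unit(weights, vec):
    """Return the grid coordinate whose weight vector is closest to vec."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], vec)),
    )


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a rows x cols map; learning rate and radius decay linearly (an assumed schedule)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for vec in vectors:
            br, bc = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                # Gaussian neighbourhood: cells near the winner move more.
                dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (vec[i] - w[i])
    return weights


# Hypothetical toy corpus, invented for this example only.
docs = [
    "stock market shares trading",
    "market shares prices stock",
    "football match goal team",
    "team goal league football",
]
vocabulary = sorted({w for d in docs for w in d.lower().split()})
vectors = [bag_of_words(d, vocabulary) for d in docs]
weights = train_som(vectors)
cells = [best_matching_unit(weights, v) for v in vectors]
for doc, cell in zip(docs, cells):
    print(cell, doc)
```

Every weight vector has the same length as the vocabulary, so shrinking the document representation reduces the cost of each best-matching-unit search and each update in direct proportion, which is one way to read the efficiency concern raised above.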

[1] Thorsten Brants, et al. Natural Language Processing in Information Retrieval, 2003, CLIN.

[2] Gerard Salton, et al. Term-Weighting Approaches in Automatic Text Retrieval, 1988, Inf. Process. Manag.

[3] Eric Wehrli, et al. L'analyse syntaxique des langues naturelles : problèmes et méthodes, 1997.

[4] Samuel Kaski, et al. Self organization of a massive document collection, 2000, IEEE Trans. Neural Networks Learn. Syst.

[5] Luis Gravano, et al. An investigation of linguistic features and clustering algorithms for topical document clustering, 2000, SIGIR '00.

[6] Shalom Lappin, et al. An Algorithm for Pronominal Anaphora Resolution, 1994, CL.

[7] Andreas Rauber, et al. The SOMLib Digital Library System, 1999, ECDL.

[8] Hinrich Schütze, et al. Ambiguity resolution in language learning, 1997.

[9] Tomek Strzalkowski. Natural Language Information Retrieval, 1995, Inf. Process. Manag.

[10] E. Keenan, et al. Noun Phrase Accessibility and Universal Grammar, 2008.

[11] Tomek Strzalkowski, et al. Natural Language Information Retrieval: TREC-8 Report, 1994, TREC.

[12] Teuvo Kohonen, et al. Self-organization and associative memory: 3rd edition, 1989.

[13] Branimir K. Boguraev, et al. Salience-based Content Characterisation of Text Documents, 1997.

[14] David Yarowsky, et al. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, 1995, ACL.

[15] Gerold Schneider, et al. Using Syntactic Analysis to Increase Efficiency in Visualizing Text Collections, 2002, COLING.

[16] Steven Abney, et al. A computational model of human parsing, 1989.

[17] D. K. Harmon, et al. Overview of the Third Text Retrieval Conference (TREC-3), 1996.

[18] David D. Lewis, et al. Reuters-21578 Text Categorization Test Collection, Distribution 1.0, 1997.