Paper information - Text categorization and similarity analysis: similarity measure, architecture and design

This research investigates which similarity measure is most appropriate for a document classification problem. The goal is a method that accurately finds both semantically related documents and documents that are versions of one another. The method must also be efficient in both speed and disk usage. Simhash is found to be the measure best suited to the application, and it can be combined with other software to increase accuracy. Pingar provides an API that extracts the entities from a document and builds a taxonomy of the relationships between them; this extra information can be used to classify input documents accurately. Two algorithms incorporating the Pingar API are designed, and finally an efficient comparison algorithm is introduced to reduce the number of comparisons required.
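To make the Simhash idea concrete, the following is a minimal Python sketch of a fingerprint and a Hamming-distance comparison. It is not the design described in this work: the word-level tokenization, the unit token weights, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the function names `simhash` and `hamming_distance` are all assumptions chosen for illustration.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed width; not taken from the paper


def simhash(text):
    """Return a 64-bit Simhash fingerprint of `text`.

    Assumptions: tokens are lowercase word runs, every token has weight 1,
    and MD5 serves only as a convenient deterministic per-token hash.
    """
    counts = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for i in range(FINGERPRINT_BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (token_hash >> i) & 1 else -1

    fingerprint = 0
    for i in range(FINGERPRINT_BITS):
        if counts[i] > 0:  # a positive vote total sets the bit
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "Simhash is used to find near-duplicate documents efficiently."
    revised = "Simhash is used to find near-duplicate documents quickly."
    print(hamming_distance(simhash(original), simhash(revised)))
```

Documents that share most of their tokens tend to produce fingerprints that differ in few bits, so a small Hamming distance can serve as a cheap near-duplicate signal. Each document is stored as a single 64-bit integer, which is what keeps disk usage low in a scheme of this kind.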

[1] Moses Charikar, et al. Similarity estimation techniques from rounding algorithms, 2002, STOC '02.

[2] Sadhan Sood, et al. Probabilistic Simhash Matching, 2012.

[3] Erik F. Tjong Kim Sang, et al. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.

[4] Johannes Knopp. Classification of Named Entities in a large multilingual resource using the Wikipedia category system, 2010.

[5] Ian H. Witten, et al. Clustering Documents with Active Learning Using Wikipedia, 2008, 2008 Eighth IEEE International Conference on Data Mining.

[6] Ian H. Witten, et al. Clustering Documents Using a Wikipedia-Based Concept Representation, 2009, PAKDD.

[7] Luis Gravano, et al. dSCAM: finding document copies across multiple databases, 1996, Fourth International Conference on Parallel and Distributed Information Systems.