A TEXT CATEGORIZATION ON SEMANTIC ANALYSIS

Computing the semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from a corpus. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of corpus-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, ESA substantially improves the correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, because it uses natural concepts, the ESA model is easy to explain to human users. The proposed model also showed higher extraction precision and recall than other approaches.
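
The vector representation and cosine comparison described above can be illustrated with a minimal sketch. It assumes a tiny hand-made concept collection (`CONCEPTS`) and plain TF-IDF weighting; these, along with every function name below, are hypothetical choices made for illustration and are not taken from the paper, which does not specify its weighting scheme here.

```python
import math
from collections import Counter

# Hypothetical toy concept collection: concept name -> descriptive text.
# A real system would derive concepts from a large corpus.
CONCEPTS = {
    "Finance": "bank money loan interest credit bank account",
    "River": "river water bank flow stream fish",
    "Computing": "computer software program code data memory",
}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def build_index(concepts):
    """Map each word to {concept: TF-IDF weight} (assumed weighting)."""
    n = len(concepts)
    tf = {name: Counter(tokenize(doc)) for name, doc in concepts.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    index = {}
    for name, counts in tf.items():
        for word, count in counts.items():
            # The +1 keeps words shared by every concept from vanishing.
            idf = math.log(n / df[word]) + 1.0
            index.setdefault(word, {})[name] = count * idf
    return index


def concept_vector(text, index):
    """Represent a text as a weighted vector over concepts."""
    vec = Counter()
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


index = build_index(CONCEPTS)
loan_text = concept_vector("money loan", index)
credit_text = concept_vector("credit interest", index)
code_text = concept_vector("software code", index)
```

In this toy setup, `cosine(loan_text, credit_text)` is higher than `cosine(loan_text, code_text)` even though the two finance texts share no words, because both map onto the same concept. That is the property the concept space is meant to provide.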
