Correlation Analysis

Document (text) classification is widely used in e-business, where it supports tasks such as document collection, analysis, categorization, and storage. Semantic analysis can improve the performance of document classification. Although earlier methods for automatic document classification have taken semantics into account, semantics deserves more attention as the number of content-rich electronic documents, forum posts, and blogs online keeps increasing; doing so can substantially reduce human workload. This paper proposes a novel semantic document classification approach that aims to resolve two semantic problems: (1) the polysemy problem, through a novel semantic similarity computing strategy (SSC), and (2) the synonym problem, through a novel strong correlation analysis method (SCM). Experiments show that these strategies can improve the performance of the baseline methods.
