Detecting Basic Level Categories by Term Weighting and Feature Entropy

With the explosive growth of a wide variety of resources in the real world, mining the structure of data has become a meaningful subject. In cognitive psychology, there is a family of categories called basic level categories, which faithfully reflect the natural categories of a corpus. These categories represent the most natural level of abstraction, neither too general nor too specific. People frequently prefer to use basic level concepts, which are abstractions of basic level categories, in their daily life. Studies in cognitive psychology indicate that basic level categories play an important role in how humans understand hierarchical structure. Existing methods can find basic level categories in text corpora but do not work on continuous datasets. This paper proposes a method that improves the similarity representation of category utility and helps find basic level categories not only in text datasets but also in continuous datasets. Our experiments demonstrate that the method performs better than mainstream models on both kinds of datasets.
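
For readers unfamiliar with category utility, the following is a minimal sketch of the classic presence-only formulation over discrete features, not the improved measure this paper proposes. It assumes that each item is a set of features (for example, terms) and that every cluster in a partition is non-empty; the function name `category_utility` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def category_utility(items, partition):
    """Classic presence-only category utility of a partition (illustrative sketch).

    items:     list of sets, each holding the discrete features of one item
    partition: list of non-empty clusters, each a list of indices into items
    """
    n = len(items)
    # Sum over features of P(f)^2, estimated from the whole dataset.
    base_counts = Counter(f for item in items for f in item)
    base_term = sum((c / n) ** 2 for c in base_counts.values())

    total = 0.0
    for cluster in partition:
        m = len(cluster)
        # Sum over features of P(f | cluster)^2.
        counts = Counter(f for i in cluster for f in items[i])
        within_term = sum((c / m) ** 2 for c in counts.values())
        # Weight the predictability gain by the cluster's probability.
        total += (m / n) * (within_term - base_term)
    # Average over the number of clusters.
    return total / len(partition)


# Toy data (hypothetical): two dogs and two birds.
items = [
    {"fur", "barks"},
    {"fur", "barks"},
    {"feathers", "sings"},
    {"feathers", "sings"},
]

too_general = [[0, 1, 2, 3]]
basic_level = [[0, 1], [2, 3]]
too_specific = [[0], [1], [2], [3]]

for name, part in [("general", too_general),
                   ("basic", basic_level),
                   ("specific", too_specific)]:
    print(name, category_utility(items, part))
```

On this toy data the two-cluster partition scores highest, while the single all-inclusive cluster and the singleton clusters both score lower, which matches the intuition that the basic level is neither too general nor too specific.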
