Text Categorization Improvement via User Interaction

In this paper, we propose an approach to improving text categorization through interaction with the user. We define the quality of a categorization in terms of the distribution of objects belonging to each class when projected onto a self-organizing map (SOM). For the experiments, we use articles and categories from a subset of Simple Wikipedia. We test three approaches to text representation: as a baseline, Bag-of-Words with Term Frequency-Inverse Document Frequency (TF-IDF) weighting, against which we evaluate two neural representations of words and documents, Word2Vec and Paragraph Vector. In each representation, we identify the subsets of features that are most useful for differentiating classes. These are presented to the user, whose selection allows us to increase the coherence of articles that belong to the same category, so that they lie close together on the SOM.
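
The feature-identification step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names (`tfidf`, `top_features_per_class`, `keep_selected`), and the scoring rule (mean in-class TF-IDF weight minus mean out-of-class weight) are assumptions chosen only to show how per-class candidate features might be ranked and then restricted to a user's selection. The SOM projection is omitted.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def top_features_per_class(vectors, labels, k=3):
    """Rank terms by mean in-class weight minus mean out-of-class weight.

    This scoring rule is an assumption made for illustration only.
    """
    vocab = sorted({t for v in vectors for t in v})

    def mean(group, term):
        if not group:
            return 0.0
        return sum(v.get(term, 0.0) for v in group) / len(group)

    result = {}
    for c in sorted(set(labels)):
        inside = [v for v, l in zip(vectors, labels) if l == c]
        outside = [v for v, l in zip(vectors, labels) if l != c]
        scores = {t: mean(inside, t) - mean(outside, t) for t in vocab}
        # Highest score first; ties broken alphabetically.
        result[c] = sorted(vocab, key=lambda t: (-scores[t], t))[:k]
    return result


def keep_selected(vectors, selected):
    """Drop every feature the (hypothetical) user did not select."""
    return [{t: w for t, w in v.items() if t in selected} for v in vectors]


# Toy corpus standing in for Simple Wikipedia articles and categories.
docs = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the planet orbits the star".split(),
    "the star is a distant sun".split(),
]
labels = ["animals", "animals", "astronomy", "astronomy"]

vectors = tfidf(docs)
candidates = top_features_per_class(vectors, labels, k=2)

# Simulate a user accepting every candidate offered to them.
selected = {t for terms in candidates.values() for t in terms}
reduced = keep_selected(vectors, selected)
```

In this sketch the word "the" occurs in every document, so its inverse document frequency is zero and it is never offered as a candidate. In an interactive setting, `selected` would come from the user's choices rather than from accepting all candidates.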
