Text Matching and Categorization: Mining Implicit Semantic Knowledge from Tree-Shape Structures

The diversity of large-scale semistructured data makes the extraction of implicit semantic information enormously difficult. This paper proposes an automatic, unsupervised method of text categorization in which tree-shaped structures represent semantic knowledge and implicit information is uncovered by mining hidden structures, without cumbersome lexical analysis. Mining implicit frequent structures in trees can discover both direct and indirect semantic relations, which substantially improves the accuracy of matching and classifying texts. Experimental results show that the proposed algorithm markedly reduces the time and effort spent on training and classification, and that it outperforms established competitors in correctness and effectiveness.
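To make the idea of mining frequent structures concrete, the following is a minimal sketch and not the paper's algorithm. It assumes each text has already been turned into a labeled tree, written here as a `(label, children)` tuple, and it uses ancestor-descendant label pairs as a simple stand-in for the "hidden structures": a parent-child pair is a direct relation, and any longer ancestor-descendant pair is an indirect one. The function names, the `min_support` threshold, and the toy data are all hypothetical.

```python
from collections import Counter

# Assumption: a tree is a (label, [child trees]) tuple.

def relation_pairs(tree, ancestors=()):
    """Return the set of (ancestor, descendant) label pairs in a tree.

    Parent-child pairs are direct relations; pairs that skip levels
    are indirect relations.
    """
    label, children = tree
    pairs = {(a, label) for a in ancestors}
    for child in children:
        pairs |= relation_pairs(child, ancestors + (label,))
    return pairs

def frequent_pairs(trees, min_support):
    """Keep the pairs that occur in at least min_support trees."""
    counts = Counter()
    for t in trees:
        counts.update(relation_pairs(t))  # each tree counts a pair once
    return {p for p, c in counts.items() if c >= min_support}

def categorize(tree, patterns_by_category):
    """Pick the category whose frequent pairs overlap the tree the most."""
    pairs = relation_pairs(tree)
    return max(patterns_by_category,
               key=lambda c: len(pairs & patterns_by_category[c]))

# Toy, made-up training trees for two categories.
sports = [
    ("match", [("team", [("player", [])]), ("score", [])]),
    ("match", [("team", [("coach", [])]), ("score", [])]),
]
finance = [
    ("market", [("stock", [("price", [])])]),
    ("market", [("stock", [("price", [])]), ("bond", [])]),
]
patterns = {
    "sports": frequent_pairs(sports, min_support=2),
    "finance": frequent_pairs(finance, min_support=2),
}

# In this tree "score" is only an indirect descendant of "match".
new_doc = ("match", [("team", [("player", [("score", [])])])])
result = categorize(new_doc, patterns)
print(result)
```

In this toy setup the new tree matches the `("match", "score")` pattern even though `score` sits three levels below `match`, which illustrates why indirect relations can help matching when the surface structure of two texts differs.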
