Lexical-semantic SLVM for XML Document Classification

The structured link vector model (SLVM) and its improved version rely on statistical term measures to represent XML documents. As a result, they ignore the lexical semantics of terms and their mutual information, which leads to text classification errors. This paper proposes an XML document representation method, the WordNet-based lexical-semantic SLVM, to address this problem. Using WordNet, the method constructs a data structure that characterizes the lexical-semantic content of an XML document and adapts EM modeling to disambiguate word stems. A synset matrix of lexical-semantic content is then built in the lexical-semantic feature space to represent the XML document, and lexical-semantic relations are marked on it to construct the feature matrix of the lexical-semantic SLVM. On a categorized Wikipedia XML dataset, using the NWKNN classification algorithm, the experimental results show that the feature matrix of our method achieves a higher F1 measure than the original SLVM and the frequent sub-tree SLVM based on TF-IDF.
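To make the idea of a synset matrix concrete, the following is a minimal sketch of a synset-by-element count matrix for a single XML document. It is an illustration under stated assumptions, not the construction used in the paper: `TOY_LEXICON` is a hypothetical stand-in for WordNet that maps each word to exactly one synset identifier, so the EM-based disambiguation and the marking of lexical-semantic relations are both omitted, and the function name `synset_element_matrix` is invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy lexicon standing in for WordNet: word -> one synset id.
# A real system would look up candidate synsets and disambiguate among them.
TOY_LEXICON = {
    "car": "vehicle.n.01",
    "automobile": "vehicle.n.01",
    "bank": "bank.n.01",
    "river": "river.n.01",
}


def synset_element_matrix(xml_text, lexicon):
    """Count synset occurrences per structural element (tag) of one document.

    Returns (synsets, elements, matrix), where matrix[i][j] is the number of
    times synsets[i] occurs in the direct text of elements tagged elements[j].
    """
    root = ET.fromstring(xml_text)
    counts = defaultdict(int)
    for elem in root.iter():
        # Only the text directly inside the element is used (tail text is ignored).
        for word in (elem.text or "").lower().split():
            synset = lexicon.get(word)
            if synset is not None:
                counts[(synset, elem.tag)] += 1
    synsets = sorted({s for s, _ in counts})
    elements = sorted({e for _, e in counts})
    matrix = [[counts.get((s, e), 0) for e in elements] for s in synsets]
    return synsets, elements, matrix


doc = "<article><title>car bank</title><body>automobile car river</body></article>"
synsets, elements, matrix = synset_element_matrix(doc, TOY_LEXICON)
```

In this toy document, "car" and "automobile" fall into the same row because they share a synset, which is the kind of lexical overlap that a purely term-based representation would miss.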
