Semantic Scoring Based on Small-World Phenomenon for Feature Selection in Text Mining

This paper proposes an effective scoring scheme for feature selection in text mining that exploits characteristics of the small-world phenomenon in the semantic networks of documents. Our focus is on preserving both the syntactic and the statistical information of words, rather than relying solely on the simple frequency summaries used by prevailing scoring schemes such as TF-IDF. Experimental results on a TREC dataset show that our scoring scheme outperforms the prevailing schemes.
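As a rough illustration of what "small-world characteristics" means for a word network, the sketch below computes the two quantities usually associated with the term: the clustering coefficient and the average shortest-path length. It is not the scoring scheme proposed in the paper, which the abstract does not specify. The sentence-level co-occurrence rule and the function names (cooccurrence_graph, clustering_coefficient, average_path_length) are assumptions made for the example.

```python
from collections import deque
from itertools import combinations


def cooccurrence_graph(sentences):
    """Assumed construction: link two words if they share a sentence."""
    graph = {}
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            graph.setdefault(word, set())
        for a, b in combinations(words, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clustering_coefficient(graph):
    """Mean fraction of each node's neighbour pairs that are linked."""
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += links / (k * (k - 1) / 2)
    return total / len(graph) if graph else 0.0


def average_path_length(graph):
    """Mean shortest-path length over all connected node pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    g = cooccurrence_graph(["a b c", "c d"])
    print(clustering_coefficient(g), average_path_length(g))
```

A network is commonly called small-world when its clustering coefficient is much higher than that of a random graph of the same size while its average path length stays comparably short. How these quantities are turned into a per-word score is left to the paper itself.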
