Suffix Tree Clustering with Named Entity Recognition

News search faces the challenge of presenting web users with clear and readable lists of news reports. This paper proposes Suffix Tree Clustering with Named Entity Recognition (STC-NER), a method for clustering the news search results returned by a search engine. STC-NER takes the snippets returned with the search results and derives patterned information from them by means of named entity recognition. This substantially reduces both storage requirements and time complexity. Experiments show that STC-NER achieves better precision and efficiency than traditional Suffix Tree Clustering (STC).
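The general idea of clustering snippets by the named entities they share can be illustrated with a minimal Python sketch. It is not the paper's implementation, and every name in it is hypothetical. It rests on three assumptions: a regular expression over capitalized words stands in for a real named entity recognizer, a dictionary of entity n-grams stands in for the suffix tree, and base clusters are merged when they share more than half of their documents, the overlap rule commonly associated with STC.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: a toy stand-in for a real NER model. Each run of
# capitalized words is treated as one entity.
ENTITY_RE = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def extract_entities(snippet):
    """Return the entity strings found in a snippet, in order."""
    return ENTITY_RE.findall(snippet)


def base_clusters(snippets, max_len=3):
    """Map each entity phrase to the snippets containing it.

    Assumption: enumerating entity n-grams replaces the suffix tree.
    Only phrases shared by at least two snippets are kept.
    """
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        entities = extract_entities(snippet)
        for start in range(len(entities)):
            stop = min(start + max_len, len(entities))
            for end in range(start + 1, stop + 1):
                phrase_docs[tuple(entities[start:end])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= 2}


def merge_clusters(bases, threshold=0.5):
    """Merge base clusters whose document sets overlap heavily."""
    parent = {p: p for p in bases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(list(bases), 2):
        shared = len(bases[a] & bases[b])
        if shared > threshold * len(bases[a]) and shared > threshold * len(bases[b]):
            parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in bases:
        merged[find(p)] |= bases[p]
    return sorted(sorted(docs) for docs in merged.values())


snippets = [
    "Apple unveils new phone in California",
    "Apple shares rise after California event",
    "Storm hits Florida coast",
    "Florida braces as Storm nears",
]
print(merge_clusters(base_clusters(snippets)))
```

Because only entity sequences are indexed rather than every word of every snippet, the phrase index stays small, which is one plausible reading of the storage and time savings the abstract reports.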
