Search Result Clustering Using Informatively Named Entities

Clustering search results helps users review the information they have gathered. In this article, we treat the clustering task as indexing the search results, where an index is a structured list of labels that makes both the labels and the underlying results easier to comprehend. To this end, we make three proposals. The first is to use named entity extraction for term extraction. The second is a new label-selection criterion based on a term's importance within the search results and its relation to the search query. The third is label categorization using the category information that named entity extraction assigns to each label. We implement a prototype system based on these proposals and find that it performs much better than existing methods. We focus on news articles here, but the system is not topic specific.
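
As a rough illustration of how the three proposals fit together, the following Python sketch builds a small categorized label index. It is not the scoring used in the article: the function name build_index, the category names, and the score (the fraction of results containing a label, multiplied by the fraction of those results that also mention a query term) are all assumptions made for this example, and the named entities are assumed to have been extracted beforehand.

```python
from collections import defaultdict


def build_index(results, query_terms, top_k=2):
    """Group named-entity labels by category, keeping the top_k per category.

    results: list of dicts with "text" and "entities", where "entities" is a
    list of (surface, category) pairs produced by some NER step (assumed).
    The score below is a hypothetical stand-in for a label-selection criterion.
    """
    query_terms = {t.lower() for t in query_terms}
    doc_freq = defaultdict(int)    # number of results containing the label
    query_cooc = defaultdict(int)  # of those, how many mention a query term
    category_of = {}

    for doc in results:
        words = set(doc["text"].lower().split())
        mentions_query = bool(words & query_terms)
        for surface, category in set(doc["entities"]):
            doc_freq[surface] += 1
            category_of[surface] = category
            if mentions_query:
                query_cooc[surface] += 1

    n = len(results)
    scores = {}
    for surface, df in doc_freq.items():
        importance = df / n                      # weight within the result set
        relatedness = query_cooc[surface] / df   # association with the query
        scores[surface] = importance * relatedness

    # Highest score first; ties broken alphabetically for a stable order.
    index = defaultdict(list)
    for surface, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        labels = index[category_of[surface]]
        if len(labels) < top_k:
            labels.append((surface, round(score, 3)))
    return dict(index)


if __name__ == "__main__":
    sample = [
        {"text": "Toyota opens plant in Texas",
         "entities": [("Toyota", "ORGANIZATION"), ("Texas", "LOCATION")]},
        {"text": "Toyota and Honda report earnings",
         "entities": [("Toyota", "ORGANIZATION"), ("Honda", "ORGANIZATION")]},
        {"text": "Honda recalls cars in Texas",
         "entities": [("Honda", "ORGANIZATION"), ("Texas", "LOCATION")]},
    ]
    for category, labels in build_index(sample, ["toyota"]).items():
        print(category, labels)
```

Grouping the selected labels under their entity categories is what turns a flat label list into the structured index described above; any real system would replace the toy score with a criterion tuned on actual search results.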
