Grouping Web Pages about Persons and Organizations for Information Extraction

Information extraction on the Web lets users retrieve specific information about a person or an organization. Because names are not unique, the same name may refer to multiple entities. This paper describes an algorithm that clusters the Web pages returned by search engines so that pages belonging to different entities fall into different groups. The algorithm uses named entities as features to divide the document set into direct and indirect pages. It then uses distinct direct pages as cluster seeds and groups the indirect pages around them. The algorithm has been found to be effective for Web-based applications.
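The two-stage idea (split pages into direct and indirect, then grow clusters from distinct direct pages) can be illustrated with a minimal sketch. The abstract does not give the criteria the paper uses, so everything specific here is an assumption made for illustration: a page counts as "direct" when the queried name makes up a large share of its named-entity mentions, two direct pages are treated as the same entity when their remaining named entities overlap strongly (Jaccard similarity), and each indirect page joins the seed it overlaps with most. The function names and the thresholds `direct_ratio` and `same_entity` are hypothetical, and named entities are assumed to be already extracted.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard overlap between two sets of named entities."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_pages(pages, target, direct_ratio=0.5):
    """Split page ids into (direct, indirect).

    Assumption: a page is "direct" when the target name accounts for at
    least `direct_ratio` of its named-entity mentions.
    """
    direct, indirect = [], []
    for pid, mentions in pages.items():
        share = Counter(mentions)[target] / len(mentions) if mentions else 0.0
        (direct if share >= direct_ratio else indirect).append(pid)
    return direct, indirect


def cluster(pages, target, direct_ratio=0.5, same_entity=0.3):
    """Group pages about `target`; the first id in each cluster is its seed."""
    direct, indirect = split_pages(pages, target, direct_ratio)
    # Features: the named entities on a page other than the target name.
    feats = {pid: set(m) - {target} for pid, m in pages.items()}

    clusters = []
    # Stage 1: keep only distinct direct pages as seeds (assumed criterion).
    for pid in direct:
        for c in clusters:
            if jaccard(feats[pid], feats[c[0]]) >= same_entity:
                c.append(pid)
                break
        else:
            clusters.append([pid])

    # Stage 2: attach each indirect page to its most similar seed.
    for pid in indirect:
        if not clusters:
            clusters.append([pid])
            continue
        best = max(clusters, key=lambda c: jaccard(feats[pid], feats[c[0]]))
        best.append(pid)
    return clusters


# Hypothetical input: page id -> named-entity mentions found on the page.
pages = {
    "a": ["John Smith", "John Smith", "MIT"],
    "b": ["John Smith", "John Smith", "Acme Corp"],
    "c": ["John Smith", "MIT", "Boston", "Physics"],
    "d": ["John Smith", "Acme Corp", "Sales", "Chicago"],
}
result = cluster(pages, "John Smith")
print(result)
```

In this toy input, pages "a" and "b" are dominated by the queried name and share no other entities, so they become two seeds; "c" and "d" are then attached to the seed whose entities they share. A real system would need a named-entity recognizer upstream and a policy for indirect pages that match no seed, neither of which is specified in the abstract.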
