Paper Information - Advanced Information Extraction with n-gram based LSI

Documents are being created at an ever-increasing pace, yet most of them cover already known topics and only a few introduce new concepts. This has opened a new era in information retrieval (IR) with its own requirements: digging into topics and concepts to uncover subtopics and the relations between topics. Until now, IR research has focused on retrieving documents about a general topic or clustering documents under generic subjects. These conventional approaches cannot go deep into the content of documents, which makes it difficult for people to reach the documents they are looking for. New ways of mining document sets are therefore needed, where the critical point is to know much more about the contents of the documents. As a solution, we propose to enhance LSI, one of the proven IR techniques, by supporting its vector space with n-gram forms of words. The positive results we obtained are shown in two application areas of the IR domain: querying a document database and clustering the documents in that database.

Keywords: Document clustering, Information Extraction, Information Retrieval, LSI, n-gram.
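The idea of supporting the LSI vector space with n-gram forms of words can be illustrated with a minimal sketch. The abstract does not specify how the n-grams are built or weighted, so everything below is an assumption made for illustration: character trigrams taken inside each word, raw counts as weights, a truncated SVD computed with NumPy, and hypothetical helper names (`char_ngrams`, `build_matrix`, `lsi_fit`, `fold_in`, `cosine`). It is not the authors' implementation.

```python
import numpy as np

def char_ngrams(text, n=3):
    """Return character n-grams of each word (assumed n-gram form)."""
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)  # short words are kept whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

def build_matrix(docs, n=3):
    """Build an n-gram-by-document count matrix and the n-gram index."""
    vocab = sorted({g for d in docs for g in char_ngrams(d, n)})
    index = {g: i for i, g in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for g in char_ngrams(d, n):
            A[index[g], j] += 1.0
    return A, index

def lsi_fit(A, k):
    """Rank-k LSI: keep the k largest singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Rows of the last return value are document coordinates.
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(query, index, U_k, s_k, n=3):
    """Project a query into the latent space (q^T U_k S_k^-1).

    Assumes the k retained singular values are nonzero.
    """
    q = np.zeros(U_k.shape[0])
    for g in char_ngrams(query, n):
        if g in index:  # n-grams unseen in the corpus are ignored
            q[index[g]] += 1.0
    return (q @ U_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy corpus, invented for this sketch.
docs = ["clustering documents", "document clusters", "matrix vector spaces"]
A, index = build_matrix(docs)
U_k, s_k, doc_coords = lsi_fit(A, k=2)

# The query shares no whole word with the first document,
# but it shares many trigrams with the first two documents.
q_coords = fold_in("clustered document", index, U_k, s_k)
sims = [cosine(q_coords, d) for d in doc_coords]
```

In this toy setting the query overlaps with the first two documents through shared trigrams such as "clu" and "doc", even where the whole-word forms differ. That is the kind of morphological variation n-gram features are meant to absorb. The same document coordinates could also be passed to a clustering routine, the second application area named in the abstract.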
