Paper Information - On the Ranking of Text Documents from Large Corpuses

Ranking text documents by their relevance to a topic is of great importance in information retrieval. However, given the growing avalanche of digital documents, the size of the collection pool from which these documents are drawn makes this task more challenging. In addition, current computing infrastructure is unable to deal with very large corpuses directly. Thus, new algorithms are needed that seek parallel solutions and use more processing power to solve this problem. In this paper we propose a new algorithm that partitions a large collection of documents (a corpus) into smaller corpuses, each of which can be handled by a single processor for the purpose of ranking. These multiple rankings are then merged to provide a unified listing of all selected documents from the original large corpus.
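The partition, rank, and merge flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the ranking model or the merging strategy, so the round-robin partitioning, the query-term-count score, and the names `partition`, `rank_subcorpus`, `merge_rankings`, and `rank_corpus` are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def partition(docs, n_parts):
    """Split a list of (doc_id, text) pairs into n_parts sub-corpora (round-robin)."""
    return [docs[i::n_parts] for i in range(n_parts)]


def rank_subcorpus(subcorpus, query_terms):
    """Rank one sub-corpus independently of the others.

    Assumption: the score is the number of query-term occurrences in the
    document. This is a placeholder for whatever ranking model is really used.
    """
    scored = []
    for doc_id, text in subcorpus:
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        scored.append((score, doc_id))
    # Highest score first; ties broken by doc_id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


def merge_rankings(rankings, top_k):
    """Merge already-sorted per-partition rankings into one unified list."""
    merged = heapq.merge(*rankings, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _score, doc_id in merged][:top_k]


def rank_corpus(docs, query_terms, n_parts=2, top_k=3):
    parts = partition(docs, n_parts)
    # Each call below is independent, so in a parallel setting every
    # sub-corpus could be handed to its own processor.
    rankings = [rank_subcorpus(part, query_terms) for part in parts]
    return merge_rankings(rankings, top_k)
```

The merge step here is only valid because raw term counts are comparable across partitions. Scores that depend on corpus-wide statistics would generally need normalization or a dedicated merging strategy before the per-partition rankings could be combined.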

Houssain Kettani | Gregory B. Newby
