An Index Clustering and Mapping Algorithm for Large Scale Astronomical Data Searching

For large-scale collections of unstructured astronomical data documents, a simple indexing method often leads to high communication cost and slow query processing. Based on the characteristics of domain-specific astronomical data and on a quantitative analysis of query traces, a formula for calculating the similarity between query terms is provided. An index clustering algorithm is designed to generate many small clusters, each with high term association and a small actual index size, so that each cluster can be stored on a node as a whole. To maintain high query locality and reasonable load balancing, a practical index mapping algorithm is proposed to map the logical index clusters onto physical nodes. Simulation results show that the proposed algorithms scale well for large-scale astronomical data index systems. Compared with other methods, queries can be routed to a smaller number of nodes, so the communication cost among nodes is reduced significantly and search efficiency is improved.
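
The abstract does not state the similarity formula or the details of either algorithm, so the following Python sketch only illustrates the general pipeline under stated assumptions. It assumes that term similarity is the Jaccard co-occurrence of terms in a query log, that clustering greedily merges the most similar terms under a size cap, and that mapping places whole clusters on the least-loaded node. The names `term_similarity`, `cluster_terms`, `map_clusters`, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def term_similarity(queries):
    """Jaccard co-occurrence of term pairs over a query log.
    ASSUMPTION: the paper's own similarity formula is not given here."""
    term_queries = defaultdict(set)
    for qid, terms in enumerate(queries):
        for t in set(terms):
            term_queries[t].add(qid)
    sim = {}
    for a, b in combinations(sorted(term_queries), 2):
        both = len(term_queries[a] & term_queries[b])
        if both:
            sim[(a, b)] = both / len(term_queries[a] | term_queries[b])
    return sim


def cluster_terms(sim, index_size, max_cluster_size):
    """Greedily merge the most similar term pairs while the merged
    cluster's total index size stays within max_cluster_size.
    Every term appearing in sim must also appear in index_size."""
    cluster_of = {t: t for t in index_size}   # term -> cluster id
    members = {t: {t} for t in index_size}    # cluster id -> terms
    size = dict(index_size)                   # cluster id -> total size
    for (a, b), _ in sorted(sim.items(), key=lambda kv: (-kv[1], kv[0])):
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb or size[ca] + size[cb] > max_cluster_size:
            continue
        for t in members[cb]:
            cluster_of[t] = ca
        members[ca] |= members.pop(cb)
        size[ca] += size.pop(cb)
    return [(frozenset(members[c]), size[c]) for c in members]


def map_clusters(clusters, num_nodes):
    """Place each whole cluster on the currently least-loaded node,
    largest clusters first (a generic greedy heuristic, assumed here)."""
    load = [0] * num_nodes
    placement = {}
    for terms, sz in sorted(clusters, key=lambda c: (-c[1], sorted(c[0]))):
        node = min(range(num_nodes), key=lambda n: load[n])
        load[node] += sz
        for t in terms:
            placement[t] = node
    return placement, load


# Hypothetical query log and per-term index sizes.
queries = [
    ["galaxy", "redshift"],
    ["galaxy", "redshift", "quasar"],
    ["pulsar", "period"],
    ["pulsar", "period"],
    ["quasar"],
]
index_size = {"galaxy": 40, "redshift": 30, "quasar": 50,
              "pulsar": 20, "period": 10}

sim = term_similarity(queries)
clusters = cluster_terms(sim, index_size, max_cluster_size=80)
placement, load = map_clusters(clusters, num_nodes=2)
print(placement, load)
```

In this toy run, terms that always co-occur in queries land in the same cluster and therefore on the same node, so a query such as "galaxy redshift" touches one node only. The size cap is what keeps clusters small enough to balance load across nodes.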
