Advanced Metasearch Engine Technology

Among the search tools currently on the Web, search engines are the best known, thanks to the popularity of major search engines such as Google and Yahoo!. While extremely successful, these major search engines have serious limitations. This book introduces large-scale metasearch engine technology, which has the potential to overcome those limitations. Essentially, a metasearch engine is a search system that supports unified access to multiple existing search engines: it passes each query it receives to its component search engines and aggregates the returned results into a single ranked list. A large-scale metasearch engine has thousands or more component search engines. While metasearch engines were initially motivated by their ability to combine the search coverage of multiple search engines, they offer other benefits as well, such as the potential to obtain better and fresher results and to reach the Deep Web. The following major components of large-scale metasearch engines are discussed in detail in this book: search engine selection, search engine incorporation, and result merging. Highly scalable and automated solutions for these components are emphasized. The authors make a strong case for the viability of large-scale metasearch engine technology as a competitive technology for Web search. Table of Contents: Introduction / Metasearch Engine Architecture / Search Engine Selection / Search Engine Incorporation / Result Merging / Summary and Future Research
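The dispatch-and-aggregate loop described above can be illustrated with a minimal sketch. It is not the book's method: the Borda-style scoring used for merging is only one simple choice assumed here for illustration, and the names `metasearch`, `engine_a`, and `engine_b`, as well as the example URLs, are hypothetical stand-ins for real component search engines.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Assumption: a component search engine is modeled as a plain function
# that maps a query string to a ranked list of result URLs.
SearchEngine = Callable[[str], List[str]]


def metasearch(query: str, engines: Dict[str, SearchEngine], top_k: int = 10) -> List[str]:
    """Pass the query to every component engine and merge the ranked lists."""
    scores: Dict[str, float] = defaultdict(float)
    for engine in engines.values():
        results = engine(query)
        n = len(results)
        for rank, url in enumerate(results):
            # Borda-style points (an illustrative choice): the top result of a
            # list of length n earns n points, the last result earns 1.
            scores[url] += n - rank
    # Highest total score first; ties are broken by URL so output is deterministic.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return ranked[:top_k]


# Hypothetical component engines returning fixed toy results.
def engine_a(query: str) -> List[str]:
    return ["a.example/1", "shared.example/x", "a.example/2"]


def engine_b(query: str) -> List[str]:
    return ["shared.example/x", "b.example/1"]


merged = metasearch("metasearch engine", {"A": engine_a, "B": engine_b})
print(merged)
```

In this toy run the result returned by both engines rises to the top of the merged list, which reflects the coverage-combining motivation mentioned in the abstract. A large-scale system would also need search engine selection before dispatch, which this sketch omits.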
