Design and Implementation of a Content-Based Search Engine

This thesis presents the design and implementation of a system for finding interesting textual content among tens of millions of documents. The system rests on a novel content-based ranking method and a simple, structured query interface. The ranking method lets the user draw on the full co-occurrence matrix of all words in the corpus to bring out relevant material. The user may explicitly define her conception of relevance by guiding the ranking with single words. The basic formulation of the content-based ranking method is computationally rather expensive, so an efficient algorithm is also given. The index structures of the system have been specifically designed to support the ranking scheme. The system can be distributed over a cluster of servers, allowing reasonable scalability. We present three real-world deployments of the system. The largest of these was a publicly available Web search engine, Aino, which covered over four million pages in the .FI domain.
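To make the idea of guiding a ranking with a single word more concrete, the following is a minimal sketch under stated assumptions, not the ranking method of the thesis. It assumes document-level co-occurrence counts and scores each document by the average co-occurrence of its words with one cue word. The names `build_cooccurrence` and `rank`, the scoring formula, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Count, for every pair of distinct words, how many documents contain both."""
    cooc = defaultdict(Counter)
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def rank(docs, cooc, cue):
    """Return document indices, best first, for a single cue word.

    Assumed score: average co-occurrence count between the cue and the
    document's words. This is an illustrative stand-in only.
    """
    related = cooc.get(cue, Counter())  # a Counter yields 0 for unseen words
    scored = []
    for i, tokens in enumerate(docs):
        score = sum(related[t] for t in tokens) / len(tokens) if tokens else 0.0
        scored.append((score, i))
    # Sort by descending score, breaking ties by document index.
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical toy corpus of tokenized documents.
docs = [
    ["tax", "form"],
    ["lake", "boat", "fish"],
    ["sauna", "lake", "wood"],
    ["sauna", "lake"],
]
cooc = build_cooccurrence(docs)
order = rank(docs, cooc, "sauna")
```

In this toy corpus, document 1 never mentions "sauna" but still ranks above document 0, because it shares the word "lake" with documents that do. A naive computation like this touches every document for every query, which hints at why the basic formulation is expensive and why dedicated index structures are needed.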
