Analysis and optimization of question answering systems

Question Answering (QA) systems are applications that search for information in large collections of documents and are enhanced with natural language processing capabilities. This additional processing adds a significant cost to the system, so techniques that improve its time performance are needed. This thesis focuses on improving the performance of QA systems, either on a single computer or in distributed systems, by proposing new generic techniques for implementing in-memory caches that take advantage of previous computations.

The first contribution of this thesis is the proposal of multilayer caches. This type of cache divides the available memory of one computer into different parts, depending on the distribution of the queries and the computational cost of each block. Our results show that using multiple layers for QA provides a speedup of up to 1.25 over systems with a single layer.

Our second contribution is the proposal of the Evolutive Summary Counters (ESC) data structure. The ESC store the access frequency of documents in each node of a distributed system during a recent window of time. The ESC are implemented as a list of counting Bloom filters, which make it very efficient to build summaries (ESC-summaries) that can be broadcast among the nodes of the distributed system in order to provide a global view of the system.

The third main contribution is the proposal of an algorithm that places the data in the cooperative cache, which we call ESC-placement.
This strategy distributes the contents of the cache according to the ESC-summaries, taking into account the following objectives: keep a local copy of very frequently accessed documents in the cache of the node that is accessing them, keep a copy of the frequent documents in some node of the distributed system, and prevent the redistribution of the contents from generating excessive message chaining.

Our fourth contribution is the proposal of ESC-search, an algorithm that decides whether a document is available in the network and in which node. Since the higher the access frequency, the more probable it is that the document is available in the cache of a computer, ESC-search estimates the probability that a document is available for a given access frequency indicated by the ESC-summaries. ESC-search updates this probability according to previous searches: it increases the probability if the document is found in the remote cache and decreases it otherwise. Using these probabilities, ESC-search decides dynamically which nodes are most likely to store a certain cache content and how many nodes to query, in order to reduce the risk that a document available in the cooperative cache is not retrieved because the wrong nodes are queried. Our results show that, for typical search engine distributions, ESC-search retrieves 98% of the cache contents while querying only 14% of the available nodes.

Finally, we propose two techniques that balance the load of a distributed system with a cooperative cache. Caches reduce the cost of computing a query, but this may imbalance the overall system load if the hit rate is large. Probability cost is a cost-based strategy that takes the probabilities obtained by ESC-search and re-estimates the cost of a query considering the probability that the data is already cached. The second technique, Affinity, adds a new term that measures the similarity between the cache contents of a node and the query. Affinity then replicates the documents that are frequent in the cooperative cache, to improve data access locality and reduce the overall computation cost.

The complete analysis of our techniques shows a performance of over 6 q/s (queries per second) with 16 computers. We obtain a superlinear speedup and a two-order-of-magnitude improvement with respect to the original system on one computer, which was able to answer 0.05 q/s.
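
The abstract describes the ESC as a list of counting Bloom filters covering a recent window of time. The following is a minimal sketch of that idea, not the thesis implementation: the class names, the SHA-256 based hashing, the number of blocks, and the rule of dropping the oldest block on rotation are all assumptions made for illustration.

```python
import hashlib
from collections import deque


class CountingBloomFilter:
    """Approximate per-key counters: may over-count, never under-counts."""

    def __init__(self, num_counters=1024, num_hashes=3):
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _slots(self, key):
        # Derive num_hashes slot indices from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def add(self, key):
        for slot in self._slots(key):
            self.counters[slot] += 1

    def estimate(self, key):
        return min(self.counters[slot] for slot in self._slots(key))


class EvolutiveSummaryCounters:
    """Hypothetical ESC: one counting Bloom filter per time block."""

    def __init__(self, num_blocks=4, num_counters=1024, num_hashes=3):
        self._params = (num_counters, num_hashes)
        # A bounded deque keeps only the most recent num_blocks filters.
        self.blocks = deque([CountingBloomFilter(*self._params)],
                            maxlen=num_blocks)

    def record_access(self, doc_id):
        # Accesses are counted in the newest block.
        self.blocks[-1].add(doc_id)

    def rotate(self):
        # Start a new time block; a full deque discards its oldest block,
        # so old accesses fall out of the window (assumed ageing rule).
        self.blocks.append(CountingBloomFilter(*self._params))

    def frequency(self, doc_id):
        # Estimated accesses to doc_id within the current window.
        return sum(block.estimate(doc_id) for block in self.blocks)

    def summary(self):
        # Compact summary to send to other nodes: the element-wise sum
        # of the counters of all blocks in the window.
        return [sum(column)
                for column in zip(*(b.counters for b in self.blocks))]
```

A node would call `record_access` on every document access, call `rotate` at the end of each time block, and periodically send `summary()` to its peers. Because every block uses the same hash functions, summaries from different blocks or nodes can be added counter by counter.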

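The feedback loop of ESC-search (raise the availability probability after a remote hit, lower it after a miss, then query only the most promising nodes) can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm from the thesis: the logarithmic frequency buckets, the exponential moving average update, the independence assumption between nodes, and the names `AvailabilityEstimator` and `choose_nodes` are all hypothetical.

```python
class AvailabilityEstimator:
    """Per-frequency-bucket estimate that a document is cached remotely."""

    def __init__(self, num_buckets=8, initial=0.5, rate=0.1):
        self.prob = [initial] * num_buckets
        self.rate = rate  # assumed learning rate of the moving average

    def _bucket(self, frequency):
        # Logarithmic buckets (assumption): 0, 1, 2-3, 4-7, ...
        return min(int(frequency).bit_length(), len(self.prob) - 1)

    def probability(self, frequency):
        return self.prob[self._bucket(frequency)]

    def update(self, frequency, found):
        # Move the estimate towards 1 after a hit and towards 0 after a miss.
        bucket = self._bucket(frequency)
        target = 1.0 if found else 0.0
        self.prob[bucket] += self.rate * (target - self.prob[bucket])


def choose_nodes(estimator, node_frequencies, max_miss_risk=0.05):
    """Pick nodes to query, most probable first.

    node_frequencies maps a node name to the access frequency that the
    node's summary reports for the wanted document. Nodes are added until
    the chance that none of them holds the document drops to
    max_miss_risk, treating nodes as independent (an assumption).
    """
    ranked = sorted(node_frequencies.items(),
                    key=lambda item: estimator.probability(item[1]),
                    reverse=True)
    chosen, miss_risk = [], 1.0
    for node, frequency in ranked:
        p = estimator.probability(frequency)
        if p <= 0.0:
            break  # remaining nodes cannot lower the risk
        chosen.append(node)
        miss_risk *= 1.0 - p
        if miss_risk <= max_miss_risk:
            break
    return chosen
```

After each remote lookup the caller would report the outcome with `update`, so the estimates track how reliably a given summary frequency predicts a cache hit. A lower `max_miss_risk` makes the search query more nodes, trading messages for recall.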