Community-based question answer detection

Each day, millions of people ask questions and search for answers on the World Wide Web. As a result, the Internet has grown into a worldwide database of questions and answers, accessible to almost everyone. Because this database is so large, it is hard to find out whether a question has been answered or even asked before. As a consequence, users ask the same questions again and again, producing a vicious circle of new content that hides the important information. Web forums, also known as discussion boards, are one platform for questions and answers. They present discussions as item streams in which each item contains the contribution of one author. These contributions contain questions and answers in human-readable form. People use search engines to search for information on such platforms. However, current search engines are optimized neither to highlight individual questions and answers nor to show which questions are asked often and which ones are already answered. To close this gap, this thesis introduces the Effingo system. The Effingo system is intended to extract forums from around the Web and find question and answer items. It also needs to link equivalent questions and aggregate the associated answers. That way it is possible to find out whether a question has been asked before and whether it has already been answered. Based on this information, it is possible to derive the most urgent questions from the system and to determine which ones are new and which ones are discussed and answered frequently. As a result, users are prevented from creating redundant discussions, which reduces the server load and the information overload for further searches.
The first research area explored by this thesis is forum data extraction. The results from this area are intended to be used to create as large a database of forum posts as possible. Furthermore, the thesis uses question-answer detection to find out which forum items are questions and which ones are answers and, finally, topic detection to aggregate questions on the same topic and to discover duplicate answers. These areas are either extended by Effingo, using forum-specific features such as the user graph, forum item relations and forum link structure, or adapted to cope with the specific problems created by user-generated content. Such problems arise from poorly written and very short texts as well as from hidden or distributed information.
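
To make the idea of linking equivalent questions concrete, the following is a minimal sketch and not the Effingo method itself. It assumes that two questions can be treated as duplicates when the Jaccard similarity of their word sets reaches a threshold. The function names (`tokens`, `jaccard`, `group_questions`) and the threshold value 0.6 are hypothetical choices made for illustration only.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric word tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # assumption: two empty texts are not treated as duplicates
    return len(ta & tb) / len(ta | tb)


def group_questions(questions, threshold=0.6):
    """Greedily group questions whose similarity to a group's first
    member is at least threshold (an illustrative value, not tuned)."""
    groups = []
    for q in questions:
        for g in groups:
            if jaccard(q, g[0]) >= threshold:
                g.append(q)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([q])
    return groups


if __name__ == "__main__":
    sample = [
        "How do I reset my router password?",
        "how do i reset my router password",
        "What is the best laptop for students?",
    ]
    for group in group_questions(sample):
        print(group)
```

A purely lexical measure like this would likely struggle with the short and poorly written texts mentioned above, which is one reason to combine it with forum-specific features such as the user graph or item relations.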
