Paper information - Community based question answer detection


Each day, millions of people ask questions and search for answers on the World Wide Web. As a result, the Internet has grown into a worldwide database of questions and answers, accessible to almost everyone. Because this database is so large, it is hard to find out whether a question has been answered or even asked before. As a consequence, users ask the same questions again and again, creating a vicious circle of new content that hides the important information. Web forums, also known as discussion boards, are one platform for questions and answers. They present discussions as item streams in which each item contains the contribution of one author. These contributions contain questions and answers in human-readable form. People use search engines to search for information on such platforms. However, current search engines are optimized neither to highlight individual questions and answers nor to show which questions are asked often and which ones are already answered. To close this gap, this thesis introduces the Effingo system. The Effingo system is intended to extract forums from around the Web and find question and answer items. It also needs to link equal questions and aggregate associated answers. That way it is possible to find out whether a question has been asked before and whether it has already been answered. Based on this information, it is possible to derive the most urgent questions from the system and to determine which ones are new and which ones are discussed and answered frequently. As a result, users are prevented from creating useless discussions, which reduces server load and information overload for further searches.
The first research area explored by this thesis is forum data extraction. The results from this area are intended to be used to create a database of forum posts that is as large as possible. Furthermore, the thesis uses question-answer detection to find out which forum items are questions and which ones are answers and, finally, topic detection to aggregate questions on the same topic as well as to discover duplicate answers. These areas are either extended by Effingo, using forum-specific features such as the user graph, forum item relations and forum link structure, or adapted as a means to cope with the specific problems created by user-generated content. Such problems arise from poorly written and very short texts as well as from hidden or distributed information.

Klemens Muthmann
