A Multilingual Approach to Question Classification

In this paper we present the Konstanz Resource of Questions (KRoQ), the first dependency-parsed, parallel multilingual corpus of information-seeking and non-information-seeking questions. To create the corpus, we employ a linguistically motivated rule-based system that uses linguistic cues from one language to help classify and annotate questions in the other languages. Our current corpus includes German, French, Spanish and Koine Greek. Using the linguistically motivated heuristics we identify, a two-step scoring mechanism assigns intra- and inter-language scores to each question. Based on these scores, each question is classified as either information-seeking or non-information-seeking. An evaluation shows that this mechanism classifies questions correctly in 79% of cases. We release our corpus as a basis for further work on question classification. It can be used as training and testing data for machine-learning algorithms, as corpus data for theoretical linguistic research, or as a resource for further rule-based approaches to question identification.
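The two-step scoring idea can be illustrated with a minimal Python sketch. The abstract does not give the actual cues, weights, combination rule, or threshold, so everything named here (`CUE_WEIGHTS`, `intra_score`, `inter_score`, `classify`, the example cue words and their values, and the averaging step) is a hypothetical stand-in, not the system described above. The sketch first scores each language version of a question from its own cues, then adds the mean score of its parallel translations, and finally thresholds the sum.

```python
# Illustrative placeholder cues, NOT the heuristics used for KRoQ.
# Positive weights lean toward information-seeking (ISQ),
# negative weights toward non-information-seeking (NISQ).
CUE_WEIGHTS = {
    "de": {"denn": 1.0, "schon": -1.0},
    "es": {"acaso": -1.0},
    "fr": {},  # no cues defined in this toy example
}


def intra_score(lang, tokens):
    """Step 1: score a question using only cues from its own language."""
    weights = CUE_WEIGHTS.get(lang, {})
    return sum(weights.get(tok.lower(), 0.0) for tok in tokens)


def inter_score(own_lang, intra_scores):
    """Step 2: average the intra-language scores of the parallel versions."""
    others = [s for lang, s in intra_scores.items() if lang != own_lang]
    return sum(others) / len(others) if others else 0.0


def classify(parallel_question, threshold=0.0):
    """Label each language version as 'ISQ' or 'NISQ' (assumed rule: sum >= threshold)."""
    intra = {lang: intra_score(lang, toks)
             for lang, toks in parallel_question.items()}
    labels = {}
    for lang in parallel_question:
        total = intra[lang] + inter_score(lang, intra)
        labels[lang] = "ISQ" if total >= threshold else "NISQ"
    return labels


if __name__ == "__main__":
    question = {
        "de": ["Wer", "glaubt", "das", "schon", "?"],
        "es": ["¿", "Quién", "cree", "eso", "?"],
        "fr": ["Qui", "croit", "cela", "?"],
    }
    print(classify(question))
```

In this toy example only the German version contains a cue, yet the inter-language step propagates its negative score to the Spanish and French versions, which is the kind of cross-lingual transfer the abstract describes.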
