LCQMC: A Large-scale Chinese Question Matching Corpus

The lack of large-scale question matching corpora greatly limits the development of matching methods in question answering (QA) systems, especially for non-English languages. To ameliorate this situation, we introduce a large-scale Chinese question matching corpus (named LCQMC), which is released to the public. LCQMC is more general than a paraphrase corpus because it focuses on intent matching rather than paraphrase. The key challenge in constructing such a corpus is collecting a large number of question pairs that take different linguistic forms yet may express the same intent. We first use a search engine to collect large-scale question pairs related to high-frequency words from various domains, then filter out irrelevant pairs using the Wasserstein distance, and finally recruit three annotators to manually check the remaining pairs. This process yields a question matching corpus of 260,068 question pairs. To validate LCQMC, we split it into three parts, i.e., a training set of 238,766 question pairs, a development set of 8,802 question pairs, and a test set of 12,500 question pairs, and evaluate several well-known sentence matching methods on it. The experimental results not only demonstrate the good quality of LCQMC but also provide solid baseline performance for further research on this corpus.
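
The distance-based filtering step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the Wasserstein distance is computed, so the code below swaps in a relaxed lower bound on the word-level transport cost, in which each word sends all of its weight to the nearest word of the other question. The embedding table, the threshold value, the English stand-in tokens, and the function names (`relaxed_wmd`, `filter_pairs`) are all hypothetical, and questions are assumed to be already segmented into words.

```python
import math
from collections import Counter

# Hypothetical 2-d toy embeddings (assumption); a real pipeline would
# load trained word vectors for segmented Chinese text.
TOY_EMBEDDINGS = {
    "how": (0.0, 1.0),
    "learn": (1.0, 0.0),
    "study": (0.9, 0.1),
    "piano": (0.5, 0.5),
    "weather": (-1.0, -1.0),
    "today": (-0.9, -1.1),
}


def _word_weights(tokens):
    """Normalized bag-of-words weights that sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def _one_sided_cost(src, dst, emb):
    """Cost when every source word moves all its weight to its nearest target word."""
    targets = set(dst)
    return sum(
        weight * min(math.dist(emb[word], emb[other]) for other in targets)
        for word, weight in _word_weights(src).items()
    )


def relaxed_wmd(a, b, emb=TOY_EMBEDDINGS):
    """Relaxed lower bound on the transport distance between two token lists."""
    return max(_one_sided_cost(a, b, emb), _one_sided_cost(b, a, emb))


def filter_pairs(pairs, threshold, emb=TOY_EMBEDDINGS):
    """Keep only candidate pairs whose relaxed distance is within the threshold."""
    return [(a, b) for a, b in pairs if relaxed_wmd(a, b, emb) <= threshold]


candidates = [
    (["how", "learn", "piano"], ["how", "study", "piano"]),
    (["how", "learn", "piano"], ["weather", "today"]),
]
# The threshold is an illustrative assumption, not a value from the paper.
kept = filter_pairs(candidates, threshold=0.5)
```

With these toy vectors, only the first candidate pair falls under the threshold and would be passed on to the annotators; the second pair is discarded as irrelevant. In practice the threshold would need to be tuned on held-out data.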
