Building a Chinese discourse topic corpus with a micro-topic scheme based on theme-rheme theory

Background: Building a suitable discourse topic structure is an important problem in discourse topic analysis, which is central to natural language understanding. Such a structure is a key basic unit for automatic computation, and it is also key to transforming unstructured data into structured data in big data analytics. Although discourse topic structure has wide potential application in discourse analysis and related tasks, research on constructing such discourse resources for Chinese is quite limited. In this paper, we propose a micro-topic scheme (MTS) to represent discourse topic structure in Chinese according to theme-rheme theory, with the elementary discourse topic unit (EDTU) as the node and the theme-rheme referent as the link. In particular, thematic progression is employed to directly represent the development of the discourse topic structure.

Results: Guided by the MTS, we manually annotate a Chinese Discourse Topic Corpus (CDTC) of 500 documents. In two preliminary identification experiments, we obtain F1 values of 89.9 and 72.15, respectively, which suggests that the proposed representation lends itself to automatic computation.

Conclusion: The lack of a formal representation system and related corpus resources for Chinese discourse topic structure has greatly restricted the study of discourse topic analysis and, in turn, the development of natural language understanding. To address these issues, a micro-topic scheme (MTS) representation is proposed based on functional grammar theory, and the corresponding corpus resource (i.e., CDTC) is constructed. Our preliminary evaluation supports the appropriateness of the MTS for Chinese discourse analysis and the usefulness of the CDTC.
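To make the node-and-link idea of the MTS more concrete, the following is a minimal Python sketch, not the authors' annotation format. It assumes each EDTU is stored as a theme-rheme pair and that a link records whether the current theme refers back to an earlier theme or to an earlier rheme. The names `EDTU`, `theme_ref`, and `progression_links`, the pattern labels "constant" and "linear", and the example sentences are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class EDTU:
    """One elementary discourse topic unit, split into theme and rheme."""
    index: int
    theme: str
    rheme: str
    # Assumption: which earlier unit, and which part of it ("theme" or
    # "rheme"), the current theme refers back to. None starts a new topic.
    theme_ref: Optional[Tuple[int, str]] = None


def progression_links(units: List[EDTU]) -> List[Tuple[int, int, str]]:
    """Return (source index, target index, pattern) for each referent link."""
    links = []
    for unit in units:
        if unit.theme_ref is None:
            continue  # no referent, so no link into this unit
        source, part = unit.theme_ref
        # Theme picks up an earlier theme: the topic is held constant.
        # Theme picks up an earlier rheme: the topic moves on linearly.
        pattern = "constant" if part == "theme" else "linear"
        links.append((source, unit.index, pattern))
    return links


# Invented three-unit example, not taken from the CDTC.
units = [
    EDTU(0, "The corpus", "contains 500 documents"),
    EDTU(1, "These documents", "are news articles", theme_ref=(0, "rheme")),
    EDTU(2, "It", "was annotated manually", theme_ref=(0, "theme")),
]
links = progression_links(units)
```

Under these assumptions, the links form a small directed graph over EDTUs, and the sequence of patterns gives one simple view of how the topic develops through the text.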
