Combining Labeled and Unlabeled Data for Learning Cross-Document Structural Relationships

Multi-document discourse analysis has emerged with the potential of improving various NLP applications. Based on the newly proposed Cross-document Structure Theory (CST), this paper describes an empirical study that classifies CST relationships between sentence pairs extracted from topically related documents, exploiting both labeled and unlabeled data. We investigate a binary classifier for determining existence of structural relationships and a full classifier using the full taxonomy of relationships. We show that in both cases the exploitation of unlabeled data helps improve the performance of learned classifiers.

[1]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[2]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[3]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[4]  Zhu Zhang,et al.  Towards CST-enhanced summarization , 2002, AAAI/IAAI.

[5]  Michael Collins,et al.  Head-Driven Statistical Models for Natural Language Parsing , 2003, CL.

[6]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[7]  Steven P. Abney,et al.  Bootstrapping , 2002, ACL.

[8]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[9]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[10]  Michele Banko,et al.  Scaling to Very Very Large Corpora for Natural Language Disambiguation , 2001, ACL.

[11]  Daniel Marcu,et al.  An Unsupervised Approach to Recognizing Discourse Relations , 2002, ACL.

[12]  Daniel Marcu,et al.  The rhetorical parsing, summarization, and generation of natural language texts , 1998 .

[13]  Dragomir R. Radev A Common Theory of Information Fusion from Multiple Text Sources Step One: Cross-Document Structure , 2000, SIGDIAL Workshop.

[14]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[15]  Yoram Singer,et al.  BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.

[16]  Claire Cardie,et al.  Weakly Supervised Natural Language Learning Without Redundant Views , 2003, NAACL.

[17]  Eugene Charniak,et al.  A Maximum-Entropy-Inspired Parser , 2000, ANLP.

[18]  Michael E. Lesk,et al.  Computer Evaluation of Indexing and Text Processing , 1968, JACM.

[19]  William C. Mann,et al.  Rhetorical Structure Theory: Toward a functional theory of text organization , 1988 .

[20]  Dragomir R. Radev,et al.  Learning cross-document structural relationships using boosting , 2003, CIKM '03.

[21]  Gerard Salton,et al.  Automatic Text Structuring and Summarization , 1997, Inf. Process. Manag..

[22]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.