Data-driven Approaches for Information Structure Identification

This paper investigates the automatic identification of Information Structure (IS) in texts. The experiments use the Prague Dependency Treebank, which is annotated with IS following the Praguian approach of Topic-Focus Articulation. We automatically detect t(opic) and f(ocus), using node attributes from the treebank as basic features, together with derived features inspired by the annotation guidelines. We present the performance of decision tree (C4.5), maximum entropy, and rule induction (RIPPER) classifiers on all tectogrammatical nodes. We compare the results against a baseline system that always assigns f(ocus) and against a rule-based system. The best system achieves an accuracy of 90.69%, a 44.73% relative improvement over the baseline (62.66%), or 28.03 percentage points in absolute terms.
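The relationship between the reported figures can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the toy label list are hypothetical and do not come from the paper. Only the two accuracies (62.66% and 90.69%) are taken from the abstract. The sketch assumes that the 44.73% figure is a relative gain, which is the reading consistent with the numbers.

```python
def always_focus_accuracy(gold_labels):
    """Accuracy of a baseline that labels every node 'f' (focus).

    It equals the share of gold 'f' labels. `gold_labels` is a
    hypothetical list of 't'/'f' strings, not the treebank format.
    """
    return sum(label == "f" for label in gold_labels) / len(gold_labels)


def relative_improvement(system_acc, baseline_acc):
    """Gain of the system over the baseline, as a percentage of the baseline."""
    return (system_acc - baseline_acc) / baseline_acc * 100.0


# Toy data, for illustration only (not from the paper).
toy_gold = ["f", "t", "f", "f", "t"]
toy_baseline = always_focus_accuracy(toy_gold)

# Accuracies reported in the abstract, in percent.
baseline_acc = 62.66
best_acc = 90.69

absolute_gain = best_acc - baseline_acc
relative_gain = relative_improvement(best_acc, baseline_acc)

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}%")  # prints "relative gain: 44.73%"
```

Because the baseline always predicts f(ocus), its accuracy is simply the proportion of focus nodes in the evaluation data, so the reported 62.66% also indicates how skewed the label distribution is.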
