Methods for the semantic analysis of document markup

We present an approach for investigating what kind of semantic information is regularly associated with the structural markup of scientific articles. This approach addresses the need for an explicit formal description of the semantics of text-oriented XML documents. The domain of our investigation is a corpus of scientific articles in psychology and linguistics, drawn from English and German journals available online.

For our analyses, we provide XML markup representing two semantic levels: the thematic level (i.e. the topics in the text world that the article is about) and the functional or rhetorical level. Our hypothesis is that these semantic levels correlate with the articles' document structure, which is also represented in XML. Articles have been annotated with the appropriate information. Each of the three informational levels is modelled in a separate XML document, since in our domain the different description levels might conflict, making it impossible to model them within a single XML document.

For comparing and mining the resulting multi-layered XML annotations of one article, a Prolog-based approach is used. It focusses on the comparison of XML markup that is distributed among different documents. Prolog predicates have been defined for inferring relations between levels of information that are modelled in separate XML documents. We demonstrate how the Prolog tool is applied in our corpus analyses.
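To make the idea of relating markup across separate documents concrete, the following is a minimal sketch in Python, not the Prolog tool described above. It assumes that all layers annotate exactly the same character sequence, so elements can be compared by their character offsets. The element names (`sec`, `method`, `results`, `topic`), the sample text, and the relation labels are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Three hypothetical annotation layers over the SAME primary text,
# each kept in its own XML document.
STRUCTURE = "<article><sec>We test recall.</sec><sec> Recall improved.</sec></article>"
FUNCTION = "<doc><method>We test recall.</method><results> Recall improved.</results></doc>"
THEME = "<doc>We test <topic>recall. Recall</topic> improved.</doc>"


def spans(xml_string):
    """Return (tag, start, end) character spans for every element of one layer."""
    root = ET.fromstring(xml_string)
    result = []

    def walk(elem, pos):
        start = pos
        pos += len(elem.text or "")
        for child in elem:
            pos = walk(child, pos)
            pos += len(child.tail or "")
        result.append((elem.tag, start, pos))  # children are appended before parents
        return pos

    walk(root, 0)
    return result


def relation(a, b):
    """Name the relation between two spans taken from different layers."""
    (_, s1, e1), (_, s2, e2) = a, b
    if (s1, e1) == (s2, e2):
        return "identical"
    if s2 <= s1 and e1 <= e2:
        return "included_in"
    if s1 <= s2 and e2 <= e1:
        return "includes"
    if e1 <= s2 or e2 <= s1:
        return "disjoint"
    return "overlaps"  # the case that cannot be nested in a single XML tree


if __name__ == "__main__":
    topic = next(s for s in spans(THEME) if s[0] == "topic")
    for sec in (s for s in spans(STRUCTURE) if s[0] == "sec"):
        print(sec, relation(topic, sec), topic)
```

In this toy example the hypothetical `topic` element crosses the boundary between the two `sec` elements, which is the kind of conflict that motivates keeping each level in its own document.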
