Learning to summarise XML documents using content and structure

Documents formatted in eXtensible Markup Language (XML) are becoming increasingly available in collections of various document types. In this paper, we present an approach for the summarisation of XML documents. The novelty of this approach is that it draws on features not only from the content of documents but also from their logical structure. We follow a machine-learning, sentence-extraction-based summarisation technique. To find which features are most effective for producing summaries, the approach treats sentence extraction as an ordering task. We evaluated our summarisation model on the INEX dataset. The results demonstrate that including features from the logical structure of documents increases the effectiveness of the summariser, and that the learnable system is also effective and well suited to summarising XML documents.
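To make the idea of sentence extraction as an ordering task concrete, the following is a minimal sketch and not the model evaluated in the paper. It assumes a linear scoring function trained with a pairwise logistic loss by gradient descent, which is one common way to learn an ordering; the abstract does not specify the learning algorithm. The three features (title overlap, first-paragraph flag, normalised element depth) and the toy data are hypothetical, chosen only to show one content feature alongside two structural ones.

```python
import math

def score(weights, features):
    """Linear score of one sentence from its feature vector."""
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise_ranker(sentences, labels, epochs=200, lr=0.1):
    """Learn weights so that summary sentences (label 1) score above
    non-summary sentences (label 0). Assumption: pairwise logistic loss."""
    n_features = len(sentences[0])
    weights = [0.0] * n_features
    # Every (summary, non-summary) pair is one ordering constraint.
    pairs = [(p, n)
             for p, lp in zip(sentences, labels) if lp == 1
             for n, ln in zip(sentences, labels) if ln == 0]
    for _ in range(epochs):
        for pos, neg in pairs:
            margin = score(weights, pos) - score(weights, neg)
            # Derivative of log(1 + exp(-margin)) with respect to the margin.
            g = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                weights[i] -= lr * g * (pos[i] - neg[i])
    return weights

def rank_sentences(weights, sentences):
    """Return sentence indices ordered from highest to lowest score."""
    return sorted(range(len(sentences)),
                  key=lambda i: score(weights, sentences[i]),
                  reverse=True)

# Hypothetical feature vectors, one per sentence:
# [title_overlap (content), in_first_paragraph (structure), element_depth (structure)]
sentences = [
    [0.8, 1.0, 0.2],
    [0.1, 0.0, 0.9],
    [0.6, 1.0, 0.3],
    [0.2, 0.0, 0.7],
]
labels = [1, 0, 1, 0]  # 1 = sentence belongs to the reference summary

weights = train_pairwise_ranker(sentences, labels)
ranking = rank_sentences(weights, sentences)
summary_indices = ranking[:2]  # extract the two top-ranked sentences
```

Inspecting the sign and size of each learned weight gives a rough indication of which features help the ordering, which is the kind of question the ordering formulation is meant to answer.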
