Learning-based summarisation of XML documents

Documents formatted in eXtensible Markup Language (XML) are available in collections covering a variety of document types. In this paper, we present an approach to the summarisation of XML documents. The novelty of the approach is that it draws on features not only from the content of documents, but also from their logical structure. We follow a machine learning approach to summarisation based on sentence extraction. To determine which features are most effective for producing summaries, we treat sentence extraction as an ordering task. We evaluated our summarisation model on the INEX and SUMMAC datasets. The results demonstrate that including features from the logical structure of documents increases the effectiveness of the summariser, and that the learnable system is effective and well suited to summarising XML documents. Our approach is generic, and is therefore applicable not only to entire documents but also to elements of varying granularity within the XML tree. We view these results as a step towards the intelligent summarisation of XML documents.
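To make the idea of sentence extraction as an ordering task concrete, the following is a minimal sketch and not the model evaluated in the paper. It assumes a toy XML document, two hypothetical features (a content feature based on overlap with an invented cue-word set, and a structural feature marking the first paragraph of a section), and a simple perceptron-style pairwise update standing in for whatever ranking learner is actually used. The tag names, cue words, labels and learning rate are all illustrative assumptions, and the toy data says nothing about the reported results.

```python
import xml.etree.ElementTree as ET

# Toy document; the tag names (sec, st, p) are assumptions for illustration.
DOC = """<article>
  <sec><st>Introduction</st>
    <p>XML summarisation uses document structure.</p>
    <p>The weather was pleasant that day.</p>
  </sec>
  <sec><st>Results</st>
    <p>Structure features improve XML summarisation.</p>
    <p>We thank the reviewers.</p>
  </sec>
</article>"""

# Hypothetical cue words used for the content feature.
CUE_WORDS = {"xml", "summarisation", "structure"}


def features(text, position):
    """Return [content feature, structural feature] for one sentence."""
    words = {w.strip(".").lower() for w in text.split()}
    overlap = len(words & CUE_WORDS) / len(CUE_WORDS)  # content
    first_in_section = 1.0 if position == 0 else 0.0   # logical structure
    return [overlap, first_in_section]


def extract(doc):
    """Walk the XML tree and return (sentence, feature vector) pairs."""
    root = ET.fromstring(doc)
    rows = []
    for sec in root.iter("sec"):
        for pos, p in enumerate(sec.findall("p")):
            rows.append((p.text, features(p.text, pos)))
    return rows


def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(rows, labels, epochs=10, lr=0.1):
    """Learn weights so summary sentences score above non-summary ones."""
    w = [0.0] * len(rows[0][1])
    pos = [x for (_, x), y in zip(rows, labels) if y == 1]
    neg = [x for (_, x), y in zip(rows, labels) if y == 0]
    for _ in range(epochs):
        for xp in pos:
            for xn in neg:
                diff = [a - b for a, b in zip(xp, xn)]
                # Update only when a pair is ordered incorrectly (or tied).
                if score(w, diff) <= 0:
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def rank(rows, w):
    """Order sentences by decreasing learned score."""
    return sorted(rows, key=lambda r: -score(w, r[1]))


rows = extract(DOC)
labels = [1, 0, 1, 0]  # invented labels: 1 = sentence belongs in the summary
weights = train_pairwise(rows, labels)
summary = [text for text, _ in rank(rows, weights)[:2]]
```

Because the learner only compares pairs of sentences, the learned weights indicate which features help place summary sentences ahead of the others. The same scoring could in principle be applied to the sentences of any element in the tree rather than the whole document.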
