Summarizing figures, tables, and algorithms in scientific publications to augment search results

Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of each document-element, some document metadata such as the title of the paper and its authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, the author of a document generally assumes that the caption will be read in the context of the surrounding text. When the caption is presented out of context, as in a document-element search result, it may not contain enough information to help the end-user understand the content of the document-element. Consequently, end-users examining document-element search results would want a short “synopsis” of this information presented along with the document-element. Access to the synopsis allows the end-user to quickly understand the content of the document-element without having to download, open, and read the entire document. Furthermore, it may allow the end-user to examine more results than they otherwise would. In this paper, we present the first set of methods to automatically extract this useful information (the synopsis) related to document-elements. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences in the document text, based on their similarity and proximity to the caption and to the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of the different features used.
We also investigate the problem of choosing the optimum synopsis size, one that strikes a balance between the information content and the size of the generated synopses. Finally, we perform a user study to measure how the synopses generated by our proposed method compare with those of other state-of-the-art approaches.
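
As a rough illustration of the similarity and proximity cues described above, the following Python sketch computes three per-sentence features that could be passed to a classifier such as Naïve Bayes or an SVM. It is not the paper's implementation: the feature names (`caption_sim`, `ref_sim`, `proximity`), the use of raw term-count cosine similarity, and the `1 / (1 + distance)` proximity score are all assumptions made for this example, and the classifier itself is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_features(sentences, caption, ref_indices):
    """Return one feature dict per sentence (hypothetical feature set).

    ref_indices holds the positions of sentences that explicitly refer to
    the document-element, e.g. a sentence containing "Figure 2".
    """
    caption_tokens = tokenize(caption)
    ref_tokens = [tokenize(sentences[i]) for i in ref_indices]
    features = []
    for i, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        # Distance, in sentences, to the nearest reference sentence.
        dist = min((abs(i - r) for r in ref_indices), default=len(sentences))
        features.append({
            "caption_sim": cosine(tokens, caption_tokens),
            "ref_sim": max((cosine(tokens, r) for r in ref_tokens), default=0.0),
            "proximity": 1.0 / (1.0 + dist),
        })
    return features
```

Given a labelled set of sentences, each feature dict would become one training example. Capping how many top-scoring sentences are kept is one simple way to trade information content against synopsis size.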

[1]  I. C. Mogotsi,et al.  Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to information retrieval , 2010, Information Retrieval.

[2]  Jade Goldstein-Stewart,et al.  The use of MMR, diversity-based reranking for reordering documents and producing summaries , 1998, SIGIR '98.

[3]  Divesh Srivastava,et al.  Meta-data indexing for XPath location steps , 2006, SIGMOD Conference.

[4]  George Kollios,et al.  Complex Spatio-Temporal Pattern Queries , 2005, VLDB.

[5]  Federico Girosi,et al.  Support Vector Machines: Training and Applications , 1997 .

[6]  Rebecca J. Passonneau,et al.  Generating Summaries of Work Flow Diagrams , 2007 .

[7]  Carol Tenopir,et al.  Finding and using journal-article components: Impacts of disaggregation on teaching and research practice , 2008, J. Assoc. Inf. Sci. Technol..

[8]  Youngjoong Ko,et al.  An effective sentence-extraction technique using contextual information and statistical approaches for text summarization , 2008, Pattern Recognition Letters.

[9]  Simone Teufel,et al.  Sentence extraction as a classification task , 1997 .

[10]  Francine Chen,et al.  A trainable document summarizer , 1995, SIGIR '95.

[11]  M. F. Porter,et al.  An algorithm for suffix stripping , 1997 .

[12]  Preslav Nakov,et al.  BioText Search Engine: beyond abstract search , 2007, Bioinform..

[13]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[14]  Chih-Jen Lin,et al.  A Practical Guide to Support Vector Classification , 2008 .

[15]  Federico Girosi,et al.  Training support vector machines: an application to face detection , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Prasenjit Mitra,et al.  Summarizing figures, tables, and algorithms in scientific publications to augment search results , 2012 .

[17]  C. Lee Giles,et al.  Automatic Extraction of Data Points and Text Blocks from 2-Dimensional Plots in Digital Documents , 2008, AAAI.

[18]  Kun Bai,et al.  TableSeer: automatic table metadata extraction and searching in digital libraries , 2007, JCDL '07.

[19]  Ingrid Zukerman,et al.  Exploring and Exploiting the Limited Utility of Captions in Recognizing Intention in Information Graphics , 2005, ACL.

[20]  Mark Sanderson,et al.  Advantages of query biased summaries in information retrieval , 1998, SIGIR '98.

[21]  Robert P. Futrelle.  Handling Figures in Document Summarization , 2004 .

[22]  Shibamouli Lahiri,et al.  Generating synopses for document-element search , 2009, CIKM.

[23]  Jaime Carbonell,et al.  Multi-Document Summarization By Sentence Extraction , 2000 .

[24]  Jade Goldstein-Stewart,et al.  Summarizing text documents: sentence selection and evaluation metrics , 1999, SIGIR '99.

[25]  Chew Lim Tan,et al.  Associating text and graphics for scientific chart understanding , 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[26]  M. Corio,et al.  Generation of texts for information graphics , 1999 .

[27]  Guoping Wang,et al.  Learning with progressive transductive Support Vector Machine , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[28]  Ryen W. White,et al.  A task-oriented study on the influencing effects of query-biased summarisation in web searching , 2003, Inf. Process. Manag..

[29]  George R. Thoma,et al.  Annotation and retrieval of clinically relevant images , 2009, Int. J. Medical Informatics.

[30]  Chih-Jen Lin,et al.  Probability Estimates for Multi-class Classification by Pairwise Coupling , 2003, J. Mach. Learn. Res..

[31]  Tapas Kanungo,et al.  Machine Learned Sentence Selection Strategies for Query-Biased Summarization , 2008 .

[32]  Hans Peter Luhn,et al.  The Automatic Creation of Literature Abstracts , 1958, IBM J. Res. Dev..

[33]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[34]  Stephen E. Robertson,et al.  GatfordCentre for Interactive Systems ResearchDepartment of Information , 1996 .

[35]  Neil C. Rowe,et al.  Natural-language retrieval of images based on descriptive captions , 1996, TOIS.

[36]  Guoping Wang,et al.  Learning with progressive transductive support vector machine , 2003, Pattern Recognit. Lett..

[37]  Radford M. Neal.  Pattern Recognition and Machine Learning , 2007, Technometrics.

[38]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[39]  Andreas Christmann,et al.  Support vector machines , 2008, Data Mining and Knowledge Discovery Handbook.

[40]  Mark T. Maybury,et al.  Advances in Automatic Text Summarization , 1999 .

[41]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[42]  Robert P. Futrelle,et al.  Summarization of Diagrams in Documents , 1999 .

[43]  Elizabeth D. Liddy,et al.  Advances in Automatic Text Summarization , 2001, Information Retrieval.

[44]  C. Lee Giles,et al.  Finding algorithms in scientific articles , 2010, WWW '10.