Citance-based retrieval and summarization using IR and machine learning

We consider three problems posed by the CL-SciSumm series of shared tasks. Given a reference document D and a set $$C_D$$ of citances for D: (1) find the span of reference text that corresponds to each citance $$c \in C_D$$, (2) identify the facet corresponding to each span of reference text from a predefined list of five facets, and (3) construct a summary of at most 250 words for D based on the reference spans. The shared task provided annotated training and test sets for these problems. This paper describes our efforts and the results achieved for each problem, and discusses some interesting parameters of the datasets, which may spur further improvements and innovations.
