Multi-level mining and visualization of scientific text collections: Exploring a bi-lingual scientific repository

We present a system for mining and visualizing collections of scientific documents, which lets users semantically browse information extracted from single publications or aggregated across corpora of articles. The text mining tool performs deep analysis of document collections, allowing the extraction and interpretation of research papers' contents. In addition to extracting metadata (titles, authors, affiliations, etc.) and enriching documents with it, the deep analysis comprises semantic interpretation, rhetorical analysis of sentences, triple-based information extraction, and text summarization. The visualization components allow geography-based exploration of collections, interpretation of topic evolution, and analysis of collaboration networks, among others. The paper presents a case study of a bi-lingual collection in the field of Natural Language Processing (NLP).
