Scenarios for Advanced Services in an ETD Digital Library

Electronic theses and dissertations (ETDs), typically in PDF, are a largely untapped international resource. Digital libraries (DLs) with advanced services can address the broad need to discover and use ETDs of interest. DLs support indexing, searching, and browsing, but these capabilities are insufficient when only metadata is available. Extending faceted search with full text helps, but it adds noise and reduces precision. Natural language processing (NLP), e.g., information extraction (IE), yields further improvement, but the results remain comparable to those of Google Search. Google Scholar and CiteSeerX, which extract, analyze, and link references in short publications, provide additional capabilities but do not work well with ETDs because of their length, complexity, and domain variations. We are working toward a tailored DL for English ETDs with special services, including processing of references and citations and extraction from the chapters, sections, and subsections that review the literature, state hypotheses, list research questions, explain the approach, describe methods, summarize results, discuss findings, draw conclusions, and provide insights about open problems. Such a domain-independent DL can be prototyped now using advanced NLP and IE techniques coupled with machine learning and information retrieval methods. The result would enable stakeholders to engage in more advanced scenarios.
