TechMiner: Extracting Technologies from Academic Publications

In recent years we have seen the emergence of a variety of scholarly datasets. Typically these capture 'standard' scholarly entities and their connections, such as authors, affiliations, venues, publications, and citations. However, as the repositories grow and the technology improves, researchers are adding new entities to these repositories to develop a richer model of the scholarly domain. In this paper, we introduce TechMiner, a new approach that combines NLP, machine learning, and semantic technologies to mine technologies from research publications and generate an OWL ontology describing their relationships with other research entities. The resulting knowledge base can support a number of tasks, such as: richer semantic search, which can exploit the technology dimension to support better retrieval of publications; richer expert search; monitoring the emergence and impact of new technologies, both within and across scientific fields; and studying the scholarly dynamics associated with the emergence of new technologies. TechMiner was evaluated on a manually annotated gold standard; the results indicate that it significantly outperforms alternative NLP approaches and that its semantic features yield significant gains in both recall and precision.
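As a rough illustration of the evaluation mentioned above, the following minimal sketch computes precision and recall for extracted technology mentions against a manually annotated gold standard. The function name `precision_recall`, the `(paper_id, technology)` pair representation, and the sample data are assumptions made for illustration; they are not taken from the paper and do not reproduce its evaluation protocol.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (paper_id, technology) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations, invented for this example only.
gold = {("p1", "DBpedia Spotlight"), ("p1", "LingPipe"), ("p2", "WordNet")}
predicted = {("p1", "DBpedia Spotlight"), ("p2", "WordNet"), ("p2", "Semantic Web")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Treating each mention as a (paper, technology) pair means a technology counts as correct only when it is attributed to the right publication, which is one plausible way to score this kind of extraction task.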
