Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference

In today's technology-dominated world, interoperability of systems managed by different organisations is an essential property for providing services at a global scale. In the field of Text and Data Mining (TDM), interoperability of systems offering access to text corpora is an opportunity to increase the uptake and impact of TDM applications. The global corpus of all research papers, a collection of human knowledge so large that no one could read it in a lifetime, represents one of the most exciting opportunities for TDM. The Open Access movement, which advocates free availability of research papers and the right to reuse them for TDM, has achieved some major successes on the legal front, but the technical interoperability of systems offering free access to research papers remains a challenge. COnnecting REpositories (CORE) (Knoth and Zdrahal, 2012) aggregates the world's open access full-text scientific manuscripts from repositories, journals and publisher systems. One of the main goals of CORE is to harmonise and pre-process these data to lower the barrier for TDM. In this paper, we report the preliminary results of an interoperability survey of systems provided by journal publishers, both open access and toll access. The survey helps us assess the current level of interoperability of these systems and suggest ways forward.
