Rediscovering 15 + 2 years of discoveries in language resources and evaluation

This paper analyzes the content of the proceedings of the Language Resources and Evaluation Conference (LREC) over 17 years (1998–2014), with the goal of characterizing the LREC community and the topics most relevant to the field. We follow the methodology used in similar studies, including the survey of the IEEE ICASSP conference proceedings from 1976 to 1990, the survey of the Association for Computational Linguistics conference proceedings over 50 years, and the survey of the proceedings of the conferences contained in the ISCA Archive over 25 years (1987–2012). We extend results originally presented at LREC 2014 by including the LREC 2014 proceedings themselves and adding an analysis of several citation graphs. We show how the number of papers and authors evolved over time, including their distribution by gender and affiliation, and we examine collaboration and citation patterns among authors and papers, funding sources for the reported research, and plagiarism and reuse in LREC papers. Results for LREC are compared with similar results for major conferences in related fields. We also trace the evolution of research topics over time and identify the authors who introduced key terms. Finally, we propose a measure of a researcher's notability and report the results of applying it to LREC authors. The study relies on NLP methods that were themselves published in the corpus under study. In addition to providing a revealing characterization of the LRE community, the study demonstrates the need for a system of unique identifiers for authors, papers, and other sources to facilitate this type of analysis.
