Improving Editorial Workflow and Metadata Quality at Springer Nature

Identifying the research topics that best describe the scope of a scientific publication is a crucial task for editors, in particular because the quality of these annotations determines how effectively users are able to discover the right content in online libraries. For this reason, Springer Nature, the world’s largest academic book publisher, has traditionally entrusted this task to its most expert editors. These editors manually analyse all new books, possibly including hundreds of chapters, and produce a list of the most relevant topics. Hence, this process has traditionally been very expensive, time-consuming, and confined to a few senior editors. For these reasons, back in 2016 we developed Smart Topic Miner (STM), an ontology-driven application that assists the Springer Nature editorial team in annotating the volumes of all books covering conference proceedings in Computer Science. Since then, STM has been regularly used by editors in Germany, China, Brazil, India, and Japan, for a total of about 800 volumes per year. Over the past three years, the initial prototype has iteratively evolved in response to user feedback and changing requirements. In this paper we present the most recent version of the tool and describe the evolution of the system over the years, the key lessons learnt, and the impact on the Springer Nature workflow. In particular, our solution has drastically reduced the time needed to annotate proceedings and significantly improved their discoverability, resulting in 9.3 million additional downloads. We also present a user study involving 9 editors, which yielded excellent results in terms of usability, and report an evaluation of the new topic classifier used by STM, which outperforms previous versions in recall and F-measure.
