The Computer Science Ontology: A Comprehensive Automatically-Generated Taxonomy of Research Areas

Ontologies of research areas are important tools for characterizing, exploring, and analyzing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics, and its latest version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 14K topics and 162K semantic relationships. It was created by applying the Klink-2 algorithm to a very large data set of 16M scientific articles. CSO offers two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO, we have also released the CSO Classifier, a tool for automatically classifying research papers, and the CSO Portal, a Web application that enables users to download, explore, and provide granular feedback on CSO. Through the portal, users can navigate and visualize sections of the ontology, rate topics and relationships, and suggest missing ones. The portal will also host regular new releases of CSO, with the aim of providing a comprehensive resource to the various research communities engaged with scholarly data.
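
To make the idea of a topic taxonomy with semantic relationships concrete, the following is a minimal sketch of how broader-than links between topics could be used to tag a text and infer more general areas. It is an illustration under stated assumptions, not the CSO Classifier or the Klink-2 algorithm: the toy `BROADER` mapping, its entries, and the helper names `ancestors` and `tag_text` are hypothetical, and the literal substring matching is deliberately naive.

```python
from collections import deque

# Hypothetical toy fragment: topic -> set of broader (parent) topics.
# These entries are illustrative only and are not taken from CSO.
BROADER = {
    "ontology learning": {"semantic web"},
    "linked data": {"semantic web"},
    "semantic web": {"world wide web"},
    "world wide web": {"computer science"},
}


def ancestors(topic, broader=BROADER):
    """Return every topic reachable by following broader-than links upward."""
    seen = set()
    queue = deque([topic])
    while queue:
        current = queue.popleft()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def tag_text(text, broader=BROADER):
    """Naive tagging: match topic labels literally, then add their ancestors.

    Returns a pair (explicit, inferred): topics found in the text, and
    broader topics implied by them but not mentioned directly.
    """
    lowered = text.lower()
    # All known labels: every child plus every parent in the mapping.
    labels = set(broader) | {p for parents in broader.values() for p in parents}
    explicit = {label for label in labels if label in lowered}
    inferred = set()
    for label in explicit:
        inferred |= ancestors(label, broader)
    return explicit, inferred - explicit


if __name__ == "__main__":
    found, implied = tag_text("We study ontology learning over linked data.")
    print(sorted(found))
    print(sorted(implied))
```

A breadth-first walk is used so that topics with several parents, or parents shared by several matched topics, are visited only once; a real ontology of this size would more likely be loaded from a released file than hard-coded.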
