Linking Educational Resources on Data Science

The availability of massive datasets in genetics, neuroimaging, mobile health, and other subfields of biology and medicine promises new insights but also poses significant challenges. To realize the potential of big data in biomedicine, the National Institutes of Health launched the Big Data to Knowledge (BD2K) initiative, funding several centers of excellence in biomedical data analysis and a Training Coordinating Center (TCC) tasked with facilitating online and in-person training of biomedical researchers in data science. A major goal of the BD2K TCC is to automatically identify, describe, and organize data science training resources available on the Web and to provide personalized training paths for users. In this paper, we describe the construction of ERuDIte, the Educational Resource Discovery Index for Data Science, and its release as linked data. ERuDIte contains over 11,000 training resources, including courses, video tutorials, conference talks, and other materials. The metadata for these resources is described uniformly using Schema.org. We use machine learning techniques to tag each resource with concepts from the Data Science Education Ontology, which we developed to further describe resource content. Finally, we map references to people and organizations in learning resources to entities in DBpedia, DBLP, and ORCID, embedding our collection in the web of linked data. We hope that ERuDIte will provide a framework to foster open linked educational resources on the Web.
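
To make the idea of a uniform Schema.org description concrete, the following is a minimal sketch of how one training resource might be serialized as JSON-LD, with concept tags and a link from a person to an external identifier. It is an illustration only, not the ERuDIte schema or pipeline: the function name `describe_resource`, the choice of the `Course`, `author`, `provider`, `about`, and `sameAs` terms, and all example values (including the placeholder URLs) are assumptions made here.

```python
import json


def describe_resource(name, url, provider, author, author_same_as, concepts):
    """Build a Schema.org JSON-LD record for one training resource.

    Hypothetical helper: the property selection below is an assumption,
    not the mapping used by the paper's authors.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Course",
        "name": name,
        "url": url,
        "provider": {"@type": "Organization", "name": provider},
        "author": {
            "@type": "Person",
            "name": author,
            # sameAs points at external identifiers (e.g. ORCID or DBpedia URIs)
            "sameAs": list(author_same_as),
        },
        # Concept tags, e.g. labels drawn from a topic ontology
        "about": [{"@type": "Thing", "name": c} for c in concepts],
    }


# Placeholder values for illustration only; none of these are real records.
record = describe_resource(
    name="Introduction to Machine Learning",
    url="https://example.org/courses/intro-ml",
    provider="Example University",
    author="Jane Doe",
    author_same_as=["https://example.org/people/jane-doe"],
    concepts=["Machine Learning", "Classification"],
)

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```

Keeping people and organizations as nested typed nodes with `sameAs` links is one simple way to connect a record to external linked-data entities; other modelings are equally plausible.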
