CADRE: A Cloud-Based Data Service for Big Bibliographic Data

Large bibliographic datasets hold the promise of revolutionizing the scientific enterprise when combined with state-of-the-science computational capabilities. However, providing high-quality data services for large network datasets such as the Microsoft Academic Graph, which contains more than two billion citation links, poses significant difficulties for universities. Data systems based on the property graph model can deliver efficient graph query services for large networks, but real-life queries often combine multiple data models. To satisfy the needs of different user groups, we developed and deployed a cloud-based data system consisting of scalable graph and text-indexed query engines. The property graph model also presents a technological barrier for non-expert users; to ease the steep learning curve, we designed an intuitive graphical user interface for query building. For advanced users, a scalable notebook service in our platform provides a more flexible computing environment in which query results can be analyzed further. These systems form the data backbone of the Collaborative Archive and Data Research Environment (CADRE), which provides efficient and high-quality bibliographic data services to eleven large public universities in North America.
