A survey on scholarly data: From big data perspective

Surveys big scholarly data with respect to the different phases of the big data lifecycle. Identifies the big data tools and technologies that can be used to develop scholarly applications. Investigates research challenges and limitations specific to big scholarly data and its applications. Provides research directions and paves the way towards a generic and comprehensive big scholarly data platform.

Recently, organizations and governments have shifted their focus towards digitizing academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of the data generated satisfy the definition of big data, and this scholarly reserve is therefore popularly referred to as big scholarly data. To facilitate analytics on big scholarly data, suitable architectures and services need to be developed. Research problems have evolved to become essentially interdisciplinary. As a result, there is a growing demand for scholarly applications such as collaborator discovery, expert finding and research recommendation systems, among others. This paper investigates current trends and identifies existing challenges in the development of a big scholarly data platform, with a specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.
