Big Scholarly Data: A Survey

With the rapid growth of digital publishing, harvesting, managing, and analyzing scholarly information has become increasingly challenging. The term Big Scholarly Data has been coined for this rapidly growing body of scholarly data, which covers millions of authors, papers, citations, figures, and tables, as well as scholarly networks and digital libraries. Scholarly data of many kinds is now easy to access, and powerful data analysis technologies are being developed, enabling us to examine science itself from a new perspective. In this paper, we examine the background and state of the art of big scholarly data. First, we introduce the background of scholarly data management and the relevant technologies. Second, we review data analysis methods for dealing with big scholarly data, such as statistical analysis, social network analysis, and content analysis. Finally, we look into representative research issues in this area, including scientific impact evaluation, academic recommendation, and expert finding. For each issue, we cover the background, main challenges, and latest research. These discussions aim to provide a comprehensive review of this emerging area. The paper concludes with a discussion of open issues and promising future directions.
