Machine Identification of High Impact Research through Text and Image Analysis

The volume of academic paper submissions and publications is growing at an ever-increasing rate. While this flood of research promises progress in many fields, the sheer volume of output also increases the amount of noise. We present a system that automatically separates papers with a high likelihood of gaining citations from those with a low likelihood, as a means of quickly finding high-impact, high-quality research. Our system uses both a visual classifier, which captures a document's overall appearance, and a text classifier, which makes content-informed decisions. Current work in the field focuses on small datasets composed of papers from individual conferences. Attempts to apply similar techniques to larger datasets generally consider only excerpts of the documents, such as the abstract, potentially discarding valuable data. We address these issues by providing a dataset of PDF documents and citation counts spanning a decade of output in two separate academic domains: computer science and medicine. This new dataset allows us to extend current work in the field by generalizing across time and academic domain. Moreover, we explore inter-domain prediction models, in which a classifier is evaluated on a domain it was not trained on, to shed further light on this important problem.
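
The inter-domain evaluation described above can be illustrated with a minimal sketch: fit a text classifier on one domain and score it on another. The abstract does not specify the classifiers used, so this sketch swaps in a simple tf-idf nearest-centroid classifier written with the Python standard library only. All function names, the "high"/"low" labels, and the toy documents are hypothetical placeholders, not data or code from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer; a real system would need proper PDF text extraction.
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency, computed from training documents only."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """L2-normalised tf-idf vector; terms unseen during training are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def train(docs, labels):
    """Return (idf, centroids), with one summed tf-idf centroid per label."""
    idf = fit_idf(docs)
    centroids = {}
    for doc, label in zip(docs, labels):
        centroid = centroids.setdefault(label, Counter())
        for term, weight in tfidf(doc, idf).items():
            centroid[term] += weight
    return idf, centroids


def predict(doc, model):
    """Pick the label whose centroid has the highest cosine similarity to doc."""
    idf, centroids = model
    vec = tfidf(doc, idf)

    def score(label):
        centroid = centroids[label]
        norm = math.sqrt(sum(v * v for v in centroid.values()))
        if not norm:
            return 0.0
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items()) / norm

    # sorted() makes tie-breaking deterministic.
    return max(sorted(centroids), key=score)


def accuracy(docs, labels, model):
    hits = sum(predict(d, model) == y for d, y in zip(docs, labels))
    return hits / len(docs)


# Toy placeholder data (invented for illustration only).
cs_docs = [
    "we release a large benchmark dataset and open code",
    "a large scale evaluation with open data and code",
    "short note on a minor variant",
    "preliminary note on a minor extension",
]
cs_labels = ["high", "high", "low", "low"]

med_docs = [
    "large cohort study with open data",
    "preliminary case note",
]
med_labels = ["high", "low"]

# Train on one domain, evaluate on the other.
model = train(cs_docs, cs_labels)
cross_domain_acc = accuracy(med_docs, med_labels, model)
print("cross-domain accuracy on toy data:", cross_domain_acc)
```

The key design point is that the vocabulary and idf weights are fitted on the training domain only, so any term that appears solely in the target domain contributes nothing at test time. This is one reason inter-domain performance can differ from in-domain performance.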
