Cost-sensitive selective naive Bayes classifiers for predicting the increase of the h-index for scientific journals

The machine learning community is interested not only in maximizing classification accuracy, but also in minimizing the distance between the actual and the predicted class. Approaches such as cost-sensitive learning have been proposed to address this problem. In this paper, we propose two greedy wrapper forward cost-sensitive selective naive Bayes approaches. Both approaches readjust the probability thresholds of each class to select the class with the minimum expected cost. The first algorithm (CS-SNB-Accuracy) considers adding each variable to the model and measures the performance of the resulting model on the training data. The variable that most improves the accuracy, that is, the percentage of instances whose readjusted class matches the actual class, is permanently added to the model. In contrast, the second algorithm (CS-SNB-Cost) adds the variables that reduce the misclassification cost, that is, the distance between the readjusted class and the actual class. We tested our algorithms on the prediction of bibliometric indices. Given the popularity of the h-index, we built several prediction models to forecast the annual increase of the h-index for Neurosciences journals over a four-year time horizon. Results show that our approaches, particularly CS-SNB-Accuracy, achieved higher accuracy than the analyzed cost-sensitive classifiers and Bayesian classifiers. Furthermore, CS-SNB-Cost always achieved a lower average cost than all analyzed cost-sensitive and cost-insensitive classifiers. These cost-sensitive selective naive Bayes approaches outperform the selective naive Bayes in terms of accuracy and average cost, so the cost-sensitive learning approach could also be applied to other probabilistic classification approaches.
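
As an illustration of the two ideas summarized above (choosing the class with the minimum expected cost, and greedy forward variable selection driven by either accuracy or average cost), the following is a minimal Python sketch. It is not the authors' implementation: the function names, the absolute-distance cost matrix, the categorical naive Bayes with Laplace smoothing, and the stopping rule (stop when no variable strictly improves the score) are all assumptions made for the example.

```python
from math import exp, log

def abs_cost(n_classes):
    """Assumed ordinal cost matrix: cost[actual][predicted] = |actual - predicted|."""
    return [[abs(i - j) for j in range(n_classes)] for i in range(n_classes)]

def min_expected_cost_class(posterior, cost):
    """Return the class j that minimises sum_i posterior[i] * cost[i][j]."""
    n = len(posterior)
    expected = [sum(posterior[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)

def nb_posterior(X, y, features, x, n_classes):
    """Naive Bayes posterior for instance x using only the given feature indices.
    Features are categorical; probabilities use Laplace smoothing."""
    log_p = []
    for c in range(n_classes):
        rows = [r for r, label in zip(X, y) if label == c]
        lp = log((len(rows) + 1) / (len(y) + n_classes))
        for f in features:
            n_values = len({r[f] for r in X})
            match = sum(1 for r in rows if r[f] == x[f])
            lp += log((match + 1) / (len(rows) + n_values))
        log_p.append(lp)
    top = max(log_p)
    weights = [exp(v - top) for v in log_p]  # subtract max for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

def evaluate(X, y, features, cost, n_classes):
    """Accuracy and average cost of the cost-readjusted predictions on the training data."""
    preds = [min_expected_cost_class(nb_posterior(X, y, features, x, n_classes), cost)
             for x in X]
    accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
    avg_cost = sum(cost[t][p] for p, t in zip(preds, y)) / len(y)
    return accuracy, avg_cost

def greedy_forward(X, y, cost, n_classes, criterion="accuracy"):
    """Greedy wrapper forward selection. criterion="accuracy" mirrors the idea behind
    CS-SNB-Accuracy, criterion="cost" the idea behind CS-SNB-Cost (sketch only)."""
    def score(features):
        accuracy, avg_cost = evaluate(X, y, features, cost, n_classes)
        return accuracy if criterion == "accuracy" else -avg_cost

    selected, best = [], score([])
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(score(selected + [f]), f) for f in remaining]
        cand_score = max(s for s, _ in scored)
        if cand_score <= best:  # assumed stopping rule: no strict improvement
            break
        cand = next(f for s, f in scored if s == cand_score)
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected

# Toy data (made up): feature 0 determines the class, feature 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 0, 1, 1, 2, 2]
cost = abs_cost(3)
print(min_expected_cost_class([0.4, 0.25, 0.35], cost))
print(greedy_forward(X, y, cost, 3, "accuracy"), greedy_forward(X, y, cost, 3, "cost"))
```

In the first call, the most probable class is 0, but the expected costs are 0.95, 0.75 and 1.05 for classes 0, 1 and 2, so the readjusted prediction is the middle class 1. This is the sense in which a cost matrix shifts the effective probability thresholds of each class.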
