Assessing documents' credibility with genetic programming

Example credibility measures how much a classifier can trust an example when building a classification model. It is given by a credibility function, which is application dependent and estimated from a set of factors that influence the credibility of the examples. Here we address automatic document classification and study the credibility of a document according to three factors: content, authorship, and citations. We propose a genetic programming (GP) algorithm to estimate the credibility of training examples and then feed this estimate to a credibility-aware classifier. To that end, we model the authorship and citation data as a complex network and select a set of structural metrics that can be used to estimate credibility. These metrics are then combined with content-related ones and used as terminals for the GP. The GP was tested on a subset of the ACM Digital Library (ACM-DL), and the credibility-aware classifier achieved micro- and macro-F1 scores 5% to 8% better than those of traditional classifiers.
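
The idea of using network and content metrics as GP terminals can be illustrated with a minimal Python sketch. It is not the algorithm evaluated in the paper: the terminal names (`author_degree`, `citation_count`, `term_weight`), the function set, the example individual, and the credibility-weighted vote are all assumptions made for illustration. It shows only how an evolved expression tree could be evaluated into a credibility score and how that score could weight training examples in a classifier's decision.

```python
import operator


def protected_div(a, b):
    # Common GP convention (assumed here): return 1.0 on a near-zero denominator.
    return a / b if abs(b) > 1e-9 else 1.0


# Hypothetical function set; the abstract does not list the one actually used.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}


def evaluate(tree, terminals):
    """Evaluate a GP individual given one example's terminal values.

    A tree is a tuple (op, left, right), a terminal name, or a constant.
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, terminals), evaluate(right, terminals))
    if isinstance(tree, str):
        return terminals[tree]
    return float(tree)


def weighted_vote(examples, tree):
    """Return the label whose training examples have the highest summed credibility."""
    scores = {}
    for label, terminals in examples:
        scores[label] = scores.get(label, 0.0) + evaluate(tree, terminals)
    return max(scores, key=scores.get)


# Hypothetical individual: (author_degree * citation_count) + term_weight
individual = ("+", ("*", "author_degree", "citation_count"), "term_weight")

# Made-up training examples: (class label, terminal values).
examples = [
    ("databases", {"author_degree": 3.0, "citation_count": 2.0, "term_weight": 0.5}),
    ("networks", {"author_degree": 1.0, "citation_count": 1.0, "term_weight": 0.9}),
    ("networks", {"author_degree": 2.0, "citation_count": 1.0, "term_weight": 0.4}),
]

predicted = weighted_vote(examples, individual)
print(predicted)
```

In this toy data an unweighted majority vote would favor `networks`, while the credibility-weighted vote favors the single, more credible `databases` example. In a full GP run, the fitness of each individual would presumably be tied to the quality of the resulting classifier, for instance its F1 on validation data.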
