MRCRAIG: MapReduce and Ensemble Classifiers for Parallelizing Data Classification Problems

by Glenn Jahnke

In this paper, a novel technique for parallelizing data-classification problems is applied to finding genes in DNA sequences. The technique uses ensemble classification methods such as Bagging and Select Best, and distributes classifier training and prediction using MapReduce. A novel voting algorithm for sequence classification is evaluated within the Bagging method and compared against the Select Best method.
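To make the two ensemble strategies concrete, here is a minimal sketch in Python of how Bagging and Select Best might be expressed as map and reduce steps. It is not the implementation from the paper. The toy per-symbol classifier, the per-position majority vote, the labels `coding` and `noncoding`, and all function names are assumptions made for illustration. In particular, the paper's novel voting algorithm is not described here, so plain majority voting stands in for it.

```python
import random
from collections import Counter
from functools import reduce


def bootstrap(data, rng):
    """Draw a bootstrap sample: len(data) items sampled with replacement."""
    return [rng.choice(data) for _ in data]


def train(sample):
    """Toy stand-in classifier (assumption): map each symbol to the label
    it was seen with most often; unseen symbols get the overall majority."""
    counts = {}
    for symbol, label in sample:
        counts.setdefault(symbol, Counter())[label] += 1
    table = {s: c.most_common(1)[0][0] for s, c in counts.items()}
    default = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda symbol: table.get(symbol, default)


def map_predict(model, sequence):
    """Map step: one model emits a single vote per sequence position."""
    return [Counter([model(symbol)]) for symbol in sequence]


def reduce_votes(left, right):
    """Reduce step: merge two per-position vote tallies."""
    return [a + b for a, b in zip(left, right)]


def bagging_predict(models, sequence):
    """Bagging: every model votes; the majority label wins at each position."""
    votes = reduce(reduce_votes, (map_predict(m, sequence) for m in models))
    return [tally.most_common(1)[0][0] for tally in votes]


def select_best(models, validation):
    """Select Best: keep only the model with the highest validation accuracy."""
    def accuracy(model):
        return sum(model(s) == label for s, label in validation) / len(validation)
    return max(models, key=accuracy)


# Hypothetical labelled data: (symbol, label) pairs.
DATA = [("A", "coding"), ("C", "coding"), ("G", "noncoding"), ("T", "noncoding")] * 10

rng = random.Random(0)
# Training is an independent task per bootstrap sample, so it can be mapped.
models = list(map(train, (bootstrap(DATA, rng) for _ in range(5))))

print(bagging_predict(models, "ACGT"))
best = select_best(models, DATA)
print([best(symbol) for symbol in "ACGT"])
```

Because each model is trained and queried independently, both the training map and the prediction map could be handed to separate workers. Only the vote-merging reduce step, or the accuracy comparison in Select Best, needs results from all of them.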
