Paper information - Protein classification with imbalanced data

Protein classification with imbalanced data

Protein classification is generally a multi-class problem that can be reduced to a set of binary classification problems, with one classifier designed for each class. The proteins in a class are treated as positive examples and those outside it as negative examples. This reduction, however, gives rise to a class-imbalance problem, because the number of proteins in one class is usually much smaller than the number outside it. As a result, classifiers trained on the imbalanced data tend to overfit and to perform poorly, particularly on the minority class.
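The reduction described above can be illustrated with a minimal sketch. It is not the paper's method: the class names, class sizes, and the helper functions `one_vs_rest_labels` and `imbalance_ratio` are hypothetical, chosen only to show how a one-classifier-per-class split leaves each binary problem with far more negatives than positives.

```python
def one_vs_rest_labels(labels):
    """Map each class to a binary label vector (1 = in class, 0 = outside)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def imbalance_ratio(binary_labels):
    """Return the number of negative examples per positive example."""
    pos = sum(binary_labels)
    neg = len(binary_labels) - pos
    return neg / pos


# Hypothetical toy data: four protein classes of unequal size (assumption).
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5

problems = one_vs_rest_labels(labels)
ratios = {c: imbalance_ratio(b) for c, b in problems.items()}

for c in sorted(ratios):
    print(f"class {c}: {sum(problems[c])} positives, "
          f"{ratios[c]:.1f} negatives per positive")
```

In this toy setting the smallest class faces 19 negatives for every positive, even though the full data set holds only 100 examples. That skew is what the abstract identifies as the source of poor minority-class performance.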
