Keratin protein property based classification of mammals and non-mammals using machine learning techniques

Keratin is present in most vertebrates and invertebrates and has several important cellular and extracellular functions related to survival and protection. Because keratin function has played a significant role in natural selection, the protein can serve as a marker of evolution, and much information about an organism and its evolution can be obtained by investigating it. In the present study, keratin sequences were extracted from public data repositories, and a range of sequential, structural and physicochemical properties was computed to build the dataset. The dataset comprised two classes, mammals (Class-1) and non-mammals (Class-0), and was subjected to rigorous classification analysis. To reduce the complexity of the dataset, which contained 56 parameters, and to improve accuracy, features were selected using the t-statistic. The 20 best features (parameters) were retained for classification with a set of computational algorithms: SVM, KNN, Neural Network, Logistic Regression, Meta-modeling, Tree Induction, Rule Induction, Discriminant Analysis and Bayesian Modeling. Statistical methods were used to evaluate the output. Logistic regression was the most effective classifier, achieving greater than 96% accuracy under 10-fold cross-validation. The KNN, SVM and Rule Induction algorithms were also found to be effective.
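
The t-statistic feature selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which form of the t-statistic was used, so Welch's unequal-variance form is assumed here, and the data are synthetic stand-ins rather than real keratin properties. The names `welch_t`, `top_k_features` and `informative` are hypothetical and introduced only for illustration.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (assumed form; unequal variances allowed)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_k_features(X, y, k):
    """Rank feature columns by |t| between Class-1 and Class-0; return the top-k column indices."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(pos, neg)), j))
    scores.sort(reverse=True)  # largest |t| first
    return [j for _, j in scores[:k]]


# Synthetic stand-in data: 56 features per sample, as in the abstract.
# Only three hypothetical columns actually differ between the two classes.
rng = random.Random(0)
n_per_class, n_features = 50, 56
informative = {3, 17, 42}

X, y = [], []
for label in (0, 1):
    for _ in range(n_per_class):
        row = [
            rng.gauss(3.0 * label if j in informative else 0.0, 1.0)
            for j in range(n_features)
        ]
        X.append(row)
        y.append(label)

# Keep the 20 highest-ranked features, mirroring the count used in the study.
selected = top_k_features(X, y, k=20)
```

The `selected` indices would then be used to subset the feature matrix before passing it to a classifier such as logistic regression and scoring it with 10-fold cross-validation.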
