Virulence factor prediction in Streptococcus pyogenes using classification and clustering based on microarray data

Publicly available genomic data, such as gene expression data from microarrays, contain valuable biological information. To narrow down the large space of possible wet-lab experiments, global high-throughput data and existing knowledge should first be used to infer biological knowledge and generate biological hypotheses. Here, based on microarray data, we propose using widely adopted clustering and classification methods, implemented in freely available software, to predict whether proteins encoded by genes of the pathogen Streptococcus pyogenes participate in virulence mechanisms. Confidence in the predictions is based on the classification errors obtained for known genes and on repeated prediction by more than three methods. Special emphasis is placed on the nonlinear kernel classification methods used. We propose a list of candidates that could be virulence factors or could participate in the virulence process of S. pyogenes. Biological validation should start from this list, since the candidates behave similarly to known virulence factors.
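
The two confidence criteria mentioned above (classification error on known genes, and agreement among more than three methods) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the method labels, the gene identifiers, and the choice to exclude already-known virulence factors from the candidate list are all assumptions made for illustration.

```python
from collections import Counter


def error_rate(predicted, known_positive, known_negative):
    """Fraction of labeled genes that one method misclassifies.

    predicted: set of gene ids the method flags as virulence-related.
    known_positive / known_negative: sets of genes with known status.
    """
    labeled = known_positive | known_negative
    if not labeled:
        raise ValueError("at least one labeled gene is required")
    wrong = sum((g in predicted) != (g in known_positive) for g in labeled)
    return wrong / len(labeled)


def consensus_candidates(predictions, known_positive, min_votes=4):
    """Genes flagged by at least `min_votes` methods.

    "More than three methods" is read here as min_votes=4 (an assumption).
    Known virulence factors are dropped so that only new candidates remain.
    """
    votes = Counter()
    for genes in predictions.values():
        votes.update(set(genes))
    candidates = [
        g for g, n in votes.items()
        if n >= min_votes and g not in known_positive
    ]
    # Most-supported candidates first, ties broken alphabetically.
    return sorted(candidates, key=lambda g: (-votes[g], g))


# Hypothetical example: five methods, four genes.
predictions = {
    "method_a": {"g1", "g2", "g3"},
    "method_b": {"g1", "g2"},
    "method_c": {"g1", "g2", "g4"},
    "method_d": {"g1", "g2"},
    "method_e": {"g2", "g3"},
}
known_positive = {"g1"}   # assumed known virulence factor
known_negative = {"g4"}   # assumed known non-virulence gene

candidates = consensus_candidates(predictions, known_positive)
```

In this toy setting, a method's error rate on the labeled genes indicates how far its predictions can be trusted, while the vote count indicates how consistently a gene is predicted across methods.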
