Fuzzy clustering of CPP family in plants with evolution and interaction analyses

Background

Transcription factors have been studied intensively because they play an important role in regulating gene expression. However, the CPP (cysteine-rich polycomb-like protein) family has received far less attention than other transcription factor families, even though its members occur in a broad spectrum of species, from plants to animals. A total of 111 CPP transcription factors are known from 16 plant species, but only 2 of them have been studied so far: TSO1 in Arabidopsis thaliana and CPP1 in soybean.

Methods

To study their functions, we applied the fuzzy clustering method to all plant CPP transcription factors. Each protein sequence is encoded as a feature vector built from short peptides and from combinations of functional domain models.

Results and conclusions

The fuzzy clustering method groups all plant CPP transcription factors into two subfamilies. A systems approach, comprising Expressed Sequence Tag analysis, evolutionary analysis, protein-protein interaction network analysis and co-expression analysis, is employed to validate the clustering results. The validation also indicates that transcription factors from different subfamilies show uncorrelated responses.
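As a rough illustration of the clustering step described in the Methods, the sketch below encodes each sequence by its dipeptide frequencies and runs a plain fuzzy c-means on those vectors. It is a minimal sketch under stated assumptions, not the authors' pipeline: the peptide length (k = 2), the fuzzifier (m = 2), the Euclidean distance, the function names `kmer_features` and `fuzzy_c_means`, and the four toy sequences are all hypothetical choices made here, and the functional domain model features mentioned in the abstract are not reproduced.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_features(seq, k=2):
    """Return the normalized counts of all length-k peptides in seq."""
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # peptides with non-standard letters are skipped
        if j is not None:
            vec[j] += 1
    total = vec.sum()
    return vec / total if total > 0 else vec


def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means; returns (membership matrix U, cluster centers)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)  # each row of memberships sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers


# Made-up toy sequences (not real CPP proteins): two cysteine-rich, two not.
seqs = [
    "CKCKKSCCKCNKCE",
    "CKCKRSCCKCSKCE",
    "ALLAGLVALLAGLA",
    "ALLAGIVALLAGLA",
]
X = np.array([kmer_features(s) for s in seqs])
U, centers = fuzzy_c_means(X, c=2)
labels = U.argmax(axis=1)  # hard assignment derived from the fuzzy memberships
print(np.round(U, 3))
print(labels)
```

Unlike a hard partition, the membership matrix `U` gives each sequence a degree of belonging to every cluster, so borderline family members can be identified before a subfamily label is assigned.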
