Community detection in sequence similarity networks based on attribute clustering

Networks are powerful tools for the presentation and analysis of interactions in multi-component systems. A commonly studied mesoscopic feature of networks is their community structure, which arises from grouping similar nodes into one community and dissimilar nodes into separate communities. Here, the community structure of protein sequence similarity networks is determined with a new method: Attribute Clustering Dependent Communities (ACDC). Sequence similarity has typically been quantified by the alignment score or its expectation value. However, pairwise alignments with the same score or expectation value cannot be distinguished by these measures alone. To overcome this deficiency, the method constructs, for each pairwise alignment, an extended alignment metric, the link attribute vector, which includes the score and other alignment characteristics. Rescaling the components of the attribute vectors reveals, qualitatively, a systematic variation of sequence similarity within protein superfamilies. The problem of community detection is then mapped to three steps: clustering the link attribute vectors, selecting an optimal subset of links, and refining the community structure based on the partition density of the network. ACDC-predicted communities are found to be in good agreement with gold standard sequence databases for which the “ground truth” community structures (or families) are known. ACDC is therefore a community detection method for sequence similarity networks based entirely on pair similarity information. A serial implementation of ACDC is available from https://cmb.ornl.gov/resources/developments
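
As an illustration of the partition density mentioned above, the following is a minimal sketch of the quantity as it is commonly defined in the link-community literature: D = (2/M) Σ_c m_c (m_c − (n_c − 1)) / ((n_c − 2)(n_c − 1)), where community c contains m_c links and n_c nodes, and M is the total number of links. Whether ACDC uses exactly this form is an assumption; the function name `partition_density` and the input layout (a dictionary from community label to a list of links) are hypothetical and are not taken from the ACDC implementation.

```python
def partition_density(link_communities):
    """Partition density of a link partition (standard link-community form).

    link_communities: dict mapping a community label to a list of
    (node_u, node_v) links. This layout is an illustrative assumption,
    not the ACDC data format.
    """
    total_links = sum(len(links) for links in link_communities.values())
    if total_links == 0:
        return 0.0

    weighted_sum = 0.0
    for links in link_communities.values():
        m = len(links)                                   # links in this community
        n = len({node for link in links for node in link})  # nodes they touch
        if n <= 2:
            # A community spanning only two nodes contributes zero by convention.
            continue
        # m * (link density normalised between a tree and a clique)
        weighted_sum += m * (m - (n - 1)) / ((n - 2) * (n - 1))

    return 2.0 * weighted_sum / total_links


# Hypothetical toy partition: one triangle (clique) and one 3-node path (tree).
toy = {
    "c1": [("a", "b"), ("b", "c"), ("a", "c")],
    "c2": [("d", "e"), ("e", "f")],
}
density = partition_density(toy)
```

A clique-like community contributes its full link count to the sum and a tree-like community contributes nothing, so higher values indicate link clusters that are internally dense.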
