Unraveling the Complexities of Life Sciences Data

The life sciences have entered the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab (kolkerlab.org) is creating partnerships that identify data challenges and address community needs. We specialize in solutions to complex biological data challenges, as exemplified by the community resource MOPED (Model Organism Protein Expression Database, MOPED.proteinspire.org) and the analysis pipeline SPIRE (Systematic Protein Investigative Research Environment, PROTEINSPIRE.org). Our collaborative work extends to the computationally intensive analysis and visualization of millions of protein sequences, through innovative implementations of sequence alignment algorithms and the creation of the Protein Sequence Universe (PSU) tool. Looking ahead, our lab and its collaborators are pursuing the integration of multi-omics data, the exploration of biological pathways, the assignment of function to proteins, and the porting of solutions to the cloud. Big data have come to the life sciences; discovering the knowledge in the data will bring breakthroughs and benefits.
