Big Data Bioinformatics

Recent technological advances allow for high-throughput profiling of biological systems in a cost-efficient manner. The low cost of data generation is leading us to the “big data” era. The availability of big data provides unprecedented opportunities but also raises new challenges for data mining and analysis. In this review, we introduce key concepts in the analysis of big data, including “machine learning” algorithms, with “unsupervised” and “supervised” examples of each. We note packages for the R programming language that are available to perform machine learning analyses. In addition to programming-based solutions, we review webservers that allow users with limited or no programming background to perform these analyses on large data compendia. J. Cell. Physiol. 229: 1896–1900, 2014. © 2014 Wiley Periodicals, Inc.
