SeqHBase: a big data toolset for family based sequencing data analysis

Background: Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. Processing such data can be a significant challenge, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset that efficiently manipulates genome-wide variants, functional annotations and coverage, and that supports family-based sequencing data analysis.

Methods: Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family-based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation).

Results: We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis time scaled almost linearly with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyse WES familial data and approximately 1 minute to analyse WGS familial data.

Conclusions: These results demonstrate SeqHBase's high efficiency and scalability, which are necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders.
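The three inheritance patterns named in the Methods can be illustrated with a minimal Python sketch for a single parent-child trio. This is not SeqHBase's implementation, which runs on Hadoop and HBase and also uses per-site coverage from BAM files. The genotype encoding (count of alternate alleles), the function names, and the gene labels are all hypothetical and chosen only for illustration, and the sketch assumes every genotype call is reliable.

```python
from typing import List, Tuple

# Assumed encoding: number of alternate alleles at a site.
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
Call = Tuple[str, int, int, int]  # (gene, child, father, mother); hypothetical layout


def is_de_novo(child: int, father: int, mother: int) -> bool:
    """Child carries the variant; neither parent does."""
    return child >= 1 and father == 0 and mother == 0


def is_inherited_homozygous(child: int, father: int, mother: int) -> bool:
    """Child is homozygous alternate; each parent carries at least one copy."""
    return child == 2 and father >= 1 and mother >= 1


def compound_het_genes(calls: List[Call]) -> List[str]:
    """Genes in which the child has one heterozygous variant inherited only
    from the father and another inherited only from the mother."""
    paternal, maternal = set(), set()
    for gene, child, father, mother in calls:
        if child != 1:
            continue  # only heterozygous child genotypes qualify
        if father >= 1 and mother == 0:
            paternal.add(gene)
        elif mother >= 1 and father == 0:
            maternal.add(gene)
    return sorted(paternal & maternal)


# Hypothetical trio calls: GENE_A has one paternal and one maternal variant.
example_calls: List[Call] = [
    ("GENE_A", 1, 1, 0),
    ("GENE_A", 1, 0, 1),
    ("GENE_B", 1, 1, 0),
]
```

In practice, a parent's apparent homozygous-reference genotype is only trustworthy when that site is adequately covered, which is why coverage at every site is one of the inputs listed above.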
