Seqfam: A Python package for analysis of next generation sequencing DNA data in families

This article introduces seqfam, a Python package designed primarily for analysing next generation sequencing (NGS) DNA data from families with known pedigree information, in order to identify rare variants that are potentially causal for a disease or trait of interest. It uses the popular and versatile Pandas library, and can be integrated straightforwardly into existing analysis code and pipelines. Seqfam can be used to verify pedigree information, to perform Monte Carlo gene dropping, to undertake regression-based gene burden testing, and to identify variants which segregate by affection status in families via user-defined pattern of occurrence rules. Additionally, it can generate scripts for running analyses in a “MapReduce pattern” on a computer cluster, which is usually desirable in NGS data analysis and in “big data” analysis generally. This article summarises how seqfam’s main user functions work and motivates their use. It also provides explanatory context for the example scripts and data included in the package, which demonstrate use cases. With respect to verifying pedigree information, software already exists for efficiently calculating kinship coefficients, so seqfam performs the necessary extra steps of mapping pedigrees to expected degrees of relationship and kinship coefficients to observed degrees of relationship. Gene dropping and the application of variant pattern of occurrence rules in families can provide evidence that a variant is causal. The authors are unaware of other software which performs these tasks in familial cohorts, so seqfam fulfils this need. Gene burden tests, rather than single-marker tests, are often used to detect rare causal variants because of their greater power. Seqfam may be an attractive alternative to existing gene burden testing software because of its flexibility, particularly in grouping and aggregating variants.
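The pedigree verification step described above, mapping kinship coefficients to observed degrees of relationship and comparing them with the degrees expected from the pedigree, can be illustrated with a minimal Pandas sketch. This is not seqfam's API: the function name `kinship_to_degree`, the column names and the example values are all hypothetical, and the cut-offs are assumed to be the powers-of-two thresholds commonly used for kinship-based relationship inference (about 0.354, 0.177, 0.0884 and 0.0442).

```python
import numpy as np
import pandas as pd


def kinship_to_degree(kinship: pd.Series) -> pd.Series:
    """Map kinship coefficients to an observed degree of relationship.

    Assumption: bin edges are 2**-1.5, 2**-2.5, 2**-3.5 and 2**-4.5,
    i.e. roughly 0.354, 0.177, 0.0884 and 0.0442.
    """
    bins = [-np.inf, 2 ** -4.5, 2 ** -3.5, 2 ** -2.5, 2 ** -1.5, np.inf]
    labels = ["unrelated", "3rd", "2nd", "1st", "duplicate/MZ twin"]
    # right=False makes each bin closed on the left: [lower, upper)
    return pd.cut(kinship, bins=bins, labels=labels, right=False).astype(str)


# Hypothetical input: one row per sample pair, with the kinship coefficient
# from an external tool and the degree expected from the recorded pedigree.
pairs = pd.DataFrame({
    "sample_1": ["A", "A", "B", "C"],
    "sample_2": ["B", "C", "C", "D"],
    "kinship": [0.25, 0.12, 0.01, 0.26],
    "expected_degree": ["1st", "2nd", "unrelated", "unrelated"],
})

pairs["observed_degree"] = kinship_to_degree(pairs["kinship"])
# Pairs whose observed degree disagrees with the pedigree need follow-up
mismatches = pairs[pairs["observed_degree"] != pairs["expected_degree"]]
print(mismatches[["sample_1", "sample_2", "expected_degree", "observed_degree"]])
```

In this toy data the pair C and D is recorded as unrelated but has a kinship coefficient consistent with a first-degree relationship, which is the kind of discrepancy such a check is meant to surface.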
