A Boolean Algebra for Genetic Variants

Beyond identifying genetic variants, we introduce a set of Boolean relations that comprehensively classifies the relation between every pair of variants by taking all minimal alignments into account. We present an efficient algorithm to compute these relations, including a novel way of computing all minimal alignments within the best theoretical complexity bounds. We show that for variants of the CFTR gene in dbSNP these relations are common and that many are non-trivial. Finally, we present an approach for storing and indexing variants in a database that enables efficient querying for all of these relations.
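To make the idea of identifying variants concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes a variant can be modelled as a single deletion/insertion on a reference string. The names `Variant`, `apply_variant`, `simple_edit_distance`, and `are_equivalent` are hypothetical, and the distance uses the textbook quadratic longest-common-subsequence recurrence rather than the efficient method the abstract refers to.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical representation: replace reference[start:end] by `inserted`."""
    start: int      # 0-based, inclusive
    end: int        # 0-based, exclusive
    inserted: str   # sequence placed at the deleted range (may be empty)


def apply_variant(reference: str, variant: Variant) -> str:
    """Return the observed sequence obtained by applying the variant."""
    return reference[:variant.start] + variant.inserted + reference[variant.end:]


def simple_edit_distance(a: str, b: str) -> int:
    """Insertion/deletion-only edit distance via the classic LCS table.

    Assumption: this is the standard quadratic recurrence, used here only
    for illustration.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs_length = prev[-1]
    return len(a) + len(b) - 2 * lcs_length


def are_equivalent(reference: str, v1: Variant, v2: Variant) -> bool:
    """Two descriptions are treated as equivalent if they yield the same observed sequence."""
    return apply_variant(reference, v1) == apply_variant(reference, v2)


reference = "CATATATCG"
first = Variant(1, 3, "")    # delete "AT" at positions 1..2
second = Variant(3, 5, "")   # delete "AT" at positions 3..4
print(are_equivalent(reference, first, second))
print(simple_edit_distance(reference, apply_variant(reference, first)))
```

The two deletions are written differently but produce the same observed sequence, which is why a comparison based on the resulting sequences, rather than on the textual descriptions, is the natural starting point for the relations described above.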
