An Efficient Search Algorithm for Finding Genomic-Range Overlaps Based on the Maximum Range Length

Efficient search algorithms for finding genomic-range overlaps are essential for various bioinformatics applications. Most fast algorithms for finding the overlaps between a query range (e.g., a genomic variant) and a set of N reference ranges (e.g., exons) have a time complexity of O(k + log N), where k denotes a term related to the length and location of the reference ranges. Here, we present a simple but efficient algorithm that reduces k, based on the maximum reference range length. Specifically, for a given query range and the maximum reference range length, the proposed method divides the reference range set into three subsets: always, potentially, and never overlapping. Search effort can therefore be reduced by excluding the never-overlapping subset. We demonstrate that the running time of the proposed algorithm is proportional to the size of the potentially overlapping subset, which in turn is proportional to the maximum reference range length if all other conditions are the same. Moreover, an implementation of our algorithm was 13.8 to 30.0 percent faster than one of the fastest range search methods available when tested on various genomic-range data sets. The proposed algorithm has been incorporated into a disease-linked variant prioritization pipeline for WGS (http://gnome.tchlab.org), and its implementation is available at http://ml.ssu.ac.kr/gSearch.
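
The three-way partition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation; it assumes closed integer ranges given as (start, end) tuples, a reference set sorted by start position, and a range length defined as end - start. The names build_index and find_overlaps are hypothetical. Under these assumptions, a reference that starts more than the maximum length before the query start, or after the query end, can never overlap; a reference that starts inside the query always overlaps; only the references that start within one maximum length before the query start need an explicit check of their end position.

```python
from bisect import bisect_left, bisect_right

def build_index(ranges):
    """Sort closed (start, end) ranges by start and record the maximum length."""
    refs = sorted(ranges)
    starts = [s for s, _ in refs]
    max_len = max((e - s for s, e in refs), default=0)
    return refs, starts, max_len

def find_overlaps(index, q_start, q_end):
    """Return the reference ranges overlapping the closed query [q_start, q_end]."""
    refs, starts, max_len = index
    # References starting before q_start - max_len end before q_start: never overlap.
    lo = bisect_left(starts, q_start - max_len)
    # References in [lo, mid) start before the query: potentially overlap.
    mid = bisect_left(starts, q_start)
    # References in [mid, hi) start inside the query: always overlap.
    # References from hi onward start after q_end: never overlap.
    hi = bisect_right(starts, q_end)
    hits = [r for r in refs[lo:mid] if r[1] >= q_start]
    hits.extend(refs[mid:hi])
    return hits

index = build_index([(1, 5), (3, 4), (10, 20), (15, 18), (30, 31)])
print(find_overlaps(index, 5, 12))
```

In this sketch the three binary searches account for the logarithmic term, and the only linear scan is over the potentially overlapping slice, whose size grows with the maximum reference range length.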
