Indexing a Dictionary for Subset Matching Queries

We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ. Our goal is to preprocess D so that, given a query pattern p in which each location contains a single character from Σ, we can answer whether p matches D. We say that p matches D if there is some s ∈ D such that |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|. To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^k n lg(min{n, m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer, and the second uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^k n + |Σ|^(k/2) n lg(min{n, m})) preprocessing time and O(|p| lg lg |Σ| + min{|p|, lg(|Σ|^k n)} lg lg(|Σ|^k n)) query time by cutting the dictionary strings and constructing two compressed tries. Our problem is motivated by haplotype inference from a library of genotypes [14,17]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D, and gathering statistical information about them, can be used to accelerate various haplotype inference algorithms. These include algorithms based on the "pure parsimony criteria" [13,16], greedy heuristics such as "Clark's rule" [6,18], EM-based algorithms [1,11,12,20,26,30], and algorithms for inferring haplotypes from a set of trios [4,27].
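
To make the matching definition concrete, the following is a minimal Python sketch of the naive baseline: it enumerates every pattern generated by each dictionary string and inserts it into an ordinary (uncompressed) trie, so a query walks at most |p| edges. It is not the divide-and-conquer or fingerprinting construction described above, and it builds the trie in time proportional to the total length of all generated patterns rather than within the stated preprocessing bounds. The names `build_trie`, `matches`, and `END` are hypothetical, chosen only for illustration, and each dictionary string is assumed to be given as a list of Python sets of single characters.

```python
from itertools import product

# Sentinel key marking the end of a complete pattern (hypothetical helper).
END = object()

def build_trie(dictionary):
    """Insert every pattern generated by each dictionary string into a nested-dict trie.

    Each dictionary string is a list of sets; a string with at most k
    non-singleton positions generates at most |Sigma|^k patterns.
    """
    root = {}
    for s in dictionary:
        # product(*s) picks one character from the set at each position.
        for pattern in product(*s):
            node = root
            for ch in pattern:
                node = node.setdefault(ch, {})
            node[END] = True
    return root

def matches(trie, p):
    """Return True if p matches the dictionary: some s has |s| = |p| and p[i] in s[i]."""
    node = trie
    for ch in p:
        node = node.get(ch)
        if node is None:
            return False
    # The walk must end exactly at the end of a generated pattern.
    return END in node

# Toy genotype library over the alphabet {0, 1} (made-up data for illustration).
D = [
    [{"0"}, {"0", "1"}, {"1"}],   # generates 001 and 011
    [{"1"}, {"1"}, {"0", "1"}],   # generates 110 and 111
]
trie = build_trie(D)
print(matches(trie, "011"))  # True
print(matches(trie, "010"))  # False
print(matches(trie, "01"))   # False: length differs from every string in D
```

With hash-based child lookup, each query step takes expected constant time, which gives the O(|p|) query behaviour in expectation; the compressed trie in the text reduces the space and construction cost that this sketch ignores.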

[1]  Arnold L. Rosenberg,et al.  Rapid identification of repeated patterns in strings, trees and arrays , 1972, STOC.

[2]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[3]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[4]  Peter van Emde Boas,et al.  Preserving Order in a Forest in Less Than Logarithmic Time and Linear Space , 1977, Inf. Process. Lett.

[5]  Robert E. Tarjan,et al.  Fast Algorithms for Finding Nearest Common Ancestors , 1984, SIAM J. Comput.

[6]  A. Clark,et al.  Inference of haplotypes from PCR-amplified samples of diploid populations. , 1990, Molecular biology and evolution.

[7]  L. Excoffier,et al.  Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. , 1995, Molecular biology and evolution.

[8]  K. Kidd,et al.  HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. , 1995, The Journal of heredity.

[9]  J. Long,et al.  An E-M algorithm and testing strategy for multiple-locus haplotypes. , 1995, American journal of human genetics.

[10]  Gad M. Landau,et al.  Parallel Suffix-Prefix-Matching Algorithm and Applications , 1996, SIAM J. Comput.

[11]  Piotr Indyk,et al.  Faster algorithms for string matching problems: matching the convolution bound , 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No.98CB36280).

[12]  S. Muthukrishnan,et al.  On the sorting-complexity of suffix tree construction , 2000, JACM.

[13]  N. Schork,et al.  Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. , 2000, American journal of human genetics.

[14]  L. Helmuth Genome research: map of the human genome 3.0. , 2001, Science.

[15]  L. Helmuth Map of the Human Genome 3.0 , 2001, Science.

[16]  Torben Hagerup Simpler and Faster Dictionaries on the AC0 RAM , 1998, ICALP.

[17]  Andrew G. Clark,et al.  Computational Methods for SNPs and Haplotype Inference , 2002, Lecture Notes in Computer Science.

[18]  Shibu Yooseph,et al.  A Survey of Computational Methods for Determining Haplotypes , 2002, Computational Methods for SNPs and Haplotype Inference.

[19]  Peisen Zhang,et al.  Optimal Step Length EM Algorithm (OSLEM) for the estimation of haplotype frequency and its application in lipoprotein lipase genotyping , 2002, BMC Bioinformatics.

[20]  M. Crochemore,et al.  On-line construction of suffix trees , 2002 .

[21]  Richard Cole,et al.  Verifying candidate matches in sparse and wildcard matching , 2002, STOC '02.

[22]  Giorgio Satta,et al.  Efficient text fingerprinting via Parikh mapping , 2003, J. Discrete Algorithms.

[23]  Dan Gusfield,et al.  Haplotype Inference by Pure Parsimony , 2003, CPM.

[24]  Richard M. Karp,et al.  The minimum-entropy set cover problem , 2005, Theor. Comput. Sci..

[25]  Richard Cole,et al.  Dictionary matching and indexing with errors and don't cares , 2004, STOC '04.

[26]  Gad M. Landau,et al.  Parallel construction of a suffix tree with applications , 1988, Algorithmica.

[27]  Joseph JáJá,et al.  Novel Transformation Techniques Using Q-Heaps with Applications to Computational Geometry , 2005, SIAM J. Comput.

[28]  Heikki Mannila,et al.  A Hidden Markov Technique for Haplotype Reconstruction , 2005, WABI.

[29]  Leonidas J. Guibas,et al.  Fractional cascading: I. A data structuring technique , 1986, Algorithmica.

[30]  Alex Zelikovsky,et al.  Phasing and Missing Data Recovery in Family Trios , 2005, International Conference on Computational Science.

[31]  Alexander Russell,et al.  Minimum Multicolored Subgraph Problem in Multiplex PCR Primer Set Selection and Population Haplotyping , 2006, International Conference on Computational Science.

[32]  Zhaohui S. Qin,et al.  A comparison of phasing algorithms for trios and unrelated individuals. , 2006, American journal of human genetics.

[33]  Peter Sanders,et al.  Linear work suffix array construction , 2006, JACM.

[34]  Esko Ukkonen,et al.  Haplotype Inference Via Hierarchical Genotype Parsing , 2007, WABI.

[35]  Jens Stoye,et al.  Character sets of strings , 2007, J. Discrete Algorithms.

[36]  Mathieu Raffinot,et al.  New algorithms for text fingerprinting , 2008, J. Discrete Algorithms.