Subsequence Combinatorics and Applications to Microarray Production, DNA Sequencing and Chaining Algorithms

We investigate combinatorial enumeration problems related to subsequences of strings; in contrast to substrings, subsequences need not be contiguous. For a finite alphabet Σ, the following three problems are solved. (1) Number of distinct subsequences: Given a sequence s ∈Σn and a nonnegative integer k ≤n, how many distinct subsequences of length k does s contain? A previous result by Chase states that this number is maximized by choosing s as a repeated permutation of the alphabet. This has applications in DNA microarray production. (2) Number of ρ-restricted ρ-generated sequences: Given s ∈Σn and integers k ≥1 and ρ≥1, how many distinct sequences in Σk contain no single nucleotide repeat longer than ρ and can be written as $s_1^{r_1}\dots s_n^{r_n}$ with 0≤ri ≤ρ for all i? For ρ= ∞, the question becomes how many length-k sequences match the regular expression s1*s2* ...sn*. These considerations allow a detailed analysis of a new DNA sequencing technology (“454 sequencing”). (3) Exact length distribution of the longest increasing subsequence: Given Σ= {1,...,K} and an integer n ≥1, determine the number of sequences in Σn whose longest strictly increasing subsequence has length k, where 0 ≤k ≤K. This has applications to significance computations for chaining algorithms.