Paper information - Constructing Antidictionaries in Output-Sensitive Space

Constructing Antidictionaries in Output-Sensitive Space

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1, y_2, ..., y_k over an alphabet Σ, we are asked to compute the set M^ℓ_{y_1#...#y_k} of minimal absent words of length at most ℓ of the word y = y_1#y_2#...#y_k, where # ∉ Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n = |y| using any of the many available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_{y_1#...#y_N}|| = o(n) for all N ∈ [1, k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12_{y_1#...#y_k}|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all of M^ℓ_{y_1}, ..., M^ℓ_{y_1#...#y_k} can be computed in O(kn + ∑_{N=1}^{k} ||M^ℓ_{y_1#...#y_N}||) total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in y_1, ..., y_k and MaxOut = max{||M^ℓ_{y_1#...#y_N}|| : N ∈ [1, k]}. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
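To make the definition of a minimal absent word concrete, the following is a minimal brute-force sketch. It is not the output-sensitive algorithm of the paper: it stores every factor of length at most ℓ, so it is only suitable for tiny inputs. The function names `factors_up_to` and `minimal_absent_words` are hypothetical, and the separator # is modeled by passing the words y_1, ..., y_k as a list, on the assumption that a word over Σ occurs in y_1#...#y_k exactly when it occurs in some y_i.

```python
def factors_up_to(words, max_len):
    """Return all factors (substrings) of length <= max_len of the given
    words, plus the empty word, which occurs in every word."""
    occ = {""}
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                occ.add(w[i:j])
    return occ


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force set of minimal absent words of length <= max_len.

    A word x is reported if x occurs in none of the words, while its
    longest proper prefix and its longest proper suffix both occur.
    Every proper factor of x lies inside one of those two, so this
    matches the definition "all proper factors occur".
    """
    occ = factors_up_to(words, max_len)
    maws = set()
    for prefix in occ:                 # longest proper prefix must occur
        if len(prefix) >= max_len:
            continue
        for c in alphabet:
            x = prefix + c
            # x must be absent, and its longest proper suffix must occur
            if x not in occ and x[1:] in occ:
                maws.add(x)
    return maws


# Single word y = "abaab" over the alphabet {a, b}, with length bound 4
print(sorted(minimal_absent_words(["abaab"], "ab", 4)))
# ['aaa', 'aaba', 'bab', 'bb']
```

Calling the function on the prefixes `words[:1]`, `words[:2]`, and so on yields the sets M^ℓ_{y_1}, ..., M^ℓ_{y_1#...#y_k} one after another, although this sketch recomputes each set from scratch rather than updating it incrementally.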

Solon P. Pissis | Gabriele Fici | Golnaz Badkobeh | Lorraine A. K. Ayad | Alice Héliou
