Multivariate Analytic Combinatorics for Cost Constrained Channels and Subsequence Enumeration

Analytic combinatorics in several variables is a powerful tool for deriving the asymptotic behavior of combinatorial quantities by analyzing multivariate generating functions. We study information-theoretic questions about sequences in a discrete noiseless channel under cost and forbidden substring constraints. Our main contributions involve the relationship between the graph structure of the channel and the singularities of the bivariate generating function whose coefficients are the number of sequences satisfying the constraints. We combine these new results with methods from multivariate analytic combinatorics to answer questions in several application areas. For example, we determine the optimal coded synthesis rate for DNA data storage when the synthesis supersequence is any periodic string. This follows from a precise characterization of the number of subsequences of an arbitrary periodic string. Along the way, we provide a new proof of the equivalence of the combinatorial and probabilistic definitions of the cost-constrained capacity, and we show that the cost-constrained channel capacity is determined by a cost-dependent singularity, generalizing Shannon’s classical result for unconstrained capacity.

Institute for Communications Engineering, Technical University of Munich, Germany, andreas.lenz@mytum.de. AL has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 801434).

Department of Combinatorics & Optimization, University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada, smelczer@uwaterloo.ca. SM was supported by an NSERC Discovery Grant.

Google Research, cyroid@google.com. Work partially completed while at UCSD.

Department of Electrical and Computer Engineering, University of California, San Diego, psiegel@ucsd.edu

arXiv:2111.06105v2 [cs.IT] 14 Nov 2021
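To make the subsequence-counting quantity from the abstract concrete, the following is a minimal sketch that counts the distinct subsequences of each length of a finite string, for example a truncated periodic synthesis supersequence. It uses a standard dynamic program and is not the analytic method of the paper; the function name `count_subsequences_by_length` and the example string are assumptions made purely for illustration.

```python
def count_subsequences_by_length(s):
    """Return a list c where c[k] is the number of DISTINCT
    subsequences of s that have length k (0 <= k <= len(s)).

    Illustrative dynamic program (an assumption of this sketch,
    not the paper's method):
      dp[i][k] = number of distinct length-k subsequences of s[:i].
    Appending character s[i-1] extends every length-(k-1) subsequence
    of s[:i-1]; those that were already produced at the previous
    occurrence of the same character are subtracted to avoid
    double counting.
    """
    n = len(s)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    dp[0][0] = 1          # the empty subsequence
    last = {}             # character -> 1-based index of its previous occurrence
    for i in range(1, n + 1):
        ch = s[i - 1]
        dp[i][0] = 1
        for k in range(1, i + 1):
            dp[i][k] = dp[i - 1][k] + dp[i - 1][k - 1]
            if ch in last:
                dp[i][k] -= dp[last[ch] - 1][k - 1]
        last[ch] = i
    return dp[n]


if __name__ == "__main__":
    # Two periods of the hypothetical periodic supersequence ACGTACGT...
    print(count_subsequences_by_length("ACGT" * 2))
```

Read as a table indexed by the supersequence length n and the subsequence length k, these counts are the kind of coefficients a bivariate generating function encodes. The sketch only enumerates them for small inputs and says nothing about their asymptotic growth.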
