Transducing Markov sequences

A Markov sequence is a basic statistical model for representing uncertain sequential data, and it is used in many applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximation of that order. In particular, it is shown that enumeration in any subexponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Because of this hardness, special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.
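To make the first problem concrete, the following is a minimal sketch of confidence computation in one easy setting. It is an illustration only and should not be read as the algorithm analyzed in the paper. It assumes a time-homogeneous Markov sequence and a deterministic transducer whose transitions each emit a (possibly empty) string. The function name `answer_confidence`, its parameters, and the toy chain and transducer are all hypothetical. The computation is a forward dynamic program over triples (last symbol, transducer state, number of answer characters emitted so far).

```python
from collections import defaultdict


def answer_confidence(init, trans, length, delta, start, accepting, answer):
    """Probability that a deterministic transducer emits `answer` on a random
    string of the given length drawn from a time-homogeneous Markov sequence.

    init[s]       : probability that the first symbol is s
    trans[s][t]   : probability that symbol t follows symbol s
    delta[(q, s)] : (next_state, emitted_string); a missing key means reject
    All of these names are assumptions made for this sketch.
    """
    def step(q, j, s):
        # Advance the transducer on symbol s; j counts matched output chars.
        if (q, s) not in delta:
            return None
        q2, out = delta[(q, s)]
        if answer[j:j + len(out)] != out:
            return None  # emitted text disagrees with the target answer
        return q2, j + len(out)

    # layer[(s, q, j)] = probability mass of prefixes that end in symbol s,
    # leave the transducer in state q, and have emitted answer[:j].
    layer = defaultdict(float)
    for s, p in init.items():
        nxt = step(start, 0, s)
        if nxt is not None and p > 0:
            layer[(s, *nxt)] += p

    for _ in range(length - 1):
        new_layer = defaultdict(float)
        for (s, q, j), p in layer.items():
            for t, pt in trans[s].items():
                nxt = step(q, j, t)
                if nxt is not None and pt > 0:
                    new_layer[(t, *nxt)] += p * pt
        layer = new_layer

    return sum(p for (s, q, j), p in layer.items()
               if q in accepting and j == len(answer))


# Toy instance (hypothetical): two symbols, and a one-state transducer that
# emits "x" for every "a" it reads and nothing for "b".
INIT = {"a": 0.5, "b": 0.5}
TRANS = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}
DELTA = {(0, "a"): (0, "x"), (0, "b"): (0, "")}

conf_x = answer_confidence(INIT, TRANS, 2, DELTA, 0, {0}, "x")
```

In this toy instance the answer "x" is produced exactly by the strings "ab" and "ba", so its confidence is the sum of their probabilities. The sketch runs in time polynomial in the sequence length, the alphabet size, the number of transducer states, and the answer length, but only because the transducer is assumed deterministic.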
