Summarizing a set of time series by averaging: From Steiner sequence to compact multiple alignment

Summarizing a set of sequences is an old topic that has been revived in the last decade by the increasing availability of sequential datasets. The definition of a consensus object is at the center of data analysis, since it crystallizes the underlying organization of the data. Dynamic Time Warping (DTW) is currently the most relevant similarity measure between sequences for a wide range of applications, because it captures temporal distortions. In this context, averaging a set of sequences is not a trivial task, since the average sequence has to be consistent with this similarity measure. Steiner theory and several works in computational biology have pointed out the connection between multiple alignments and average sequences. Taking inspiration from these works, we introduce the notion of compact multiple alignment, which allows us to link these theories to the problem of summarizing under time warping. Having defined the link between the multiple alignment and the average sequence, the second part of this article focuses on exploring the space of compact multiple alignments in order to provide an average sequence for a set of sequences. We propose a genetic algorithm based on a specific representation of the genotype inspired by genes. This representation makes it possible to cover the fitness landscape consistently. Experiments carried out on standard datasets show that the proposed approach outperforms existing methods.
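
To make concrete what "capturing temporal distortions" means, the following is a minimal sketch of the textbook DTW recurrence in Python. It is not the method proposed in the article: the function name `dtw_distance` and the squared pointwise cost are assumptions chosen for illustration, and no warping-window constraint is applied.

```python
from math import inf


def dtw_distance(a, b):
    """Illustrative DTW cost between two numeric sequences.

    Assumption: squared difference as the pointwise cost and no
    constraint on the warping path (hypothetical helper, not the
    article's implementation).
    """
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the best of: repeat b[j-1], repeat a[i-1], or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Two sequences with the same shape, one delayed by a repeated sample.
x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 1, 0]
print(dtw_distance(x, y))
```

Because `y` is only a time-stretched copy of `x`, the warping path absorbs the repeated sample. This is also why a pointwise arithmetic mean is generally not a suitable average under DTW: the sequences must first be aligned, which is where multiple alignment enters.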
