A brief survey on sequence classification

Sequence classification has a broad range of applications, including genomic analysis, information retrieval, health informatics, finance, and anomaly detection. Unlike classification on feature vectors, sequence classification must work with data that have no explicit features. Even with sophisticated feature selection techniques, the dimensionality of the potential feature space may remain very high, and the sequential nature of the features is difficult to capture. This makes sequence classification more challenging than classification on feature vectors. In this paper, we present a brief review of existing work on sequence classification. We summarize sequence classification in terms of methodologies and application domains. We also review several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.
