Machine learning methods with time series dependence

Blakeley B. McShane
Advisor: Abraham J. Wyner

We introduce the PrAGMaTiSt (Prediction and Analysis for Generalized Markov Time Series of States), a methodology that enhances classification algorithms so that they can accommodate sequential data. The PrAGMaTiSt can model a wide variety of time series structures, including arbitrary-order Markov chains, generalized and transition-dependent generalized Markov chains, and variable length Markov chains. We subject our method, as well as competitor methods, to a rigorous set of simulations in order to understand its properties. We find that, for very low or very high levels of noise in Y_t | X_t, complexity of Y_t | X_t, or complexity of the time series structure, simple methods that either ignore the time series structure or model it as first-order Markov can perform as well as or better than more complicated models, even when the latter are true; in moderate settings, however, the more complicated models tend to dominate. Furthermore, even with little training data, the more complicated models perform about as well as the simple ones when the latter are true. We also apply the PrAGMaTiSt to the important problem of sleep scoring of mice based on video data. Our procedure provides more accurate differentiation of the NREM and REM sleep states than any previous method in the field. The improvements in REM classification are particularly beneficial, as the dynamics of REM sleep are of special interest to sleep scientists. Furthermore, our procedure provides substantial improvements in capturing the sleep state bout duration distributions relative to other methods.
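The abstract does not reproduce the algorithm, but the basic idea it describes, namely augmenting a per-observation classifier with a Markov model on the state sequence, can be sketched in a few lines. The sketch below is a minimal illustration only: it assumes a first-order Markov structure on the states and combines the classifier's per-time-step probabilities with an estimated transition matrix via Viterbi decoding. The function names and the use of NumPy are assumptions for illustration and are not the PrAGMaTiSt's actual implementation.

```python
import numpy as np

def estimate_transitions(state_seq, n_states):
    """Estimate a first-order Markov transition matrix from a training
    sequence of integer-coded states (add-one smoothing for stability)."""
    counts = np.ones((n_states, n_states))
    for prev, curr in zip(state_seq[:-1], state_seq[1:]):
        counts[prev, curr] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def viterbi_decode(class_probs, trans, init):
    """Combine per-time-step classifier probabilities for Y_t given X_t with a
    first-order Markov prior on the state sequence via Viterbi decoding.

    class_probs: (T, K) array of classifier probabilities for each time step.
    trans:       (K, K) matrix, trans[i, j] ~ P(Y_t = j | Y_{t-1} = i).
    init:        (K,) initial state distribution.
    Returns the highest-scoring state path under this simplified model.
    """
    T, K = class_probs.shape
    log_obs = np.log(class_probs + 1e-12)
    log_trans = np.log(trans + 1e-12)
    delta = np.log(init + 1e-12) + log_obs[0]      # best score ending in each state
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans         # (prev state) x (next state)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):                   # trace back the best path
        path[t - 1] = backptr[t, path[t]]
    return path
```

In this simplified setting, any classifier that outputs class probabilities (e.g., boosted trees or random forests) supplies `class_probs`, while `estimate_transitions` is fit on the training labels; the thesis's generalized, transition-dependent, and variable length Markov structures would replace the single fixed transition matrix used here.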
