Change-point detection in multinomial data with a large number of categories

We consider a sequence of multinomial data in which the category probabilities are subject to abrupt changes of unknown magnitudes at unknown locations. When the number of categories is comparable to or even larger than the number of subjects allocated to these categories, conventional methods such as Pearson’s classical chi-squared test and the deviance test may not work well. Motivated by high-dimensional homogeneity tests, we propose a novel change-point detection procedure that allows the number of categories to tend to infinity. The null distribution of our test statistic is asymptotically normal, and the test performs well in finite samples. The number of change-points is determined by minimizing a penalized objective function based on segmentation, and the locations of the change-points are estimated by minimizing the objective function via dynamic programming. Under mild conditions, we establish the consistency of the multiple change-point estimators. Simulation studies show that the proposed method performs satisfactorily in identifying change-points, in terms of both power and estimation accuracy, and we illustrate it with an analysis of a real data set.
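The two-step scheme described above (choose the number of change-points by a penalized objective, locate them by dynamic programming) can be illustrated with a minimal sketch. The segment cost used here is the ordinary multinomial negative log-likelihood and the penalty is a fixed constant per change-point; both are stand-in assumptions for illustration, not the test statistic or penalty proposed in the paper, and the names `segment_cost` and `detect_change_points` are hypothetical.

```python
import numpy as np

def segment_cost(prefix, s, e):
    """Multinomial negative log-likelihood (up to a constant) of rows s..e-1.

    Assumption: this is a placeholder cost, not the paper's statistic.
    """
    counts = prefix[e] - prefix[s]      # pooled category counts in the segment
    total = counts.sum()
    nz = counts > 0                     # skip empty categories (0 * log 0 = 0)
    return -np.sum(counts[nz] * np.log(counts[nz] / total))

def detect_change_points(X, max_cp, penalty):
    """Return sorted change-point indices for a T x K matrix of counts.

    An index t in the result means a new segment starts at row t (0-based).
    """
    X = np.asarray(X, dtype=float)
    T, K = X.shape
    # prefix[t] holds the category totals of rows 0..t-1
    prefix = np.vstack([np.zeros(K), np.cumsum(X, axis=0)])

    # cost[s, e]: cost of treating rows s..e-1 as one homogeneous segment
    cost = np.full((T + 1, T + 1), np.inf)
    for s in range(T):
        for e in range(s + 1, T + 1):
            cost[s, e] = segment_cost(prefix, s, e)

    # F[m, e]: minimal total cost of rows 0..e-1 with exactly m change-points
    F = np.full((max_cp + 1, T + 1), np.inf)
    back = np.zeros((max_cp + 1, T + 1), dtype=int)
    F[0, 1:] = cost[0, 1:]
    for m in range(1, max_cp + 1):
        for e in range(m + 1, T + 1):
            # the last change-point s ranges over m..e-1
            cand = F[m - 1, m:e] + cost[m:e, e]
            j = int(np.argmin(cand))
            F[m, e] = cand[j]
            back[m, e] = m + j

    # number of change-points: minimize the penalized objective
    scores = F[:, T] + penalty * np.arange(max_cp + 1)
    m_hat = int(np.argmin(scores))

    # locations: backtrack through the dynamic-programming table
    cps, e = [], T
    for m in range(m_hat, 0, -1):
        e = int(back[m, e])
        cps.append(e)
    return sorted(cps)

# Toy usage: 10 rows from one count profile followed by 10 from another.
X_demo = np.array([[8, 2, 0, 0]] * 10 + [[0, 0, 3, 7]] * 10)
cps_demo = detect_change_points(X_demo, max_cp=3, penalty=5.0)
```

This sketch precomputes all segment costs, so it uses O(T^2 K) time and O(T^2) memory; it is meant to show the structure of the optimization rather than to scale. In the regime the abstract targets, where the number of categories is large relative to the counts, a likelihood-based cost of this kind would need to be replaced by a statistic suited to sparse tables.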
