Repeated Measures Analysis with Discrete Data Using the SAS® System

The analysis of correlated data arising from repeated measurements when the measurements are assumed to be multivariate normal has been studied extensively. In many practical problems, however, the normality assumption is not reasonable. When the responses are discrete and correlated, for example, different methodology must be used in the analysis of the data. Generalized Estimating Equations (GEEs) provide a practical method with reasonable statistical efficiency to analyze such data. This paper provides an overview of the use of GEEs in the analysis of correlated data using the SAS System. Emphasis is placed on discrete correlated data, since this is an area of great practical interest.

Introduction

GEEs were introduced by Liang and Zeger (1986) as a method of dealing with correlated data when, except for the correlation among responses, the data can be modeled as a generalized linear model. For example, correlated binary and count data in many cases can be modeled in this way. A SAS macro, written by M. R. Karim at Johns Hopkins University, is available to fit such models by solving GEEs. Work is in progress to add the capability to solve GEEs to the GENMOD procedure in SAS/STAT software. This paper provides an overview of the GEE methodology that will be implemented in the GENMOD procedure. Refer to Diggle, Liang, and Zeger (1994) and the other references at the end of this paper for details on this method.

Correlated data can arise from situations such as

•  longitudinal studies, in which multiple measurements are taken on the same subject at different points in time

•  clustering, where measurements are taken on subjects that share a common category or characteristic that leads to correlation. For example, incidence of pulmonary disease among family members may be correlated because of hereditary factors.

The correlation must be accounted for by analysis methods appropriate to the data. Possible consequences of analyzing correlated data as if it were independent are

•  incorrect inferences concerning regression parameters due to underestimated standard errors

•  inefficient estimators, that is, more mean square error in regression parameter estimators than necessary

Example of Longitudinal Data

These data, from Thall and Vail (1990), are concerned with the treatment of epileptic seizure episodes. They were also analyzed in Diggle, Liang, and Zeger (1994). The data consist of the number of epileptic seizures in an eight-week baseline period, before any treatment, and in each of four two-week treatment periods, in which patients received either a placebo or the drug Progabide as an adjunct to other chemotherapy. A portion of the data is shown in Table 1.

Table 1. Epileptic Seizure Data

   Patient ID   Treatment    Baseline   Visit1   Visit2   Visit3   Visit4
   104          Placebo            11        5        3        3        3
   106          Placebo            11        3        5        3        3
   107          Placebo             6        2        4        0        5
   101          Progabide          76       11       14        9        8
   102          Progabide          38        8        7        9        4
   103          Progabide          19        0        4        3        0

Within-subject measurements are likely to be correlated, whereas between-subject measurements are likely to be independent. The raw correlations among the counts between visits are shown in Figure 1. They indicate strong correlation in the number of seizures between the visits. The seizure data will be analyzed in later sections as count data with a specified correlation structure.

Figure 1. Raw Correlations

              Visit 1   Visit 2   Visit 3   Visit 4
   Visit 1       1.00
   Visit 2        .69      1.00
   Visit 3        .54       .67      1.00
   Visit 4        .72       .76       .71      1.00
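As a small illustration (not taken from the paper itself), the following SAS sketch reads the portion of the data shown in Table 1 and computes between-visit correlations with PROC CORR. The data set and variable names are hypothetical, and correlations computed from only these six patients will not reproduce Figure 1, which is based on the full data.

   data seizure_wide;                        /* portion of Table 1 only */
      input id trt :$9. baseline visit1-visit4;
      datalines;
   104 Placebo    11  5  3  3  3
   106 Placebo    11  3  5  3  3
   107 Placebo     6  2  4  0  5
   101 Progabide  76 11 14  9  8
   102 Progabide  38  8  7  9  4
   103 Progabide  19  0  4  3  0
   ;
   run;

   proc corr data=seizure_wide;
      var visit1-visit4;    /* raw between-visit correlations of the counts */
   run;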
Generalized Linear Models for Independent Data

Let $Y_i$, $i = 1, \ldots, n$, be independent measurements. Generalized linear models for independent data are characterized by

•  a systematic component

   $g(E(Y_i)) = g(\mu_i) = x_i'\beta$

   where $\mu_i = E(Y_i)$, $g$ is a link function that relates the means of the responses to the linear predictor $x_i'\beta$, $x_i$ is a vector of independent variables for the $i$th observation, and $\beta$ is a vector of regression parameters to be estimated.

•  a random component: $Y_i$, $i = 1, \ldots, n$, are independent and have a probability distribution from an exponential family (binomial, Poisson, normal, gamma, or inverse gaussian).

The exponential family assumption implies that the variance of $Y_i$ is given by $v_i = \phi\, v(\mu_i)$, where $v$ is a variance function that is determined by the specific probability distribution and $\phi$ is a dispersion parameter that may be known or may be estimated from the data, depending on the specific model. The variance functions for the binomial and Poisson distributions are given by

•  binomial: $v(\mu) = \mu(1-\mu)$

•  Poisson: $v(\mu) = \mu$

The maximum likelihood estimator of the $p \times 1$ parameter vector $\beta$ is obtained by solving the estimating equations

   $\sum_{i=1}^{n} \frac{\partial \mu_i}{\partial \beta}\, v_i^{-1}\, (y_i - \mu_i(\beta)) = 0$

for $\beta$. This is a nonlinear system of equations for $\beta$, and it can be solved iteratively by the Fisher scoring or Newton-Raphson algorithm.

Modeling Correlation: Generalized Estimating Equations

Let $Y_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, K$, represent the $j$th measurement on the $i$th subject. There are $n_i$ measurements on subject $i$ and $\sum_{i=1}^{K} n_i$ total measurements. Correlated data are modeled using the same link function and linear predictor setup (systematic component) as in the independence case. The random component is described by the same variance functions as in the independence case, but the covariance structure of the correlated measurements must also be modeled. Let the vector of measurements on the $i$th subject be $Y_i = [y_{i1}, \ldots, y_{in_i}]'$ with corresponding vector of means $\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]'$, and let $V_i$ be an estimate of the covariance matrix of $Y_i$. The Generalized Estimating Equation for estimating $\beta$ is an extension of the independence estimating equation to correlated data and is given by

   $\sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}\, (Y_i - \mu_i(\beta)) = 0$

Working Correlations

Let $R_i(\alpha)$ be an $n_i \times n_i$ "working" correlation matrix that is fully specified by the vector of parameters $\alpha$. The covariance matrix of $Y_i$ is modeled as

   $V_i = \phi\, A_i^{1/2} R_i(\alpha) A_i^{1/2}$

where $A_i$ is an $n_i \times n_i$ diagonal matrix with $v(\mu_{ij})$ as the $j$th diagonal element. If $R_i(\alpha)$ is the true correlation matrix of $Y_i$, then $V_i$ will be the true covariance matrix of $Y_i$.

The working correlation matrix is not usually known and must be estimated. It is estimated in the iterative fitting process using the current value of the parameter vector $\beta$ to compute appropriate functions of the Pearson residual

   $r_{ij} = \frac{y_{ij} - \mu_{ij}}{\sqrt{v(\mu_{ij})}}$

There are several specific choices of the form of the working correlation matrix $R_i(\alpha)$ commonly used to model the correlation matrix of $Y_i$. A few of the choices are shown here (a sketch of how they might be specified in software appears after the list); refer to Liang and Zeger (1986) for additional choices. The dimension of the vector $\alpha$, which is treated as a nuisance parameter, and the form of the estimator of $\alpha$ are different for each choice.

•  Fixed: $R_i(\alpha) = R_0$, a fixed correlation matrix. For $R_0 = I$, the identity matrix, the GEE reduces to the independence estimating equation.

•  m-dependent:

   $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \begin{cases} \alpha_t & t = 1, 2, \ldots, m \\ 0 & t > m \end{cases}$

•  Exchangeable: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha$, $j \neq k$

•  Unstructured: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha_{jk}$, $j \neq k$
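Because the GEE capability of the GENMOD procedure is still under development at the time of writing, the statements below are only a sketch of how a working correlation structure might be requested; they are illustrative rather than documented syntax. The data set SEIZURE_LONG, its variables, and the offset are assumptions for this example (one observation per patient per visit, with LTIME holding the log of the length of the observation period).

   proc genmod data=seizure_long;
      class id trt;
      model count = trt time / dist=poisson link=log offset=ltime;
      repeated subject=id / type=exch corrw;   /* exchangeable working correlation */
   run;

Under this sketch, the other structures in the list above would correspond to different TYPE= values, for example TYPE=IND (independence), TYPE=MDEP(m) (m-dependent), and TYPE=UN (unstructured); CORRW requests a printout of the estimated working correlation matrix.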
Fitting Algorithm

The following is an algorithm for fitting the specified model using GEEs:

•  Compute an initial estimate of $\beta$, for example with an ordinary generalized linear model assuming independence.

•  Compute the working correlations $R_i(\alpha)$.

•  Compute an estimate of the covariance:

   $V_i = \phi\, A_i^{1/2} R_i(\alpha) A_i^{1/2}$

•  Update $\beta$:

   $\beta_{r+1} = \beta_r + \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}\, \frac{\partial \mu_i}{\partial \beta} \right]^{-1} \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}\, (Y_i - \mu_i) \right]$

•  Iterate until convergence.
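To make these steps concrete, here is a minimal SAS/IML sketch of the algorithm for a Poisson model with log link and an exchangeable working correlation, run on a toy subset of the Table 1 counts. It is a simplified illustration, not the GENMOD implementation or the Karim macro: the design matrix, starting value, moment estimators of $\phi$ and $\alpha$, and fixed iteration count are all choices made here for brevity.

   proc iml;
      /* Toy data: counts for four subjects (rows) at four visits (columns),  */
      /* taken from Table 1.  Each subject shares the same design matrix x.   */
      y = {  5  3  3  3,
             3  5  3  3,
             2  4  0  5,
            11 14  9  8 };
      K = nrow(y);   n = ncol(y);   ntot = K*n;
      x = j(n, 1, 1) || t(1:n);          /* intercept and visit number        */
      p = ncol(x);

      beta = log(y[:]) // 0;             /* crude starting value; the paper    */
                                         /* suggests an independence GLM fit   */
      do iter = 1 to 20;                 /* fixed iteration count for brevity; */
                                         /* check convergence in practice      */
         mu = exp(x * beta);             /* means, identical across subjects   */
         r  = (y - repeat(t(mu), K, 1)) / repeat(t(sqrt(mu)), K, 1);
                                         /* Pearson residuals r_ij             */

         /* moment estimates of the dispersion phi and exchangeable alpha      */
         phi = sum(r#r) / (ntot - p);
         num = 0;  npair = 0;
         do s = 1 to K;
            do j1 = 1 to n-1;
               do j2 = j1+1 to n;
                  num   = num + r[s,j1]*r[s,j2];
                  npair = npair + 1;
               end;
            end;
         end;
         alpha = num / ((npair - p) * phi);
         Rw = diag(j(n, 1, 1-alpha)) + j(n, n, alpha);   /* working correlation */

         /* GEE update for beta                                                */
         score = j(p, 1, 0);   hess = j(p, p, 0);
         do s = 1 to K;
            Amat = diag(mu);                       /* v(mu) = mu for Poisson    */
            Vi   = phi * sqrt(Amat) * Rw * sqrt(Amat);  /* V_i = phi A^(1/2) R A^(1/2) */
            D    = Amat * x;                       /* d mu / d beta for log link */
            hess  = hess  + t(D) * inv(Vi) * D;
            score = score + t(D) * inv(Vi) * (t(y[s,]) - mu);
         end;
         beta = beta + solve(hess, score);
      end;
      print beta alpha;
   quit;

The update inside the subject loop is exactly the bracketed sums in the last step of the algorithm above; in a real analysis the design matrix would differ across subjects and convergence would be monitored rather than assumed after a fixed number of iterations.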