Logistic Regression Models for the Analysis of Correlated Data

Up to this point in the text we have considered the use of the logistic regression model in settings where we observe a single dichotomous response for a sample of statistically independent subjects. However, there are settings where the assumption of independence of responses may not hold for a variety of reasons. For example, consider a study of asthma in children in which subjects are interviewed bi-monthly for 1 year. At each interview the date is recorded and the mother is asked whether, during the previous 2 months, her child had an asthma attack severe enough to require medical attention, whether the child had a chest cold, and how many smokers lived in the household. The child’s age and race are recorded at the first interview. The primary outcome is the occurrence of an asthma attack. What differs here is the lack of independence in the observations due to the fact that we have six measurements on each child. In this example, each child represents a cluster of correlated observations of the outcome. The measurements of the presence or absence of a chest cold and the number of smokers residing in the household can change from observation to observation and thus are called clusterspecific or time-varying covariates. The date changes in a systematic way and is recorded to model possible seasonal effects. The child’s age and race are constant for the duration of the study and are referred to as cluster-level or time-invariant covariates. The terms clusters, subjects, cluster-specific and cluster-level covariates are general enough to describe multiple measurements on a single subject or single measurements on different but related subjects. An example of the latter setting would be a study of all children in a household. Repeated measurements on the same subject or a subject clustered in some sort of unit (household, hospital, or physician) are the two most likely scenarios leading to correlated data.