Design of cross‐sectional surveys using cluster sampling: an overview with Australian case studies

Cross-sectional surveys are a mainstay of research in epidemiology and public health, despite the fact that their relatively simple design allows little scope to investigate causal relationships. Their appeal lies in their ability to provide descriptive information about the health of populations (prevalence estimates) and to estimate associations between health states and demographic and other background factors. Although such cross-sectional associations need to be interpreted cautiously, they often provide useful insights into the aetiology of diseases and health-risky behaviours. Examples of cross-sectional sample surveys published recently in Australia include the large-scale National Health Survey1 conducted by the Australian Bureau of Statistics and smaller-scale studies on vaccination coverage in an Australian city,2 transport patterns among young school children3 and health and health-related behaviour of adolescents.4,5

Given that surveys aim to provide valid estimates of population parameters, random selection of participants is a key principle (unlike, for instance, in randomised controlled trials, where it is randomised allocation between groups that is the key to valid comparisons). Surveys based on simple random samples, where participants are randomly selected from a complete list of all eligible subjects, are easy to conceptualise and analyse, but often very difficult to carry out. Instead, many survey samples are selected using cluster sampling methods, where members of the population are grouped in clusters and sampling takes place in two (or more) stages: first a sample of clusters is selected and then a sample of individuals within clusters.

A classic example of cluster sampling is provided by the WHO Expanded Programme on Immunisation (EPI), for which a modified cluster sampling methodology6-8 was developed to estimate immunisation coverage in developing countries.
This survey entails the random selection of 30 clusters (usually geographical areas) followed by quota (non-random) sampling of seven individuals from each cluster. When data for creating sampling frames (lists of eligible subjects) are absent or of dubious quality, as is often the case in developing countries, the EPI technique is a feasible and cost-effective means of collecting information. However, since it is problematic to apply the methods of statistical inference used for surveys to non-random samples, the EPI method precludes the calculation of standard errors and confidence intervals for survey estimates.

More rigorous cluster-based probability sampling methods are, however, used in many cross-sectional health surveys where, for feasibility reasons, it is necessary to base sampling on 'natural' clusters, such as schools, day-care centres, general practices or community health centres. The related methodology of cluster-based or group randomised trials has received considerable attention,10-14 including a recent textbook.15 Cluster randomisation has been widely employed in the evaluation of community health interventions, such as the Minnesota Heart Health Program,16 and many trials take advantage of other clusters such as school classes, general practices, work places and households.17-19

In contrast with cluster-based trials, there has been less written in the public health literature on the use of cluster sampling in cross-sectional surveys. This paper reviews issues of sample design, statistical analysis and sample size requirements, with the aim of providing a simple and accessible explanation of cluster sampling and related technical issues. The concepts of design effect, intra-cluster correlation (ICC) and sampling weights are discussed.
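
As a concrete illustration of two-stage selection and of sampling weights, the following Python sketch draws a simple random sample of clusters and then a simple random sample of individuals within each selected cluster. It is a minimal sketch under stated assumptions, not the procedure of any survey discussed here: the function name `two_stage_sample`, the school and pupil labels and the population sizes are all hypothetical, every cluster is assumed to be non-empty, and the second stage uses random selection rather than the quota sampling of the EPI design.

```python
import random


def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=0):
    """Draw a two-stage cluster sample (hypothetical helper).

    clusters: dict mapping a cluster id to the list of its members.
    Returns a list of (cluster_id, member, weight) tuples.
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of clusters, without replacement.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for cluster_id in chosen:
        members = clusters[cluster_id]
        take = min(n_per_cluster, len(members))
        # Inclusion probability = P(cluster chosen) * P(member chosen | cluster).
        p_inclusion = (n_clusters / len(clusters)) * (take / len(members))
        # Stage 2: simple random sample of members within the cluster.
        for member in rng.sample(members, take):
            # The sampling weight is the inverse of the inclusion probability.
            sample.append((cluster_id, member, 1.0 / p_inclusion))
    return sample


# Hypothetical population: 100 schools with 50 pupils each.
population = {s: [f"school{s}-pupil{p}" for p in range(50)] for s in range(100)}
sample = two_stage_sample(population, n_clusters=30, n_per_cluster=7)
print(len(sample))                   # number of sampled pupils
print(sum(w for _, _, w in sample))  # weights sum to an estimate of population size
```

Because every individual's weight is the inverse of a known inclusion probability, weighted totals from such a sample estimate the corresponding population totals, which is what distinguishes a probability sample from a quota sample.
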
A child transportation study3 and an adolescent health survey4,5 are re-analysed, first, to illustrate the statistical issues associated with cluster surveys and, second, to estimate design effects and ICCs associated with a number of outcomes. Our publication of estimated ICCs based on Australian data should assist other researchers in planning similar surveys. These estimates will also be useful in the planning of cluster randomised trials, for which the same information on intra-cluster correlation is needed.
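
To show how an ICC estimate feeds into survey planning, the sketch below uses the widely quoted approximation for equal-sized clusters, design effect = 1 + (m - 1) × ICC, where m is the number of individuals sampled per cluster, together with the usual one-way analysis-of-variance estimator of the ICC. This is an illustration under assumptions rather than the analysis carried out in this paper: the function names and the ICC value of 0.05 are hypothetical, and the design effect formula ignores unequal cluster sizes and sampling weights.

```python
def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Size of a simple random sample carrying about the same information."""
    return n_total / design_effect(cluster_size, icc)


def anova_icc(groups):
    """One-way ANOVA estimator of the ICC.

    groups: list of lists, one list of numeric outcomes per cluster
    (binary outcomes can be coded 0/1). Needs at least two clusters
    and more observations than clusters.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Adjusted average cluster size, allowing for unequal cluster sizes.
    m0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (m0 - 1) * ms_within)


# Hypothetical planning figures: 30 clusters of 7 and an assumed ICC of 0.05.
print(design_effect(7, 0.05))
print(effective_sample_size(30 * 7, 7, 0.05))
```

With these assumed figures the design effect is about 1.3, so the 210 respondents of a 30 by 7 design carry roughly the information of 162 respondents from a simple random sample. Note that the ANOVA estimator can return negative values when clusters are no more alike than chance would suggest.
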

[1]  Kosuke Imai, et al. Survey Sampling. 1998, Nov/Dec 2017.

[2]  A. Winsor. Sampling techniques. 2000, Nursing times.

[3]  J B Carlin, et al. Analysis of binary outcomes in longitudinal studies using weighted estimating equations and discrete-time survival methods: prevalence and incidence of smoking in an adolescent cohort. 1999, Statistics in medicine.

[4]  J B Carlin, et al. The course of early smoking: a population-based cohort study over three years. 1998, Addiction.

[5]  J M Bland, et al. The intracluster correlation coefficient in cluster randomisation. 1998, BMJ.

[6]  J M Bland, et al. Analysis of a trial randomised in clusters. 1998, BMJ.

[7]  J. Carlin, et al. Walking to school and traffic exposure in Australian children. 1977, Australian and New Zealand journal of public health.

[8]  F B Hu, et al. Intraclass correlation estimates in a school-based smoking prevention study. Outcome and mediating variables, by sex and ethnicity. 1996, American journal of epidemiology.

[9]  J. Carlin, et al. Is smoking associated with depression and anxiety in teenagers? 1996, American journal of public health.

[10]  A. Turner, et al. A not quite as quick but much cleaner alternative to the Expanded Programme on Immunization (EPI) Cluster Survey design. 1996, International journal of epidemiology.

[11]  E. Seneta, et al. Development of sample size models for national general practice surveys. 2010, Australian journal of public health.

[12]  R. Hall, et al. A population-based survey of immunisation coverage in two-year-old children. 2010, Australian journal of public health.

[13]  J. Carlin, et al. Patterns of common drug use in teenagers. 2010, Australian journal of public health.

[14]  J. Katz, et al. Sample-size implications for population-based cluster surveys of nutritional status. 1995, The American journal of clinical nutrition.

[15]  P J Hannan, et al. Intraclass correlation among common measures of adolescent smoking: estimates, correlates, and applications in smoking prevention studies. 1994, American journal of epidemiology.

[16]  D. Jacobs, et al. Parameters to aid in the design and analysis of community trials: intraclass correlations from the Minnesota Heart Health Program. 1994, Epidemiology.

[17]  B. Flay, et al. Project towards no tobacco use: 1-year behavior outcomes. 1993, American journal of public health.

[18]  D R Jacobs, et al. The Healthy Worker Project: a work-site intervention for weight control and smoking cessation. 1993, American journal of public health.

[19]  A Donner, et al. Sample size requirements for stratified cluster randomization designs. 1992, Statistics in medicine.

[20]  J. Ludbrook. Practical statistics for medical research. 1991.

[21]  M. Edwardes, et al. A randomized trial to evaluate the risk of gastrointestinal disease due to consumption of drinking water meeting current microbiological standards. 1991, American journal of public health.

[22]  S. Bennett, et al. A simplified general method for cluster-sample surveys of health in developing countries. 1991, World health statistics quarterly. Rapport trimestriel de statistiques sanitaires mondiales.

[23]  A. Donner, et al. Randomization by cluster. Sample size requirements and analysis. 1981, American journal of epidemiology.