Semiparametric maximum likelihood for multi-phase response-selective sampling and missing data problems

Missing data are a common occurrence in medical and other research data collection. Missing data patterns are sometimes caused by design. In multi-phase response-selective sampling, the response variable Y and some easily obtainable variables are fully observed for a first-phase sample or finite population. Some of the covariates of interest, X, which might be difficult or too expensive to obtain, are then observed at later phases for smaller subsamples. The selection of data at each phase depends on the categorical information of Y, with the option of using other fully observed discrete variables for cross- or post-stratification. In this thesis, we start from the simple case-control study, in which the data are collected under a two-phase response-selective sampling scheme. Special conditions are then considered to pose different questions of interest. In a secondary study, we may be interested in another binary response variable Y2, which is associated with the original case-control response Y in the data. Conventional logistic regression analysis can no longer provide consistent parameter estimates in this case. We also consider the situation in which the case-control status of each subject is defined by dichotomising a continuous variable that is potentially available in the population. Ignoring this source of information, as binary logistic regression does, typically results in a loss of efficiency, which may be substantial. Linear regression analysis can be carried out to efficiently estimate the logistic model odds ratios and their 95% confidence intervals. The behaviour of various methods of analysis and sampling strategies for linear models is also discussed. Finally, we consider three-phase response-selective sampling designs, methods of analysis, and some applications of three-phase methods. Our main approach to data analysis is semiparametric maximum likelihood.
Survey-weighted methods, as well as other semiparametric approaches, are also considered for comparison. Their relative efficiencies and robustness properties are investigated using a wide range of simulation studies and real