Paper Information - Stepwise Variable Selection for Loglinear Mixture in Record Linkage

A model-building strategy is proposed to improve probabilistic matching in record linkage, focusing on the two-component loglinear mixture model, with one component for matched pairs and one for unmatched pairs. In practice, the comparison attributes (i.e., covariates) often interact with each other, so the loglinear models for both matched and unmatched pairs contain interaction terms. However, the interaction patterns are often not the same for the two components. In particular, because the number of matched pairs in a real application is very small compared with the number of unmatched pairs, the model for matched pairs cannot be fitted with the same higher-order interactions as the model for unmatched pairs. The proposed strategy attempts to avoid both the underfitting and the overfitting that result from subjective model specification. Unlike subjective specification, this strategy is data-driven. Starting from the model with no interactions, we add interactions sequentially to the two loglinear components using forward selection. To this end, we define the alternatively climbing pathways through mixture families of two components with higher-order interactions. The mixture models expanded along a pathway are successively nested, so conventional tests for nested models can be applied. For parameter estimation in the mixture, a simplified method for the EM algorithm (including the choice of initial parameter values) is developed, which facilitates fitting the mixture model with existing packages and functions in statistical software such as R. A simulation study is conducted for various situations to assess the model selection approach, and the selected models are compared with the naive model that assumes field independence.
We apply this strategy to the record linkage case study in SSC 2006 and identify interactions among certain comparison attributes for both matched and unmatched pairs; these interaction patterns are not always the same for the two groups.
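
As a rough illustration of the starting point of the forward selection, the sketch below fits the naive two-component mixture that assumes field independence (no interactions in either component) by EM on simulated binary agreement patterns. It is not the simplified EM method developed in the paper. The function names (`simulate_pairs`, `em_independence_mixture`), the initial values, the number of fields, and the simulation settings are all assumptions made for this example.

```python
import random


def simulate_pairs(n_pairs, p_match, m_probs, u_probs, seed=0):
    """Simulate binary agreement patterns (hypothetical helper for this sketch).

    m_probs[k] / u_probs[k]: probability that field k agrees for a
    matched / unmatched pair. Fields are drawn independently within a class.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        probs = m_probs if rng.random() < p_match else u_probs
        pairs.append(tuple(int(rng.random() < q) for q in probs))
    return pairs


def clip(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid zero likelihoods."""
    return min(max(x, eps), 1.0 - eps)


def em_independence_mixture(pairs, n_iter=100, p_init=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-component mixture with independent fields in each component.

    Initial values are assumptions: matched pairs are taken to be rare and to
    agree often, unmatched pairs to agree rarely.
    Returns (p, m, u, post), where post holds the match probabilities from the
    final E-step.
    """
    n = len(pairs)
    n_fields = len(pairs[0])
    p = p_init
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    post = [p] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for g in pairs:
            like_m, like_u = p, 1.0 - p
            for k, agree in enumerate(g):
                like_m *= m[k] if agree else 1.0 - m[k]
                like_u *= u[k] if agree else 1.0 - u[k]
            post.append(like_m / (like_m + like_u))
        # M-step: posterior-weighted agreement rates for each component.
        w = sum(post)
        p = clip(w / n)
        for k in range(n_fields):
            agree_m = sum(z for z, g in zip(post, pairs) if g[k])
            agree_u = sum(1.0 - z for z, g in zip(post, pairs) if g[k])
            m[k] = clip(agree_m / w)
            u[k] = clip(agree_u / (n - w))
    return p, m, u, post


# Simulated example: 3000 pairs, 10% matched, four comparison fields.
TRUE_M = (0.95, 0.90, 0.90, 0.85)
TRUE_U = (0.05, 0.10, 0.10, 0.15)
pairs = simulate_pairs(3000, 0.1, TRUE_M, TRUE_U, seed=1)
p_hat, m_hat, u_hat, post = em_independence_mixture(pairs)
print("estimated match proportion:", round(p_hat, 3))
print("estimated m:", [round(x, 3) for x in m_hat])
print("estimated u:", [round(x, 3) for x in u_hat])
```

In a model with interactions, the M-step would no longer reduce to these closed-form weighted proportions. A standard alternative is to refit each component's loglinear model to the posterior-weighted pattern counts, which is the kind of step that existing model-fitting functions can handle.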
