Data cleaning and management protocols for linked perinatal research data: a good practice example from the Smoking MUMS (Maternal Use of Medications and Safety) Study

Background: Data cleaning is an important quality assurance step in data linkage research studies. This paper presents the data cleaning and preparation process for a large-scale cross-jurisdictional Australian study (the Smoking MUMS Study) to evaluate the utilisation and safety of smoking cessation pharmacotherapies during pregnancy.

Methods: Perinatal records for all deliveries (2003–2012) in the States of New South Wales (NSW) and Western Australia were linked by State-based data linkage units to State-based data collections, including hospital separation, emergency department and death data (mothers and babies) and congenital defect notifications (babies in NSW). A national data linkage unit linked pharmaceutical dispensing data for the mothers. All linkages were probabilistic. Twenty-two steps assessed the uniqueness of records and the consistency of items within and across data sources, resolved discrepancies in the linkages between units, and identified women having records in both States.

Results: State-based linkages yielded a cohort of 783,471 mothers and 1,232,440 babies. Likely false positive links relating to 3703 mothers were identified. Corrections to baby’s date of birth and age, and to parity, were made for 43,578 records, while 1996 records were flagged as duplicates. Checks for the uniqueness of the matches between State and national linkages detected 3404 ID clusters, suggestive of missed links in the State linkages, and identified 1986 women who had records in both States.

Conclusions: Analysis of content data can identify inaccurate links that cannot be detected by data linkage units that have access to personal identifiers only. Perinatal researchers are encouraged to adopt the methods presented to ensure quality and consistency among studies using linked administrative data.
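The uniqueness check between State and national linkages can be illustrated with a minimal sketch. It assumes each mother has a State-issued person ID and a national person ID, and treats any national ID matched to more than one State ID as an "ID cluster". The ID formats, the State prefix convention and the function names are hypothetical and are not taken from the study, whose actual procedure is not described in this abstract.

```python
from collections import defaultdict


def find_id_clusters(pairs):
    """Return national IDs matched to more than one State person ID.

    `pairs` is an iterable of (state_id, national_id) tuples.
    Hypothetical layout: the study's real ID structure is not given here.
    """
    by_national = defaultdict(set)
    for state_id, national_id in pairs:
        by_national[national_id].add(state_id)
    # Keep only national IDs that map to two or more distinct State IDs.
    return {nid: sorted(sids) for nid, sids in by_national.items() if len(sids) > 1}


def spans_both_states(state_ids):
    """True if a cluster holds IDs from more than one State.

    Assumes (for illustration only) that the State is the prefix before '-'.
    """
    return len({sid.split("-")[0] for sid in state_ids}) > 1


# Toy data: two NSW IDs share one national ID (a possible missed State link),
# and one national ID is matched to both an NSW and a WA ID.
pairs = [
    ("NSW-001", "N-10"),
    ("NSW-002", "N-10"),
    ("WA-007", "N-11"),
    ("NSW-003", "N-12"),
    ("WA-009", "N-12"),
]

clusters = find_id_clusters(pairs)
cross_state = [nid for nid, sids in clusters.items() if spans_both_states(sids)]
```

In this toy data, `N-10` would be flagged for review as a possible missed link within one State, and `N-12` as a woman with records in both States. In practice, flagged clusters would still need review against content data before any records are merged.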
