Development and Validation of Various Phenotyping Algorithms for Diabetes Mellitus Using Data from Electronic Health Records

Precision medicine requires extremely large samples. Electronic health records (EHR) are thought to be a cost-effective source of data for that purpose. Phenotyping algorithms help reduce classification errors, making EHR a more reliable source of information for research. Four algorithm development strategies for classifying patients according to their diabetes status (diabetics; non-diabetics; inconclusive) were tested (one codes-only algorithm; one boolean algorithm, four statistical learning algorithms and six stacked generalization meta-learners). The best performing algorithms within each strategy were tested on the validation set. The stacked generalization algorithm yielded the highest Kappa coefficient value in the validation set (0.95 95% CI 0.91, 0.98). The implementation of these algorithms allows for the exploitation of data from thousands of patients accurately, greatly reducing the costs of constructing retrospective cohorts for research.