Paper information - Uncovering Machine Learning-Ready Data from Public Clinical Trial Resources: A case-study on normalization across Aggregate Content of ClinicalTrials.gov

The state of clinical data is a barrier to the development of machine learning models to improve healthcare. Uncontrolled clinical free text is common in both patient and clinical trial data: the resulting spelling and grammatical errors, phrasing variation, and other variability make the data difficult to leverage. As part of our effort to harmonize the drop-withdrawal reasons in the Aggregate Analysis of ClinicalTrials.gov (AACT) database to a controlled vocabulary, we explored two solutions. The first used Elastic's fuzzy matching capability to match entries in the AACT Drop-Withdrawal table to a list of user-specified terms (74.6% coverage). The second was a custom pipeline employing NLP preprocessing, fuzzy matching based on Levenshtein distance, and semantic similarity mapping using a pre-trained fastText model (98% coverage). Although manual oversight is still required, the effort needed to harmonize free text with a controlled vocabulary is notably reduced. This work enables the rapid harmonization of clinical databases, allowing them to be leveraged for machine learning and analytics.
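As an illustration of the preprocessing and Levenshtein-distance stages of the second approach, the following is a minimal sketch and not the authors' implementation. The vocabulary in `CONTROLLED_TERMS`, the 0.8 similarity threshold, and the function names are all assumptions made for the example. The semantic similarity stage is omitted because it requires a pre-trained embedding model; in a fuller pipeline it would plausibly handle the entries that this stage leaves unmatched.

```python
import re

# Hypothetical controlled vocabulary; the term list used in the paper is not given here.
CONTROLLED_TERMS = [
    "adverse event",
    "lost to follow-up",
    "withdrawal by subject",
    "protocol violation",
    "death",
]


def normalize(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def match_term(raw: str, terms=CONTROLLED_TERMS, threshold: float = 0.8):
    """Return (best term, score), or (None, score) when no term reaches the threshold.

    The threshold is an assumed value, not one reported in the paper.
    """
    cleaned = normalize(raw)
    best_term, best_score = None, 0.0
    for term in terms:
        score = similarity(cleaned, normalize(term))
        if score > best_score:
            best_term, best_score = term, score
    if best_score >= threshold:
        return best_term, best_score
    return None, best_score
```

With this sketch, `match_term("Advrse Event")` maps the misspelled entry to `"adverse event"`, while `match_term("sponsor decision")` returns `None` as the term and would be left for a later stage or for manual review.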