Background: Structured reporting enables semantic understanding and prompt retrieval of clinical findings about patients. While synoptic pathology reporting provides templates for data entry, most information in pathology reports remains in narrative free-text form. Extracting data of interest from narrative pathology reports could significantly improve the representation of this information and enable complex structured queries. However, manual extraction is tedious and error-prone, and automated tools are often built on a fixed training dataset and are not easily adaptable. Our goal is to extract data from pathology reports to support advanced patient search, using a highly adaptable semi-automated data extraction system that adjusts and self-improves by learning from a user's interactions with minimal human effort.
Methods: We have developed an online machine learning-based information extraction system called IDEAL-X. When a report is loaded in its graphical user interface, the system's data extraction engine automatically annotates values for the user to review. The system analyzes the user's corrections to these annotations with online machine learning, and incrementally refines the learning model as reports are processed. The system also takes advantage of customized controlled vocabularies, which can be adaptively refined during the online learning process to further assist data extraction. As the accuracy of automatic annotation improves over time, the effort of human annotation is gradually reduced. After all reports are processed, a built-in query engine can be used to define queries over the extracted structured data.
Results: We evaluated the system on a dataset of anatomic pathology reports from 50 patients. Extracted data elements include demographic data, diagnosis, genetic markers, and procedure. The system achieves F1 scores of around 95% for the majority of tests.
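For reference, the F1 score used as the evaluation measure is the harmonic mean of precision and recall. The following is a minimal Python sketch; the function name and the counts are hypothetical, chosen only for illustration, and are not taken from the evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1 score (harmonic mean of precision and recall)."""
    predicted = true_positives + false_positives   # values the system extracted
    actual = true_positives + false_negatives      # values present in the reports
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 19 correct extractions, 1 spurious, 1 missed.
example_f1 = f1_score(true_positives=19, false_positives=1, false_negatives=1)
print(round(example_f1, 2))
```

Because F1 penalizes both spurious and missed values, a score near 95% implies that precision and recall are both high.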
Conclusions: Extracting data from pathology reports could provide more accurate knowledge to support biomedical research and clinical diagnosis. IDEAL-X bridges online machine learning-based data extraction and the knowledge captured from human feedback. By combining iterative online learning and adaptive controlled vocabularies, IDEAL-X can deliver highly adaptive and accurate data extraction to support patient search.
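The interplay of correction-driven online learning and an adaptive controlled vocabulary can be illustrated with a toy sketch in Python. This is not IDEAL-X's actual engine: the class `OnlineExtractor`, the "preceding token" cue heuristic, the single-token values, and the three sample reports are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class OnlineExtractor:
    """Toy correction-driven extractor (hypothetical, single-token values only)."""

    def __init__(self, fields):
        # Adaptive controlled vocabulary: field -> known values (starts empty).
        self.vocabulary = {field: set() for field in fields}
        # Learned cues: field -> counts of the token preceding a confirmed value.
        self.cues = defaultdict(Counter)

    def annotate(self, text):
        """Propose one value per field: vocabulary match first, then learned cue."""
        tokens = tokenize(text)
        proposal = {}
        for field, terms in self.vocabulary.items():
            match = next((tok for tok in tokens if tok in terms), None)
            if match is None and self.cues[field]:
                cue = self.cues[field].most_common(1)[0][0]
                match = next(
                    (tok for prev, tok in zip(tokens, tokens[1:]) if prev == cue),
                    None,
                )
            if match is not None:
                proposal[field] = match
        return proposal

    def learn(self, text, reviewed):
        """Update vocabulary and cues from the user-reviewed values of one report."""
        tokens = tokenize(text)
        for field, value in reviewed.items():
            value = value.lower()
            self.vocabulary[field].add(value)
            for prev, tok in zip(tokens, tokens[1:]):
                if tok == value:
                    self.cues[field][prev] += 1


# Hypothetical reports paired with the values a reviewer would confirm.
reports = [
    ("Procedure: biopsy. Diagnosis: glioblastoma.",
     {"procedure": "biopsy", "diagnosis": "glioblastoma"}),
    ("Procedure: resection. Diagnosis: meningioma.",
     {"procedure": "resection", "diagnosis": "meningioma"}),
    ("Meningioma seen after resection.",
     {"procedure": "resection", "diagnosis": "meningioma"}),
]

extractor = OnlineExtractor(fields=["procedure", "diagnosis"])
corrections = []
for text, truth in reports:
    proposal = extractor.annotate(text)
    # Count fields the reviewer had to fix, then learn from the reviewed report.
    corrections.append(sum(proposal.get(f) != v for f, v in truth.items()))
    extractor.learn(text, truth)

print(corrections)
```

In this sketch the first report needs every field corrected, the second is annotated from the learned cues, and the third is annotated from the vocabulary that grew during review, which mirrors how human annotation effort is expected to fall as reports are processed.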