Important biomedical information is often recorded, published or archived in unstructured and semi-structured textual form. Artificial intelligence and knowledge discovery techniques may be applied to large volumes of such data to identify and extract useful metadata, not only for providing access to these documents, but also for conducting analyses and uncovering patterns and trends in a field. The System for Preservation of Electronic Resources (SPER), an information management tool developed at the U.S. National Library of Medicine, provides these capabilities by integrating machine learning, data mining and digital preservation techniques. In this paper, we present an overview of SPER and its ability to retrieve information from one such dataset. We show how SPER was applied to the semi-structured records of an international health science program: the 46-year continuous archive of conference publications and related documents from the Joint Cholera Panel of the U.S.-Japan Cooperative Medical Science Program (CMSP). We explain the techniques by which metadata was extracted automatically from the semi-structured document contents to preserve these publications. We then show how the extracted metadata was used to describe quantitatively the activity of a research community, in support of a preliminary study of a subset of the program's specific health science goals.