Extracting Formulaic and Free Text Clinical Research Articles Metadata using Conditional Random Fields

We explore the use of conditional random fields (CRFs) to automatically extract important metadata from clinical research articles. These metadata include formulaic fields about the authors, extracted from the title page, as well as free-text fields describing the study's critical parameters, such as longitudinal variables and medical intervention methods, extracted from the body text of the article. Extracting such information can help readers conduct deep semantic searches of articles, and can help policy makers and sociologists track macro-level trends in research. Preliminary results show an acceptable level of performance on the formulaic metadata and high precision on the fields found in the free text.
