Abstracts versus Full Texts and Patents: A Quantitative Analysis of Biomedical Entities

In information retrieval, named entity recognition makes it possible to apply semantic search in domain-specific corpora. Recently, more full-text patents and journal articles have become freely available. As the distribution of information among the different sections of these documents is unknown, an analysis of its diversity is of interest. This paper examines the density and variety of relevant life science terminologies in Medline abstracts, PubMedCentral journal articles, and patents from the TREC Chemistry Track. For this purpose, named entity recognition was conducted for various biological, pharmaceutical, and chemical entity classes, and the frequencies and distributions in the different text zones were analyzed. The full texts from PubMedCentral contain more information than their abstracts while covering almost all of the content given in those abstracts. In the patents from the TREC Chemistry Track, the difference is even more extreme: the description section includes almost all entities mentioned in a patent, and, compared with the claim section, at least 79 % of all entities occur exclusively in the description.
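To make the zone comparison concrete, the following minimal sketch shows one way such coverage and exclusivity figures could be computed from per-zone entity sets. It is not the authors' pipeline: the function name `zone_overlap`, the choice of the union of both zones as the denominator, and the toy entity sets are all assumptions made for illustration.

```python
def zone_overlap(description, claims):
    """Compare the entities found in two text zones of one patent.

    Assumption: "all entities" is taken to be the union of both zones.
    """
    description, claims = set(description), set(claims)
    all_entities = description | claims
    if not all_entities:
        return {"description_coverage": 0.0, "description_exclusive": 0.0}
    return {
        # share of all entities that appear in the description at all
        "description_coverage": len(description) / len(all_entities),
        # share of all entities that appear only in the description
        "description_exclusive": len(description - claims) / len(all_entities),
    }


# Hypothetical toy data, not taken from the paper.
description_entities = {"aspirin", "ibuprofen", "COX-2", "TP53", "ethanol"}
claim_entities = {"aspirin"}

stats = zone_overlap(description_entities, claim_entities)
print(stats)
```

With real data, the two sets would hold normalized entity identifiers produced by the recognizers for each zone, and the statistics would be aggregated over all patents in the corpus.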
