Information extraction from full text scientific articles: Where are the keywords?

Background: To date, many of the methods for extracting biological information from scientific articles are restricted to the article's abstract. However, full-text articles in electronic form, which offer larger sources of data, are now available. This raises the questions of whether scanning full-text articles is worth the effort, and whether the information that can be extracted from the different sections of an article is relevant.

Results: In this work we address these questions by showing that the keyword content of the different sections of a standard scientific article (abstract, introduction, methods, results, and discussion) is very heterogeneous.

Conclusions: Although the abstract contains the best ratio of keywords to total words, other sections of the article may be a better source of biologically relevant data.
