@Note: A workbench for Biomedical Text Mining

Biomedical Text Mining (BioTM) provides valuable approaches to the automated curation of scientific literature. However, most efforts have focused on benchmarking new algorithms rather than on users' operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial for solving real-world problems and promoting further research. We present @Note, a BioTM platform that aims to translate advances effectively among three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are: the ability to process abstracts and full texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows annotations to be corrected; and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves interoperability, modularity and flexibility when integrating in-house and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasising the principles of transparency and simplicity of use. Although development is still ongoing, @Note has already supported the development of applications that are currently in use.
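
To make the pre-processing and lexicon-based annotation steps concrete, the following is a minimal Python sketch of tokenisation, stopword removal and dictionary lookup. It is not @Note's implementation: the function names, the stopword list and the three-entry lexicon are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "of", "in", "a", "an", "and", "is", "to", "by"}

# Hypothetical lexicon mapping terms to semantic classes.
LEXICON = {
    "RelA": "protein",
    "ppGpp": "compound",
    "Escherichia coli": "organism",
}


def tokenise(text):
    """Split text into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def remove_stopwords(tokens):
    """Drop tokens whose lower-cased form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def annotate(text, lexicon):
    """Return (start, end, matched_text, semantic_class) tuples for every
    case-insensitive, whole-word occurrence of a lexicon term, sorted by
    start offset."""
    annotations = []
    for term, label in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), match.group(0), label))
    return sorted(annotations)


sentence = "RelA synthesises ppGpp in Escherichia coli."
tokens = remove_stopwords(tokenise(sentence))
for start, end, surface, label in annotate(sentence, LEXICON):
    print(f"{start}-{end}\t{surface}\t{label}")
```

Annotations are stored as character offsets into the original text rather than token indices, so that a multi-word term such as "Escherichia coli" can be matched and later corrected by a curator without re-tokenising the document.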
