OBJECTIVES
High-throughput techniques such as cDNA microarrays, oligonucleotide arrays, and serial analysis of gene expression (SAGE) have been developed to screen large volumes of gene expression data automatically. Even so, researchers still spend considerable time and money using these techniques to discover gene-disease relationships. We implemented a prototype algorithm that gives biological researchers predicted results before they proceed with experiments, helping them discover gene-disease relationships more efficiently.
METHODS
With the rapid development of computer technology, many information retrieval techniques have been applied to the large digital biomedical databases available worldwide. We therefore expected that information retrieval (IR) techniques could be applied to MEDLINE articles to extract useful information about the relationships between specific diseases and genes. We also applied natural language processing (NLP) methods to analyze the relevant articles semantically and discover the relationships between genes and diseases.
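As an illustration of the kind of IR ranking described above, here is a minimal sketch that scores documents against a query with TF-IDF weights. The abstract does not say which ranking scheme was used, so TF-IDF is an assumption, and the function name `rank_documents`, the placeholder terms, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def split_terms(text):
    # Naive whitespace tokenizer; a real system would normalize far more.
    return text.lower().split()


def rank_documents(query, docs):
    """Return indices of docs with a positive TF-IDF score, best first."""
    tokenized = [split_terms(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = split_terms(query)

    def score(toks):
        tf = Counter(toks)
        # Smoothed IDF; a term found in every document contributes zero.
        return sum(
            tf[t] * math.log((1 + n) / (1 + df[t]))
            for t in query_terms
            if t in tf
        )

    scored = [(score(toks), i) for i, toks in enumerate(tokenized)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for s, i in scored if s > 0]


# Toy, invented documents with placeholder gene and disease names.
docs = [
    "genea mutation in diseasex",
    "diseasex treatment overview",
    "protein folding kinetics",
]
print(rank_documents("genea diseasex", docs))
```

The document that mentions both query terms ranks first, and the document that mentions neither is dropped from the result.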
RESULTS
We extracted gene symbols from our literature collection according to disease MeSH classifications. We also built an IR-based retrieval system, the Biomedical Literature Retrieval System (BLRS), and applied an N-gram model to extract relationship features, which reveal the relationships between genes and diseases. Finally, we built a relationship network for a specific disease to represent its gene-disease relationships.
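One way to picture N-gram extraction of relationship features is to count the word n-grams that occur between a gene symbol and a disease term in the same sentence. The sketch below does this under simplifying assumptions: both the gene and the disease are single tokens, and the sentences, `GENEA`, and `diseasex` are invented placeholders. It is not the authors' implementation.

```python
import re
from collections import Counter


def words(sentence):
    # Lowercase word tokens; hyphens are kept inside tokens.
    return re.findall(r"[A-Za-z0-9\-]+", sentence.lower())


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relationship_ngrams(sentences, gene, disease, max_n=2):
    """Count n-grams found between the gene and the disease mention."""
    counts = Counter()
    gene_tok, disease_tok = gene.lower(), disease.lower()
    for sentence in sentences:
        tokens = words(sentence)
        if gene_tok not in tokens or disease_tok not in tokens:
            continue  # only sentences that mention both are informative
        i, j = tokens.index(gene_tok), tokens.index(disease_tok)
        between = tokens[min(i, j) + 1:max(i, j)]
        for n in range(1, max_n + 1):
            counts.update(ngrams(between, n))
    return counts


# Invented sentences with placeholder names, for illustration only.
sentences = [
    "GENEA is associated with diseasex in patients.",
    "Mutations of GENEA are associated with diseasex.",
    "GENEA inhibits growth.",
]
counts = relationship_ngrams(sentences, "GENEA", "diseasex")
print(counts.most_common(3))
```

N-grams that recur across many sentences, such as the bigram "associated with" in this toy input, are candidates for relationship features.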
CONCLUSIONS
A relationship feature is a functional word that reveals the relationship between a single gene and a disease. By incorporating many modern IR techniques, BLRS proved to be a powerful information discovery tool for literature searching. A relationship network that contains gene symbols, relationship features, and disease MeSH terms provides an integrated view for discovering gene-disease relationships.
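A relationship network of this kind can be sketched as a mapping from a disease MeSH term to its genes, with each gene-disease edge labeled by the relationship features that link the two. The data structure, the function names `build_network` and `genes_for`, and the triples below are hypothetical placeholders chosen for illustration, not data or code from the study.

```python
from collections import defaultdict


def build_network(triples):
    """Group (gene, feature, disease) triples by disease, then by gene."""
    network = defaultdict(lambda: defaultdict(set))
    for gene, feature, disease in triples:
        network[disease][gene].add(feature)
    return network


def genes_for(network, disease):
    """Return each gene linked to the disease with its sorted features."""
    linked = network.get(disease, {})
    return {gene: sorted(features) for gene, features in linked.items()}


# Placeholder triples: (gene symbol, relationship feature, disease MeSH term).
triples = [
    ("GENEA", "associated", "DiseaseX"),
    ("GENEA", "overexpressed", "DiseaseX"),
    ("GENEB", "inhibits", "DiseaseX"),
    ("GENEB", "associated", "DiseaseY"),
]
network = build_network(triples)
print(genes_for(network, "DiseaseX"))
```

Keying the network by disease keeps the lookup for a specific disease cheap, which matches the per-disease view described above.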