Extracting unrecognized gene relationships from the biomedical literature via matrix factorizations

Background: The construction of literature-based networks of gene-gene interactions is one of the most important applications of text mining in bioinformatics. Extracting potential gene relationships from the biomedical literature may help in building biological hypotheses that can then be explored experimentally. Recently, latent semantic indexing based on the singular value decomposition (LSI/SVD) has been applied to gene retrieval. However, determining the number of factors k used in the reduced-rank matrix remains an open problem.

Results: In this paper, we introduce a way to incorporate a priori knowledge of gene relationships into LSI/SVD to determine the number of factors. We also explore the utility of non-negative matrix factorization (NMF) for extracting unrecognized gene relationships from the biomedical literature by taking advantage of known gene relationships. A gene retrieval method based on NMF (GR/NMF) showed performance comparable to that of LSI/SVD.

Conclusion: Using the known gene relationships of a given gene, we can determine the number of factors used in the reduced-rank matrix and retrieve unrecognized genes related to the given gene by LSI/SVD or GR/NMF.
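
The idea of letting known gene relationships pick the number of factors k can be illustrated with a minimal sketch. It is not the paper's implementation: the random term-by-gene matrix, the use of cosine similarity in the rank-k space, the choice of average precision as the selection score, and the names `rank_genes`, `average_precision`, and `choose_k` are all assumptions made for illustration.

```python
import numpy as np


def rank_genes(A, query, k):
    """Rank genes by cosine similarity to `query` in a rank-k SVD space.

    A is a term-by-gene matrix (rows: terms, columns: genes).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    G = Vt[:k].T * s[:k]                 # one row of k coordinates per gene
    norms = np.linalg.norm(G, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero
    G = G / norms[:, None]
    sims = G @ G[query]
    sims[query] = -np.inf                # push the query gene to the end
    return np.argsort(-sims)


def average_precision(ranking, known):
    """Average precision of the known related genes within a ranking."""
    known = set(known)
    hits, total = 0, 0.0
    for pos, gene in enumerate(ranking.tolist(), start=1):
        if gene in known:
            hits += 1
            total += hits / pos
    return total / len(known)


def choose_k(A, query, known, k_values):
    """Pick the k whose ranking best recovers the known related genes.

    Assumption: average precision is used as the selection score.
    """
    scores = {k: average_precision(rank_genes(A, query, k), known)
              for k in k_values}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): 30 terms, 8 genes; genes 1 and 2 are
# treated as the known relatives of gene 0.
rng = np.random.default_rng(0)
A = rng.random((30, 8))
best_k, scores = choose_k(A, query=0, known=[1, 2], k_values=range(1, 6))
```

Once k is fixed this way, the genes ranked highly by `rank_genes` that are not in the known set would be the candidates for unrecognized relationships. The sketch recomputes the SVD for every k for simplicity; computing it once and truncating would be cheaper.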
