Automated Gene Classification using Nonnegative Matrix Factorization on Biomedical Literature

Understanding functional gene relationships is a challenging problem for biological applications. High-throughput technologies such as DNA microarrays have inundated biologists with a wealth of information; however, processing that information remains problematic. To help with this problem, researchers have begun applying text mining techniques to the biological literature. This work extends previous work based on Latent Semantic Indexing (LSI) by examining Nonnegative Matrix Factorization (NMF). Whereas LSI incorporates the singular value decomposition (SVD) to approximate data in a dense, mixed-sign space, NMF produces a parts-based factorization that is directly interpretable. This space can, in theory, be used to augment existing ontologies and annotations by identifying themes within the literature. Of course, performing NMF does not come without a price, namely the large number of parameters. This work attempts to analyze the effects of some of the NMF parameters on both convergence and labeling accuracy. Since there is a dearth of automated label evaluation techniques as well as “gold standard” hierarchies, a method to produce “correct” trees is proposed, as well as a technique to label trees and to evaluate those labels.
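The contrast between the two factorizations can be illustrated with a minimal sketch. The abstract does not state which NMF algorithm, rank, initialization, or stopping rule was used, so everything here is an assumption made for illustration: the widely used multiplicative update rule for the Frobenius-norm objective, a random initialization, a fixed iteration count, and a made-up 4 x 4 term-by-document matrix. The function name `nmf_multiplicative` and its parameters are hypothetical.

```python
import numpy as np


def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix A (m x n) as W @ H with W, H >= 0.

    Uses multiplicative updates for the Frobenius-norm objective.
    The rank k, iteration count, and random initialization are exactly
    the kind of parameters whose effect would need to be studied.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # term-by-feature factor (assumed layout)
    H = rng.random((k, n))  # feature-by-document factor (assumed layout)
    for _ in range(iters):
        # Each update multiplies by a nonnegative ratio, so W and H
        # can never become negative; eps guards against division by zero.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Made-up term-by-document counts with two obvious "themes".
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

W, H = nmf_multiplicative(A, k=2)

# Rank-2 truncated SVD, the decomposition underlying LSI, for comparison.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = (U[:, :2] * s[:2]) @ Vt[:2]

print("NMF reconstruction error:", np.linalg.norm(A - W @ H))
print("SVD reconstruction error:", np.linalg.norm(A - A_svd))
print("smallest entry of W:", W.min(), "| smallest entry of U:", U.min())
```

The truncated SVD gives the best rank-k approximation in the Frobenius norm, so the NMF error can never be lower. What NMF offers in exchange is that every entry of `W` and `H` is nonnegative, so each column of `W` can be read as an additive group of weighted terms, whereas singular vectors are in general free to mix signs.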
