Intelligent Text Processing Techniques for Textual-Profile Gene Characterization

Most genetic diseases have many candidate genes that could cause them, and gene expression analysis typically yields a large number of co-expressed genes that are potentially responsible for a given disease. Extracting prior knowledge from text-based genomic information sources is therefore essential for reducing the list of candidate genes that must be analyzed further in the laboratory. In this paper we present a suite of Machine Learning algorithms and knowledge-based components for improving computational gene prioritization based on textual profiles. The suite includes basic Natural Language Processing capabilities, advanced text classification and clustering algorithms, robust information extraction components based on qualitative and quantitative keyword extraction methods, and the exploitation of lexical knowledge bases for semantic text processing.
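As a rough illustration of what textual-profile based gene prioritization can look like, the sketch below builds a TF-IDF keyword profile for each candidate gene and ranks the genes by cosine similarity to a disease description. This is not the suite described in the paper: the weighting scheme, the function names (`tfidf_profiles`, `prioritize`), the gene labels and the toy texts are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_profiles(docs):
    """Map each document name to a {term: TF-IDF weight} profile.

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n = len(docs)
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))  # count each term once per document
    profiles = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        profiles[name] = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
    return profiles


def cosine(a, b):
    # Cosine similarity between two sparse term-weight dictionaries.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def prioritize(disease_text, gene_texts):
    """Rank genes by similarity of their textual profile to the disease."""
    docs = dict(gene_texts)
    docs["__disease__"] = disease_text  # reserved key, hypothetical
    profiles = tfidf_profiles(docs)
    query = profiles.pop("__disease__")
    scores = {g: cosine(query, p) for g, p in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up texts standing in for literature abstracts or annotations.
gene_texts = {
    "GENE_A": "insulin secretion pancreatic beta cell glucose",
    "GENE_B": "collagen bone matrix skeletal development",
    "GENE_C": "glucose transport muscle insulin signalling",
}
disease_text = "diabetes insulin glucose pancreatic beta cell dysfunction"

ranking = prioritize(disease_text, gene_texts)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

On this toy input the gene whose text shares the most terms with the disease description is ranked first, and a gene with no shared terms scores zero. A realistic pipeline would add the stemming, keyword selection and lexical-knowledge steps that the abstract lists.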
