Application of Lexical Topic Models to Protein Interaction Sentence Prediction

Topic models can improve the classification of sentences that describe protein-protein interactions (PPIs) by condensing the lexical knowledge available in unannotated biomedical text into a semantically informed kernel smoothing matrix. Detecting such sentences is difficult because annotated data are scarce. Furthermore, each sentence generally contains only a small fraction of the features, which leads to sparse training vectors. By exploiting the contextual similarity of words, we are able to improve classification performance. This contextual information is gathered from a large unannotated corpus and incorporated through a semantic kernel. We use the Hyperspace Analogue to Language (HAL) and Bound Encoding of the Aggregate Language Environment (BEAGLE) semantic models to create the kernels. The modularity of the method lends itself to further exploration along several avenues, including experimentation with other word and topic models.
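The idea of a semantic kernel built from an unannotated corpus can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names (hal_vectors, cosine, semantic_kernel), the window size, and the use of cosine similarity as the word-similarity measure are all assumptions made for illustration. It builds HAL-style co-occurrence vectors with a distance-weighted sliding window, then scores two sentences as x_a^T S x_b, where x is a bag-of-words count vector and S holds pairwise word similarities.

```python
from math import sqrt


def hal_vectors(corpus, window=3):
    """Build HAL-style word vectors from a list of tokenized sentences."""
    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    # cooc[i][j]: weighted count of word j occurring before word i.
    cooc = [[0.0] * n for _ in range(n)]
    for sent in corpus:
        for pos, word in enumerate(sent):
            for dist in range(1, window + 1):
                if pos - dist < 0:
                    break
                prev = sent[pos - dist]
                # Closer neighbours receive a larger weight.
                cooc[index[word]][index[prev]] += window - dist + 1
    # A word's vector is its row (preceding context) followed by
    # its column (following context).
    vectors = {}
    for w, i in index.items():
        column = [cooc[j][i] for j in range(n)]
        vectors[w] = cooc[i] + column
    return vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_kernel(sent_a, sent_b, vectors):
    """k(a, b) = x_a^T S x_b, with S[i][j] the cosine of word vectors i, j.

    Summing over token pairs is equivalent to using term-count vectors.
    Words absent from the unannotated corpus contribute nothing.
    """
    total = 0.0
    for wa in sent_a:
        for wb in sent_b:
            if wa in vectors and wb in vectors:
                total += cosine(vectors[wa], vectors[wb])
    return total


# Hypothetical unannotated corpus, already tokenized.
corpus = [
    ["protein", "a", "binds", "protein", "b"],
    ["protein", "a", "interacts", "with", "protein", "b"],
    ["the", "sample", "was", "stored", "overnight"],
]
vectors = hal_vectors(corpus, window=3)
score = semantic_kernel(["binds"], ["interacts"], vectors)
```

In this toy setting, "binds" and "interacts" never appear in the same sentence, so a plain bag-of-words kernel would score them as unrelated; the smoothed kernel assigns them a positive similarity because they occur in similar contexts. Because S is a Gram matrix of normalized word vectors, it is positive semidefinite, so the resulting function is a valid kernel.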
