Multi-view Ensemble Classification for Clinically Actionable Genetic Mutations

This paper presents our solution to Task IV of the NIPS 2017 Competition Track, Classifying Clinically Actionable Genetic Mutations, which aims to classify genetic mutations based on text evidence from the clinical literature. We propose a novel multi-view machine learning framework with ensemble classification models to solve this problem. During the challenge, we comprehensively explored feature combinations derived from three complementary views: the document view, the entity text view, and the entity name view. In our comparisons, an ensemble of nine basic gradient boosting models performed best. Our approach achieved a logarithmic loss of 0.5506 on a fixed split of the stage-1 testing phase and 0.6694 under 5-fold cross-validation, placing our team in the top three of NIPS 2017 Competition Track IV.
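To make the evaluation metric and the ensembling idea concrete, the following is a minimal sketch in plain Python. It assumes that base-model outputs are combined by simple probability averaging, which the abstract does not state. The function names, the three-class toy data, and the one-model-per-view setup are all hypothetical and are not taken from the competition.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0), then renormalize so the row sums to 1.
        clipped = [min(max(p, eps), 1 - eps) for p in row]
        norm = sum(clipped)
        total += -math.log(clipped[label] / norm)
    return total / len(y_true)


def average_ensemble(model_probs):
    """Average class probabilities over models.

    model_probs: list of per-model predictions, each shaped
    [n_samples][n_classes]. Plain averaging is an assumption here;
    the paper may combine its base models differently.
    """
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[i][k] for m in model_probs) / n_models for k in range(n_classes)]
        for i in range(n_samples)
    ]


# Hypothetical toy predictions from one model per view (2 samples, 3 classes).
document_view = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
entity_text_view = [[0.6, 0.3, 0.1], [0.1, 0.7, 0.2]]
entity_name_view = [[0.5, 0.4, 0.1], [0.3, 0.4, 0.3]]
labels = [0, 1]

ensemble = average_ensemble([document_view, entity_text_view, entity_name_view])
print("ensemble log loss:", multiclass_log_loss(labels, ensemble))
```

A lower logarithmic loss is better. Because the metric penalizes confident wrong predictions heavily, averaging over models trained on different views tends to temper overconfident errors from any single model.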
