Robust Text Classification for Sparsely Labelled Data Using Multi-level Embeddings

The conventional solution for handling sparsely labelled data is extensive feature engineering, which is time-consuming and both task- and domain-specific. We present a novel approach for learning embedded features that aims to alleviate this problem. Our approach jointly learns embeddings at different levels of granularity (word, sentence and document) along with the class labels. The intuition is that representing topic semantics with embeddings at multiple levels yields better classification. We evaluate this approach in unsupervised and semi-supervised settings on two sparsely labelled classification tasks, where it outperforms handcrafted models and several embedding baselines.
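As a rough illustration of what "jointly learning embeddings at several levels along with the class labels" could look like, the following minimal sketch trains word, sentence and document embedding tables together through a single softmax classifier. It is not the authors' model: the toy corpus, the mean-pooling and concatenation scheme, the purely supervised cross-entropy objective, and all names (`W_word`, `W_sent`, `W_doc`, `W_cls`, `features`, `sgd_step`) are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus (hypothetical): a document is a list of sentences,
# a sentence is a list of word ids.
docs = [
    [[0, 1, 2], [2, 3]],
    [[4, 5], [5, 6, 4]],
]
labels = [0, 1]  # one class label per document
vocab_size, dim, n_classes = 7, 8, 2

# Give every sentence a global id so it can own an embedding row.
sent_ids, next_id = [], 0
for doc in docs:
    sent_ids.append(list(range(next_id, next_id + len(doc))))
    next_id += len(doc)

# One trainable table per level of granularity, plus a softmax classifier.
W_word = rng.normal(scale=0.1, size=(vocab_size, dim))
W_sent = rng.normal(scale=0.1, size=(next_id, dim))
W_doc = rng.normal(scale=0.1, size=(len(docs), dim))
W_cls = rng.normal(scale=0.1, size=(3 * dim, n_classes))


def features(d):
    """Concatenate mean word, mean sentence, and document embeddings."""
    words = [w for sent in docs[d] for w in sent]
    return np.concatenate([
        W_word[words].mean(axis=0),
        W_sent[sent_ids[d]].mean(axis=0),
        W_doc[d],
    ])


def class_probs(x):
    """Softmax over class logits for one feature vector."""
    logits = x @ W_cls
    e = np.exp(logits - logits.max())
    return e / e.sum()


def mean_loss():
    """Average cross-entropy over all labelled documents."""
    return float(np.mean([-np.log(class_probs(features(d))[labels[d]])
                          for d in range(len(docs))]))


def sgd_step(d, lr=0.5):
    """One cross-entropy gradient step that updates all three tables."""
    words = [w for sent in docs[d] for w in sent]
    x = features(d)
    err = class_probs(x)
    err[labels[d]] -= 1.0            # dLoss/dlogits
    grad_x = W_cls @ err             # dLoss/dfeatures, shape (3 * dim,)
    W_cls[:] -= lr * np.outer(x, err)
    g_word, g_sent, g_doc = np.split(grad_x, 3)
    # np.subtract.at accumulates correctly when a word id repeats.
    np.subtract.at(W_word, words, lr * g_word / len(words))
    W_sent[sent_ids[d]] -= lr * g_sent / len(sent_ids[d])
    W_doc[d] -= lr * g_doc


initial_loss = mean_loss()
for _ in range(100):
    for d in range(len(docs)):
        sgd_step(d)
final_loss = mean_loss()
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The point of the sketch is only that the label signal flows back into every level at once: a single classification error updates the word, sentence and document vectors that contributed to the prediction. An unsupervised or semi-supervised variant, as evaluated in the paper, would need an additional objective that does not depend on labels, which this example omits.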
