Toward any-language zero-shot topic classification of textual documents

Abstract In this paper, we present a zero-shot classification approach to document classification in any language into topics which can be described by English keywords. This is done by embedding both labels and documents into a shared semantic space that allows one to compute meaningful semantic similarity between a document and a potential label. The embedding space can be created by either mapping into a Wikipedia-based semantic representation or learning cross-lingual embeddings. But if the Wikipedia in the target language is small or there is not enough training corpus to train a good embedding space for low-resource languages, then performance can suffer. Thus, for low-resource languages, we further use a word-level dictionary to convert documents into a high-resource language, and then perform classification based on the high-resource language. This approach can be applied to thousands of languages, which can be contrasted with machine translation, which is a supervision-heavy approach feasible for about 100 languages. We also develop a ranking algorithm that makes use of language similarity metrics to automatically select a good pivot or bridging high-resource language, and show that this significantly improves classification of low-resource language documents, performing comparably to the best bridge possible.

[1]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[2]  Evgeniy Gabrilovich,et al.  Wikipedia-based Semantic Interpretation for Natural Language Processing , 2014, J. Artif. Intell. Res..

[3]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[4]  Philipp Cimiano,et al.  Exploiting Wikipedia for cross-lingual and multilingual information retrieval , 2012, Data Knowl. Eng..

[5]  Pietro Perona,et al.  One-shot learning of object categories , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Nigel Collier,et al.  Towards a Seamless Integration of Word Senses into Downstream NLP Applications , 2017, ACL.

[7]  Yoshua Bengio,et al.  Word Representations: A Simple and General Method for Semi-Supervised Learning , 2010, ACL.

[8]  Andrew Y. Ng,et al.  Zero-Shot Learning Through Cross-Modal Transfer , 2013, NIPS.

[9]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[10]  Susan T. Dumais,et al.  Hierarchical classification of Web content , 2000, SIGIR '00.

[11]  Benno Stein,et al.  Cross-Language Text Classification Using Structural Correspondence Learning , 2010, ACL.

[12]  James Henderson,et al.  A Model of Zero-Shot Learning of Spoken Language Understanding , 2015, EMNLP.

[13]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[14]  Hitoshi Isahara,et al.  A Comparison of Pivot Methods for Phrase-Based Statistical Machine Translation , 2007, NAACL.

[15]  Steven Skiena,et al.  Polyglot: Distributed Word Representations for Multilingual NLP , 2013, CoNLL.

[16]  Ken Lang,et al.  NewsWeeder: Learning to Filter Netnews , 1995, ICML.

[17]  Mirella Lapata,et al.  Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora , 2007, ACL.

[18]  Roberto Navigli,et al.  Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia , 2016, IJCAI.

[19]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[20]  Dan Roth,et al.  Cross-Lingual Dataless Classification for Many Languages , 2016, IJCAI.

[21]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[22]  Simone Paolo Ponzetto,et al.  BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network , 2012, Artif. Intell..

[23]  Benno Stein,et al.  A Wikipedia-Based Multilingual Retrieval Model , 2008, ECIR.

[24]  Alessandro Raganato,et al.  Sew-Embed at SemEval-2017 Task 2: Language-Independent Concept Representations from a Semantically Enriched Wikipedia , 2017, SemEval@ACL.

[25]  Guillaume Lample,et al.  Word Translation Without Parallel Data , 2017, ICLR.

[26]  Dan Roth,et al.  On Dataless Hierarchical Text Classification , 2014, AAAI.

[27]  Georgiana Dinu,et al.  Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning , 2015, ACL.

[28]  Ivan Titov,et al.  Inducing Crosslingual Distributed Representations of Words , 2012, COLING.

[29]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[30]  Ming-Wei Chang,et al.  Importance of Semantic Representation: Dataless Classification , 2008, AAAI.

[31]  Dan Roth,et al.  Unsupervised Sparse Vector Densification for Short Text Similarity , 2015, NAACL.

[32]  Manaal Faruqui,et al.  Cross-lingual Models of Word Embeddings: An Empirical Comparison , 2016, ACL.

[33]  Eiichiro Sumita,et al.  How to Choose the Best Pivot Language for Automatic Translation of Low-Resource Languages , 2013, ACM Trans. Asian Lang. Inf. Process..

[34]  David Yarowsky,et al.  Multipath Translation Lexicon Induction via Bridge Languages , 2001, NAACL.

[35]  Oren Etzioni,et al.  Panlingual lexical translation via probabilistic inference , 2010, Artif. Intell..

[36]  D. C. Howell Statistical Methods for Psychology , 1987 .

[37]  Geoffrey E. Hinton,et al.  Zero-shot Learning with Semantic Output Codes , 2009, NIPS.

[38]  S. Sathiya Keerthi,et al.  Efficient algorithms for ranking with SVMs , 2010, Information Retrieval.

[39]  Samuel L. Smith,et al.  Offline bilingual word vectors, orthogonal transformations and the inverted softmax , 2017, ICLR.

[40]  Philip H. S. Torr,et al.  An embarrassingly simple approach to zero-shot learning , 2015, ICML.

[41]  Percy Liang,et al.  Semi-Supervised Learning for Natural Language , 2005 .

[42]  Manaal Faruqui,et al.  Improving Vector Space Word Representations Using Multilingual Correlation , 2014, EACL.

[43]  Babak Saleh,et al.  Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.

[44]  Kevin Gimpel,et al.  Deep Multilingual Correlation for Improved Word Embeddings , 2015, HLT-NAACL.

[45]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[46]  Joshua B. Tenenbaum,et al.  Human-level concept learning through probabilistic program induction , 2015, Science.

[47]  ChengXiang Zhai,et al.  Cross-Lingual Latent Topic Extraction , 2010, ACL.

[48]  Min Xiao,et al.  Semi-Supervised Representation Learning for Cross-Lingual Text Classification , 2013, EMNLP.

[49]  Hua Wu,et al.  Revisiting Pivot Language Approach for Machine Translation , 2009, ACL.

[50]  Ido Dagan,et al.  Mistake-Driven Learning in Text Categorization , 1997, EMNLP.

[51]  Manik Varma,et al.  Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages , 2013, WWW.

[52]  Lei Shi,et al.  Cross Language Text Classification by Model Translation and Semi-Supervised Learning , 2010, EMNLP.

[53]  Stephen D. Mayhew,et al.  Cross-lingual Dataless Classification for Languages with Small Wikipedia Presence , 2016, ArXiv.

[54]  Thore Graepel,et al.  Large Margin Rank Boundaries for Ordinal Regression , 2000 .

[55]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[56]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[57]  Phil Blunsom,et al.  Multilingual Models for Compositional Distributed Semantics , 2014, ACL.

[58]  Ignacio Iacobacci,et al.  SensEmbed: Learning Sense Embeddings for Word and Relational Similarity , 2015, ACL.

[59]  Takahiro Hara,et al.  MLJ: Language-Independent Real-Time Search of Tweets Reported by Media Outlets and Journalists , 2014, Proc. VLDB Endow..

[60]  Hermann Ney,et al.  Multi-pivot translation by system combination , 2010, IWSLT.

[61]  Massih-Reza Amini,et al.  A co-classification approach to learning from multilingual corpora , 2010, Machine Learning.

[62]  Roberto Navigli,et al.  Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities , 2016, Artif. Intell..

[63]  Ignacio Iacobacci,et al.  Embeddings for Word Sense Disambiguation: An Evaluation Study , 2016, ACL.

[64]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[65]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[66]  Geoffrey Zweig,et al.  Linguistic Regularities in Continuous Space Word Representations , 2013, NAACL.