Toward any-language zero-shot topic classification of textual documents

Abstract: In this paper, we present a zero-shot approach to classifying documents in any language into topics that can be described by English keywords. Both labels and documents are embedded into a shared semantic space, in which the semantic similarity between a document and a candidate label can be computed meaningfully. The embedding space can be created either by mapping into a Wikipedia-based semantic representation or by learning cross-lingual embeddings. However, performance can suffer if the Wikipedia in the target language is small or if there is too little training text to learn a good embedding space, as is the case for low-resource languages. For such languages, we therefore use a word-level dictionary to convert documents into a high-resource language and then perform classification in that language. This approach can be applied to thousands of languages, in contrast to machine translation, a supervision-heavy approach that is feasible for about 100 languages. We also develop a ranking algorithm that uses language similarity metrics to automatically select a good pivot (bridging) high-resource language, and show that it significantly improves classification of low-resource-language documents, performing comparably to the best possible bridge.
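The core idea of the abstract (embed the document and the English label keywords in one space, then pick the most similar label, optionally after a word-level dictionary conversion) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy 3-dimensional vectors, the tiny dictionary, the averaging of word vectors, and the names `embed`, `cosine`, and `classify` are hypothetical and are not taken from the paper, which builds its space from Wikipedia-based representations or learned cross-lingual embeddings.

```python
import math

# Toy 3-dimensional English word vectors (made-up values, illustration only).
EN_VECTORS = {
    "football": [1.0, 0.0, 0.0],
    "game":     [0.9, 0.1, 0.0],
    "team":     [0.8, 0.2, 0.1],
    "election": [0.0, 0.9, 0.1],
    "vote":     [0.1, 1.0, 0.0],
    "minister": [0.0, 0.8, 0.2],
}

# Hypothetical word-level dictionary from a source language into English.
TO_ENGLISH = {"equipo": "team", "partido": "game",
              "voto": "vote", "ministro": "minister"}

def embed(words):
    """Average the vectors of in-vocabulary words; zero vector if none match."""
    vecs = [EN_VECTORS[w] for w in words if w in EN_VECTORS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def classify(doc_tokens, labels, dictionary=None):
    """Return (best_label, scores) for a tokenized document.

    labels maps a label name to its English keywords. If a dictionary is
    given, each token is first replaced by its English entry when one exists.
    """
    if dictionary is not None:
        doc_tokens = [dictionary.get(t, t) for t in doc_tokens]
    doc_vec = embed(doc_tokens)
    scores = {name: cosine(doc_vec, embed(keywords))
              for name, keywords in labels.items()}
    return max(scores, key=scores.get), scores

LABELS = {"sports": ["football", "game"], "politics": ["election", "vote"]}

best, scores = classify(["equipo", "partido"], LABELS, TO_ENGLISH)
print(best, scores)
```

No labeled training documents are used: the only supervision is the English keywords attached to each label. The sketch does not cover the pivot-language ranking algorithm, whose details the abstract does not give.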

Dan Roth | Haoruo Peng | Yangqiu Song | Shyam Upadhyay | Stephen Mayhew
