Paper Information - Language Segmentation

Language Segmentation

Language segmentation consists in finding the boundaries where one language ends and another language begins in a text written in more than one language. This is important for all natural language processing tasks. The problem can be solved by training language models on language data. However, in the case of low- or no-resource languages, this is problematic. I therefore investigate whether unsupervised methods perform better than supervised methods when it is difficult or impossible to train supervised approaches. A special focus is given to difficult texts, i.e. texts that are rather short (one sentence), containing abbreviations, low-resource languages and non-standard language. I compare three approaches: supervised n-gram language models, unsupervised clustering and weakly supervised n-gram language model induction. I devised the weakly supervised approach in order to deal with difficult text specifically. In order to test the approach, I compiled a small corpus of different text types, ranging from one-sentence texts to texts of about 300 words. The weakly supervised language model induction approach works well on short and difficult texts, outperforming the clustering algorithm and reaching scores in the vicinity of the supervised approach. The results look promising, but there is room for improvement and a more thorough investigation should be undertaken.
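As a rough illustration of the supervised n-gram baseline named in the abstract, the sketch below labels each token with the language whose character-bigram model scores it highest, then merges adjacent tokens that share a label into segments. This is not the author's implementation. The class `CharNgramModel`, the functions `char_ngrams` and `segment`, the add-one smoothing, and the toy training strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary marks."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Hypothetical character n-gram model with add-one smoothing."""

    def __init__(self, text, n=2):
        self.n = n
        self.counts = Counter()
        for word in text.split():
            self.counts.update(char_ngrams(word, n))
        self.total = sum(self.counts.values())
        # One extra slot stands in for all unseen n-grams (an assumption).
        self.vocab = len(self.counts) + 1

    def log_prob(self, word):
        """Sum of smoothed log relative frequencies of the word's n-grams."""
        return sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in char_ngrams(word, self.n)
        )


def segment(tokens, models):
    """Label each token with its best-scoring language and merge equal runs."""
    segments = []
    for tok in tokens:
        label = max(models, key=lambda lang: models[lang].log_prob(tok))
        if segments and segments[-1][0] == label:
            segments[-1][1].append(tok)
        else:
            segments.append((label, [tok]))
    return segments


if __name__ == "__main__":
    # Toy training data; real models would need far more text per language.
    models = {
        "en": CharNgramModel("the weather is nice and the house is very old"),
        "de": CharNgramModel("das wetter ist schoen und das haus ist sehr alt"),
    }
    for label, words in segment("the house ist sehr alt".split(), models):
        print(label, " ".join(words))
```

Scoring each token independently keeps the sketch short, but it ignores context, which is one reason very short or non-standard texts of the kind the abstract highlights are hard for such a baseline.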

David Alfter
