Language Segmentation

Language segmentation consists in finding the boundaries where one language ends and another language begins in a text written in more than one language. This is important for all natural language processing tasks. The problem can be solved by training language models on language data. However, in the case of low- or no-resource languages, this is problematic. I therefore investigate whether unsupervised methods perform better than supervised methods when it is difficult or impossible to train supervised approaches. A special focus is given to difficult texts, i.e. texts that are rather short (one sentence) or that contain abbreviations, low-resource languages, or non-standard language. I compare three approaches: supervised n-gram language models, unsupervised clustering, and weakly supervised n-gram language model induction. I devised the weakly supervised approach specifically to deal with difficult texts. To test the approach, I compiled a small corpus of different text types, ranging from one-sentence texts to texts of about 300 words. The weakly supervised language model induction approach works well on short and difficult texts, outperforming the clustering algorithm and reaching scores in the vicinity of the supervised approach. The results look promising, but there is room for improvement, and a more thorough investigation should be undertaken.
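To make the supervised baseline concrete, the following is a minimal sketch of word-level segmentation with character n-gram language models. It is an illustration under stated assumptions, not the implementation evaluated here: the names `CharNgramModel`, `char_ngrams`, and `segment`, the use of trigrams with add-one smoothing, and the toy English/German training words are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with start/end markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram language model with add-one smoothing (an assumption)."""

    def __init__(self, words, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of the (n-1)-character prefixes
        for word in words:
            for gram in char_ngrams(word, n):
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1
        # Number of distinct predicted characters, plus one slot for unseen ones.
        self.vocab = len({gram[-1] for gram in self.ngrams}) + 1

    def logprob(self, word):
        """Smoothed log-probability of a word under this model."""
        total = 0.0
        for gram in char_ngrams(word, self.n):
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + self.vocab
            total += math.log(numerator / denominator)
        return total


def segment(tokens, models):
    """Label each token with its most probable language and list the
    token indices at which the label changes (the language boundaries)."""
    labels = [max(models, key=lambda lang: models[lang].logprob(tok))
              for tok in tokens]
    boundaries = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    return labels, boundaries


# Toy training data, invented purely for illustration.
models = {
    "en": CharNgramModel(["the", "weather", "is", "good", "today"]),
    "de": CharNgramModel(["das", "wetter", "ist", "heute", "gut"]),
}
labels, boundaries = segment(["the", "weather", "ist", "gut"], models)
print(labels, boundaries)
```

Each token is scored independently, so the sketch ignores context between neighbouring words. With realistic training data, a model of this kind depends on having enough text per language, which is the limitation that motivates the unsupervised and weakly supervised alternatives.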
