Multivariate Analysis of Finnish Dialect Data - An Overview of Lexical Variation

While a comprehensive dictionary of Finnish dialects was being written, a large set of maps describing the regional distribution of dialect words was compiled in electronic form. In this article, we analyse this corpus to gain new insight into the variation of Finnish dialects. We use a range of multivariate data analysis methods, including principal component analysis, independent component analysis, clustering, and multidimensional scaling. We explain how to preprocess the data to overcome the uneven sampling caused by the way the data was collected. We discuss the results obtained with these methods and compare them to the traditional view of Finnish dialect groups.
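
As a rough illustration of the kind of analysis summarised above, the following sketch applies principal component analysis to a small presence/absence matrix. Everything in it is an assumption made for illustration: the toy matrix `presence`, the helper names `normalise_rows` and `pca`, and in particular the choice of scaling each row to unit length as a stand-in for correcting uneven sampling. The abstract does not specify the preprocessing actually used, so this should not be read as the authors' method.

```python
import numpy as np

# Hypothetical toy data: rows are municipalities, columns are dialect words,
# and 1 means the word was recorded in that municipality.
presence = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],   # a sparsely sampled municipality
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)


def normalise_rows(x):
    """Scale each row to unit Euclidean length.

    Assumed remedy for uneven sampling: a municipality with few recorded
    words then carries the same overall weight as a well-sampled one.
    """
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return x / norms


def pca(x, n_components):
    """Principal component analysis via SVD of the column-centred matrix."""
    centred = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # municipality coordinates
    explained = s**2 / np.sum(s**2)                   # variance share per component
    return scores, vt[:n_components], explained[:n_components]


scores, components, explained = pca(normalise_rows(presence), n_components=2)
print(scores.shape, components.shape)
```

Each row of `scores` places one municipality in the space of the leading components, and each row of `components` gives the word loadings for one component. Plotting the scores on a map is one way such results could be compared with traditional dialect boundaries.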
