Statistical Analysis of the Indus Script Using n-Grams

The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. The distributions of text beginners and text enders are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. Model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information-theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar that may be used to explore the syntax of the Indus script in greater detail.
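
As an illustration of the log-likelihood measure of association mentioned above, the following minimal sketch counts adjacent sign pairs and scores one pair with the G² statistic computed from a 2x2 contingency table. The toy corpus, the sign labels "A" to "D", and the function names are hypothetical and chosen only for illustration; this is not the Indus corpus and not necessarily the exact procedure used in the paper.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each text is a sequence of made-up sign labels.
texts = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["C", "A", "B"],
    ["D", "C", "A"],
    ["B", "D", "C"],
]


def bigram_counts(texts):
    """Count adjacent sign pairs within each text (never across texts)."""
    counts = Counter()
    for text in texts:
        counts.update(zip(text, text[1:]))
    return counts


def log_likelihood(pair, counts):
    """G^2 statistic for one pair, from its 2x2 contingency table."""
    first, second = pair
    n = sum(counts.values())
    k11 = counts[pair]
    # Marginals: pairs starting with `first`, pairs ending with `second`.
    row1 = sum(c for (a, _), c in counts.items() if a == first)
    col1 = sum(c for (_, b), c in counts.items() if b == second)
    observed = [
        [k11, row1 - k11],
        [col1 - k11, n - row1 - col1 + k11],
    ]
    rows = [row1, n - row1]
    cols = [col1, n - col1]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


counts = bigram_counts(texts)
for pair, freq in counts.most_common():
    print(pair, freq, round(log_likelihood(pair, counts), 3))
```

Ranking pairs by this score rather than by raw frequency is one way to see why a frequent pair need not be a significant one: the score compares the observed count with what the two signs' individual frequencies would predict under independence.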
