Modelling Noun-Phrase Dynamics in Specialized Text Collections

Abstract Biology has entered a new era, driven by new information-processing frameworks and high-throughput experiments. This has led to a high publication rate and to large, accessible databases in English, which make it possible to build text collections for any specialized domain. Processing such text data benefits from a systematic analysis of language properties, and in particular from a description of how those properties are distributed. In this article, because scientific publications are time-stamped, we analyse the distribution profiles of noun-phrases (i.e. “content-words”) over time. This time-dependency analysis, which takes into account the sequential occurrence of features, reveals specific behaviour. Single content-word distributions appear to be linearly shaped. Associations of content-words, in contrast, are distributed differently over time, following a mixed beta distribution.
