Paper information - Categorizing Children: Automated Text Classification of CHILDES files

In this paper we apply machine learning text classification methods to two tasks: categorizing children's speech in the CHILDES Database by gender and by age. Both tasks are binary. For age, we distinguish two groups of children between the ages of 1.9 and 3.0 years. The boundary between the groups lies at the age of 2.4, which is both the mean and the median age in our data set. We show that a machine learning approach based on a bag of words achieves much better results than features such as average utterance length or Type-Token Ratio, the measures traditionally used by linguists. We achieve classification accuracies of 80.5% and 70.5% for the age and gender tasks, respectively.
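The feature types contrasted in the abstract can be sketched as follows. This is only an illustration, not the authors' pipeline: the whitespace tokenizer, the function names, the sample utterances, and the choice to place a child aged exactly 2.4 in the older group are all assumptions made for the example.

```python
from collections import Counter


def tokenize(utterance):
    # Assumption: lowercase whitespace tokenization; the paper's actual
    # preprocessing of CHILDES transcripts is not described here.
    return utterance.lower().split()


def mean_utterance_length(utterances):
    # Average number of tokens per utterance (0.0 for empty input).
    if not utterances:
        return 0.0
    return sum(len(tokenize(u)) for u in utterances) / len(utterances)


def type_token_ratio(utterances):
    # Number of distinct word forms divided by the total number of tokens.
    tokens = [t for u in utterances for t in tokenize(u)]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def bag_of_words(utterances):
    # Word-count vector: one count per word form, word order discarded.
    return Counter(t for u in utterances for t in tokenize(u))


def age_group(age, boundary=2.4):
    # Assumption: a child exactly at the boundary is assigned to the older group.
    return "older" if age >= boundary else "younger"


# Hypothetical utterances, invented for illustration only.
sample = ["more juice", "doggie go", "more more cookie"]
features = bag_of_words(sample)
```

The two traditional measures reduce a transcript to a single number each, whereas the bag of words keeps one count per word form; those counts are what a classifier would be trained on.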

[1] S. Gillis, et al. Kindertaalverwerving: een handboek voor het Nederlands, 2000.

[2] Yoav Freund, et al. A Short Introduction to Boosting, 1999.

[3] Vladimir Vapnik, et al. The Nature of Statistical Learning, 1995.

[4] Shlomo Argamon, et al. Automatically Categorizing Written Texts by Author Gender, 2002, Lit. Linguistic Comput.

[5] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[6] B. MacWhinney. The CHILDES project: tools for analyzing talk, 1992.

[7] Chih-Jen Lin, et al. A Practical Guide to Support Vector Classification, 2008.

[8] Ronen Feldman, et al. Book Reviews: The Text Mining Handbook: Advanced Approaches to Analyzing Unstructured Data by Ronen Feldman and James Sanger, 2008, CL.

[9] Fabrizio Sebastiani, et al. Machine learning in automated text categorization, 2001, CSUR.