Chat mining: Predicting user and message attributes in computer-mediated communication

This paper investigates whether several user and message attributes can be predicted in text-based, real-time, online messaging services. To this end, a large collection of chat messages is examined, and the applicability of various supervised classification techniques for extracting information from these messages is evaluated. Two competing models are used to define the chat mining problem: a term-based approach examines user and message attributes in the context of vocabulary use, while a style-based approach examines the chat messages according to variations in the authors' writing styles. Among 100 authors, the identity of an author is correctly predicted with 99.7% accuracy. The reverse problem is also explored, and the effect of author attributes on computer-mediated communication is discussed.
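The difference between the term-based and style-based views can be illustrated with a minimal Python sketch. It is not the feature set or the classifiers used in the paper: the function names `term_features` and `style_features`, the whitespace tokenization, and the four stylistic measures are all assumptions chosen only to show how the same chat message yields two different representations that a supervised classifier could consume.

```python
from collections import Counter
import string


def term_features(message):
    """Term-based view (illustrative): counts of lowercased whitespace tokens."""
    return Counter(message.lower().split())


def style_features(message):
    """Style-based view (illustrative): measures that ignore which words are used."""
    words = message.split()
    n_chars = len(message)
    return {
        # Message length in words.
        "num_words": len(words),
        # Mean token length in characters; 0.0 for an empty message.
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Share of characters that are ASCII punctuation.
        "punct_ratio": (
            sum(c in string.punctuation for c in message) / n_chars if n_chars else 0.0
        ),
        # Share of characters that are uppercase letters.
        "upper_ratio": (
            sum(c.isupper() for c in message) / n_chars if n_chars else 0.0
        ),
    }
```

Under these assumptions, `term_features("Hi hi there")` counts the token `hi` twice, while `style_features("OK!!")` reports one word with half of its characters being punctuation. Either representation could then be passed to a supervised classifier to predict an attribute such as the author's identity.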
