Twitter user profiling based on text and community mining for market analysis

This paper proposes demographic estimation algorithms for profiling Twitter users, based on their tweets and community relationships. Many people post their opinions via social media services such as Twitter. This huge volume of opinions, expressed in real time, has great appeal for novel marketing applications. When automatically extracting these opinions, it is desirable to be able to discriminate among them based on user demographics, because the ratio of positive to negative opinions differs depending on demographics such as age, gender, and area of residence, all of which are essential for market analysis. In this paper, we propose a hybrid text-based and community-based method for the demographic estimation of Twitter users, where these demographics are estimated by tracking tweet history and clustering followers/followees. Our experimental results from 100,000 Twitter users show that the proposed hybrid method improves the accuracy of the text-based method. The proposed method is applicable to various user demographics and is suitable even for users who tweet only infrequently.
