Gender Classification for Web Forums

More and more women are participating in and exchanging opinions through community-based online social media, which has raised questions about gender differences in these new media. This paper proposes a feature-based text classification framework that examines online gender differences between Web forum posters by analyzing their writing styles and topics of interest. Our experiment on an Islamic women's political forum shows that (1) feature sets containing both content-free and content-specific features perform significantly better than those consisting of only content-free features, (2) feature selection can improve the classification results significantly, and (3) female and male participants have significantly different topics of interest.
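
To make the distinction between the two feature families concrete, the sketch below extracts a few content-free (style) features and content-specific (topic word) features from a post, and scores a binary feature by information gain. It is only an illustration under stated assumptions: the function-word list, the feature names, the helper functions, and the choice of information gain as the selection criterion are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical function-word list; the paper's actual feature set is not given here.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "i", "you")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def content_free_features(text):
    """Style features that do not depend on the topic of the post."""
    words = tokenize(text)
    n = max(len(words), 1)  # avoid division by zero on empty posts
    feats = {
        "avg_word_len": sum(len(w) for w in words) / n,
        "punct_ratio": sum(ch in ".,;:!?" for ch in text) / max(len(text), 1),
    }
    counts = Counter(words)
    for fw in FUNCTION_WORDS:
        feats["fw_" + fw] = counts[fw] / n  # relative function-word frequency
    return feats


def content_specific_features(text, vocabulary):
    """Topic features: raw counts of words from a given vocabulary."""
    counts = Counter(tokenize(text))
    return {"w_" + w: counts[w] for w in vocabulary}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(present, labels):
    """Information gain of a binary feature with respect to the class labels.

    `present[i]` says whether the feature occurs in post i;
    `labels[i]` is that post's class (for example "f" or "m").
    """
    with_f = [l for p, l in zip(present, labels) if p]
    without_f = [l for p, l in zip(present, labels) if not p]
    n = len(labels)
    conditional = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - conditional
```

In a setup like this, a combined feature set would simply merge the two dictionaries for each post, and feature selection would keep only the features whose score exceeds some threshold before training a classifier.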
