Age and Gender Prediction on Health Forum Data

Health support forums have become a rich source of data that can be used to improve health care outcomes. A user profile that includes information such as age and gender can support targeted analysis of forum data, but users do not always disclose their age and gender. It is therefore desirable to extract this information automatically from users’ content. However, to the best of our knowledge, no such resource exists for author profiling of health forum data. Here we present a large corpus of close to 85,000 users for author profiling, and we outline our approach and benchmark results for automatically detecting a user’s age and gender from their forum posts. We combine features from a user’s text with forum-specific features to obtain accuracy well above the baseline, showing that both our dataset and our method are useful and valid.
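
As a rough illustration of combining text features with forum-specific features and comparing against a baseline, the sketch below builds one sparse feature dictionary per user and computes a majority-class baseline accuracy. It is only a minimal sketch under stated assumptions: the abstract does not specify the actual feature set, classifier, or baseline, so the bag-of-words counts, the sub-forum membership feature, the majority-class baseline, and all function names here are hypothetical.

```python
from collections import Counter


def text_features(posts):
    """Bag-of-words counts over a user's posts (lowercased, whitespace-split).

    Assumption: the abstract does not name its text features; word counts
    are used here purely for illustration.
    """
    counts = Counter()
    for post in posts:
        counts.update("word=" + tok for tok in post.lower().split())
    return counts


def forum_features(subforums):
    """Hypothetical forum-specific feature: how often a user posted in each sub-forum."""
    return Counter("forum=" + name for name in subforums)


def user_features(posts, subforums):
    """Merge both feature groups into one sparse feature dictionary."""
    feats = text_features(posts)
    feats.update(forum_features(subforums))
    return dict(feats)


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label.

    Assumption: a majority-class baseline; the abstract does not say
    which baseline was used.
    """
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)
```

A dictionary produced by `user_features` could be passed to any classifier that accepts sparse features, and its accuracy compared with the value returned by `majority_baseline_accuracy` on the same labels.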
