Occupation inference through detection and classification of biographical activities

Working with biographical information (e.g., generating biographies or answering biography-related questions) requires identifying the important activities in the life of the individual in question. Some activities can appear in any biography (e.g., a person was born on a particular date or lived in a particular location), but many activities used in biographies are occupation-related, and others are person-specific. Hence, a person's occupation gives important clues as to which activities should be included in the biography. In this paper, we present a methodology for identifying a three-level hierarchy of biographical activities: activities that apply to the general population, activities that are occupation-related, and activities that are person-specific. We use the obtained occupation-related activities as features for a multi-class SVM classifier that identifies the occupation of a previously unseen individual. We also show that the activities automatically obtained from text can serve as features not only for classification but also for clustering. Given the correct number of clusters, people belonging to the same occupation are clustered together. Clustering people into a smaller number of classes, in turn, groups together practitioners of occupations that share a considerable number of occupation-related activities. Thus, by analyzing descriptions of people belonging to various occupations, we can build a hierarchy of occupations.
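The three-level split can be illustrated with a toy sketch. It is not the procedure used in the paper, which the abstract does not spell out: the function names, the `general_share` threshold, and the sample data are all hypothetical, and a simple overlap count stands in for the multi-class SVM.

```python
from collections import defaultdict


def split_activities(people, general_share=0.8):
    """Split activities into general, occupation-related, and person-specific.

    `people` maps a name to (occupation, set of activity strings).
    The rules and the `general_share` threshold are illustrative assumptions.
    """
    occ_of = {name: occ for name, (occ, _) in people.items()}
    holders = defaultdict(set)  # activity -> names of people it is attested for
    for name, (_, acts) in people.items():
        for act in acts:
            holders[act].add(name)

    n_occupations = len(set(occ_of.values()))
    general = set()
    by_occupation = defaultdict(set)
    by_person = defaultdict(set)
    for act, names in holders.items():
        occs = {occ_of[n] for n in names}
        if len(occs) >= general_share * n_occupations:
            general.add(act)  # attested across (almost) all occupations
        elif len(names) > 1 and len(occs) == 1:
            by_occupation[next(iter(occs))].add(act)  # shared within one occupation
        elif len(names) == 1:
            by_person[next(iter(names))].add(act)  # attested for one person only
        # Anything else (shared by a few occupations) is left unassigned here.
    return general, dict(by_occupation), dict(by_person)


def guess_occupation(activities, by_occupation):
    """Pick the occupation whose activity set overlaps most with `activities`.

    A plain overlap count, used here as a stand-in for the SVM classifier.
    """
    scores = {occ: len(acts & activities) for occ, acts in by_occupation.items()}
    return max(scores, key=scores.get) if scores else None


# Hypothetical sample data.
people = {
    "A": ("composer", {"be born", "compose symphony", "conduct orchestra", "move to Vienna"}),
    "B": ("composer", {"be born", "compose symphony", "conduct orchestra"}),
    "C": ("physicist", {"be born", "publish paper", "win Nobel prize"}),
    "D": ("physicist", {"be born", "publish paper", "teach at Princeton"}),
}
general, by_occupation, by_person = split_activities(people)
predicted = guess_occupation({"be born", "publish paper"}, by_occupation)
```

In this sketch the occupation-related sets double as classifier features: an unseen person is assigned the occupation whose activity set they overlap with most, which mirrors the role those activities play as SVM features in the abstract.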
