Automatic Extraction and Evaluation of Human Activity Using Conditional Random Fields and Self-Supervised Learning

In our definition, human activity can be expressed by five basic attributes: actor, action, object, time and location. The goal of this paper is describe a method to automatically extract all of the basic attributes and the transition between activities derived from sentences in Japanese web pages. However, previous work had some limitations, such as high setup costs, inability to extract all attributes, limitation on the types of sentences that can be handled, and insufficient consideration interdependency among attributes. To resolve these problems, this paper proposes a novel approach that uses conditional random fields and self-supervised learning. Given a small corpus sample as input, it automatically makes its own training data and a feature model. Based on the feature model, it automatically extracts all of the attributes and the transition between the activities in each sentence retrieved from the Web corpus. This approach treats activity extraction as a sequence labeling problem, and has advantages such as domain-independence, scalability, and does not require any human input. Since it is unnecessary to fix the number of elements in a tuple, this approach can extract all of the basic attributes and the transition between activities by making only a single pass. Additionally, by converting to simpler sentences, the approach can deal with complex sentences retrieved from the Web. In an experiment, this approach achieves high precision (activity: 88.9%, attributes: over 90%, transition: 87.5%).

[1]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[2]  Matthai Philipose,et al.  Mining models of human activities from the web , 2004, WWW '04.

[3]  Jr. G. Forney,et al.  The viterbi algorithm , 1973 .

[4]  Oren Etzioni,et al.  The Tradeoffs Between Open and Traditional Relation Extraction , 2008, ACL.

[5]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[6]  Koustuv Dasgupta,et al.  User interests in social media sites: an exploration with micro-blogs , 2009, CIKM.

[7]  Kôiti Hasida,et al.  Inferring Long-term User Properties Based on Users' Location History , 2007, IJCAI.

[8]  Sergey Brin,et al.  Extracting Patterns and Relations from the World Wide Web , 1998, WebDB.

[9]  Yuji Matsumoto,et al.  Japanese Dependency Analysis using Cascaded Chunking , 2002, CoNLL.

[10]  Mor Naaman,et al.  Is it really about me?: message content in social awareness streams , 2010, CSCW '10.

[11]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[12]  Oren Etzioni,et al.  Open Information Extraction from the Web , 2007, CACM.

[13]  Doug Downey,et al.  Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison , 2004, AAAI.

[14]  Luis Gravano,et al.  Snowball: extracting relations from large plain-text collections , 2000, DL '00.

[15]  Shinichiro Takagi,et al.  Japanese Morphological Analyzer using Word Co-occurence -JTAG , 1998, COLING-ACL.

[16]  Jeffrey P. Bigham,et al.  Organizing and Searching the World Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge , 2006, AAAI.