Sparse Information Extraction: Unsupervised Language Models to the Rescue

Even in a massive corpus such as the Web, a substantial fraction of extractions appear infrequently. This paper shows how to assess the correctness of sparse extractions by using unsupervised language models. The REALM system, which combines HMM-based and n-gram-based language models, ranks candidate extractions by the likelihood that they are correct. Our experiments show that REALM reduces extraction error by 39%, on average, when compared with previous work. Because REALM pre-computes language models based on its corpus and does not require any hand-tagged seeds, it is far more scalable than approaches that learn models for each individual relation from hand-tagged data. Thus, REALM is ideally suited for open information extraction, where the relations of interest are not specified in advance and their number is potentially vast.
