Employing Semantic Context for Sparse Information Extraction Assessment

The huge amount of text available on the World Wide Web presents an unprecedented opportunity for information extraction (IE). One important assumption in IE is that frequent extractions are more likely to be correct. Sparse IE is hence a challenging task: no matter how big a corpus is, some extractions are supported by only a small amount of evidence in it. However, there is limited research on sparse IE, especially on assessing the validity of sparse extractions. Motivated by this, we introduce a lightweight, explicit semantic approach for assessing sparse IE. First, we use a large semantic network consisting of millions of concepts, entities, and attributes to explicitly model the context of any semantic relationship. Second, we learn from three semantic contexts using different base classifiers to select an optimal classification model for assessing sparse extractions. Finally, experiments show that, compared with several state-of-the-art approaches, our approach significantly improves the F-score in the assessment of sparse extractions while maintaining efficiency.
