Yet another Platform for Extracting Knowledge from Corpora

The research field of extracting knowledge bases from text collections appears to be mature: its target and its working hypotheses are clear. In this paper we propose YAPEK (Yet Another Platform for Extracting Knowledge from corpora), a platform intended as a common base for collecting most of the algorithms that extract knowledge bases from corpora. When many knowledge extraction algorithms are gathered under the same platform, they can be compared directly with one another, and several of them can be combined to extract more valuable knowledge for final tasks such as Textual Entailment Recognition. Because the platform must host many such algorithms, YAPEK is built on the three working hypotheses of the area: the basic hypothesis, the distributional hypothesis, and the point-wise assertion patterns. In YAPEK, these three hypotheses define two spaces: the space of the target textual forms and the space of the contexts. The platform supports rapid implementation of many models for extracting knowledge from corpora because it provides clear entry points for the parts that actually differ between algorithms: the feature spaces, the distances in these spaces, and the algorithm itself.
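As an illustration of the two spaces and the three entry points named above, the following minimal Python sketch builds context vectors for target forms and compares them with a pluggable distance. It is not YAPEK's API: the function names (`context_vectors`, `cosine`, `most_similar`), the choice of single tokens as target forms, the word-window context, and the cosine measure are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=1):
    """Feature space (hypothetical): map each target form to a bag of context words.

    Target forms are single lowercase tokens; contexts are the tokens found
    within `window` positions on either side of the target.
    """
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(target, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Distance entry point (hypothetical): cosine similarity of two sparse vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, similarity=cosine):
    """Algorithm entry point (hypothetical): rank other target forms by similarity."""
    scores = [
        (other, similarity(vectors[target], vec))
        for other, vec in vectors.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


corpus = ["the cat eats fish", "the dog eats meat", "a cat sleeps"]
vectors = context_vectors(corpus)
ranking = most_similar("cat", vectors)
```

Under the distributional hypothesis, forms that share contexts are taken to be related, so in this toy corpus "dog" ranks closest to "cat". Swapping the context definition, the similarity function, or the ranking step corresponds to changing one of the three parts that the abstract identifies as differing between algorithms.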
