Lexicon-Grammar based open information extraction from natural language sentences in Italian

Abstract In the last decade, the quantity of readily accessible text has grown rapidly and enormously, far exceeding the capacity of humans to read and understand it. One of the most interesting strategies proposed to address this problem is Open Information Extraction (OIE). It is devised to read sentences and rapidly extract one or more domain-independent, coherent propositions, each represented by a verb relation and its arguments. Although many OIE approaches exist for English, no significant research has been conducted on OIE for Italian texts. Because they rely on language-specific features, OIE systems built for other languages are not directly applicable to Italian. Therefore, as its first contribution, this paper proposes a novel approach to OIE for the Italian language, based on standard linguistic structures to analyze sentences and on a set of verbal behavior patterns to extract information from them. These patterns are built by combining a solid theoretical linguistic framework, i.e. Lexicon-Grammar (LG), with distributional profiles extracted from a contemporary Italian corpus, i.e. itWaC. Starting from simple sentences, the approach first determines elementary tuples, then all their permutations obtained by adding complements and adverbials, and finally n-ary propositions. It does so by granting syntactic invariance, preserving overall grammaticality, and respecting some syntactic constraints and selection preferences, thus approximating a first level of semantic acceptability. As the second contribution of this work, a gold standard dataset for the Italian language has been built from the itWaC corpus, intended for wide use in the experimental validation of OIE solutions. It has been manually and independently labeled by four native Italian speakers with all the n-ary propositions that can be extracted, following the criteria of grammaticality and acceptability, i.e. granting syntactic well-formedness and meaningfulness in context. Finally, the proposed approach has been tested and quantitatively validated on this gold standard dataset. It has been compared both with an indirect approach, which translates input sentences from Italian to English, applies an OIE approach for English, and translates the output propositions back to Italian, and with an OIE system for Italian previously presented by the authors. The results show the effectiveness of the proposed approach in generating propositions that satisfy these criteria of grammaticality and acceptability. Although the approach has been evaluated for the Italian language, it is essentially based on linguistic resources produced by LG, which exist for many languages besides Italian, and on a representative corpus for the language under consideration. Given these premises, it has a general methodological basis and can be profitably extended to other languages.
