HacRED: A Large-Scale Relation Extraction Dataset Toward Hard Cases in Practical Applications

Relation extraction (RE) is an essential topic in natural language processing and has attracted extensive attention. Current RE approaches achieve strong results on common datasets, yet they still struggle in practical applications. In this paper, we analyze this performance gap and find that its underlying cause is that practical applications intrinsically contain more hard cases. To make RE models more robust on such practical hard cases, we propose a case-oriented construction framework to build a Hard Case Relation Extraction Dataset (HacRED). HacRED consists of 65,225 relational facts annotated from 9,231 documents with sufficient and diverse hard cases. Notably, HacRED is one of the largest Chinese document-level RE datasets and achieves a high 96% F1 score on data quality. Furthermore, we apply state-of-the-art RE models to this dataset and conduct a thorough evaluation. The results show that the performance of these models is far lower than that of humans, and that applying RE to practical hard cases still requires further effort. HacRED is publicly available at https://github.com/qiaojiim/HacRED.
