NSEEN: Neural Semantic Embedding for Entity Normalization

Much of human knowledge is encoded in text, available in scientific publications, books, and the web. Given the rapid growth of these resources, we need automated methods to extract such knowledge into machine-processable structures, such as knowledge graphs. An important task in this process is entity normalization, which consists of mapping noisy entity mentions in text to canonical entities in well-known reference sets. Entity normalization is challenging for two reasons: a canonical entity often has many textual forms, not all of which are captured in the reference set, and entity mentions in text may contain many syntactic variations or errors. The problem is particularly acute in scientific domains, such as biology. To address it, we have developed a general, scalable solution based on a deep Siamese neural network model that embeds the semantic information about entities as well as their syntactic variations. We use these embeddings for fast mapping of new entities to large reference sets, and we empirically show the effectiveness of our framework on challenging bio-entity normalization datasets.
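The general idea of embedding-based normalization can be illustrated with a minimal sketch. It is not the NSEEN implementation: the `embed` function below is a hypothetical stand-in (hashed character trigrams) for a trained Siamese encoder, the contrastive loss is one common training objective for Siamese networks and is assumed here rather than taken from the abstract, and the reference names are made-up examples.

```python
import math
import zlib
from collections import Counter


def char_ngrams(text, n=3):
    """Return character n-grams of a lowercased, boundary-padded string."""
    padded = "#" + text.lower() + "#"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


def embed(text, dim=256):
    """Hypothetical stand-in encoder: hashed n-gram counts, L2-normalized.

    A trained Siamese network would replace this function; both branches
    of such a network share the same weights, so one encoder suffices.
    """
    vec = [0.0] * dim
    for gram, count in Counter(char_ngrams(text)).items():
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(dist, is_match, margin=1.0):
    """Contrastive loss for one pair (assumed objective, not from the source).

    Matching pairs are pulled together; non-matching pairs are pushed
    apart until they are at least `margin` away from each other.
    """
    if is_match:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2


def normalize(mention, reference_index):
    """Map a mention to the nearest reference entity in embedding space."""
    query = embed(mention)
    return min(reference_index,
               key=lambda name: distance(query, reference_index[name]))


# Illustrative reference set; embeddings are computed once and reused.
reference_set = ["interleukin 6", "tumor necrosis factor", "insulin"]
reference_index = {name: embed(name) for name in reference_set}

best = normalize("Interleukin-6", reference_index)
print(best)
```

The lookup here is a linear scan for brevity. For a large reference set, an approximate nearest-neighbor index over the precomputed embeddings would be the usual replacement.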
