Data mining for building knowledge bases: techniques, architectures and applications

Data mining techniques for extracting knowledge from text have been applied extensively to tasks including question answering, document summarisation, event extraction and trend monitoring. However, current methods have mainly been tested on small-scale, customised data sets built for specific purposes. The availability of large volumes of data and of high-velocity data streams (such as social media feeds) motivates the need to extract knowledge from such sources automatically and to generalise existing approaches to more practical applications. Recently, several architectures have been proposed for what we call knowledge mining: integrating data mining for knowledge extraction from unstructured text (possibly making use of a knowledge base) while, at the same time, consistently incorporating the new information into the knowledge base. After describing a number of existing knowledge mining systems, we review the state-of-the-art literature on both current text mining methods (emphasising stream mining) and techniques for the construction and maintenance of knowledge bases. In particular, we focus on mining entities and relations from unstructured text data sources, entity disambiguation, entity linking and question answering. We conclude by highlighting general trends in knowledge mining research and identifying problems that require further research to enable more extensive use of knowledge bases.
