Generating Actionable Knowledge from Big Data

The last few years have seen a rapid increase of sheer amount of data produced and communicated over the Internet and the Web. While it is widely believed that the availability of such ``Big Data'' holds the potential to revolutionize many aspects of our modern society (e.g., intelligent transportation, environmental monitoring, and energy saving), many challenges need to be addressed before this potential can be realized. This PhD project focuses on one critical challenge, namely extracting actionable knowledge from Big Data. Tremendous efforts have been contributed on mining large-scale data on the Web and constructing comprehensive knowledge bases (KBs). However, existing knowledge extraction systems retrieve data from limited types of Web sources. In addition, data fusion approaches consider very little of the noises produced by those knowledge extraction systems. Consequently, the constructed KBs are far from being comprehensive and accurate. In this paper, we present our initial design of a framework for extracting machine-readable data with high precision and recall from four types of data sources, namely Web texts, Document Object Model (DOM) trees, existing KBs, and query stream. Confidence scores are attached to the resulting knowledge, which can be used to further improve the knowledge fusion results.

[1]  Gerhard Weikum,et al.  Discovering and Exploring Relations on the Web , 2012, Proc. VLDB Endow..

[2]  Bo Zhao,et al.  A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration , 2012, Proc. VLDB Endow..

[3]  Erhard Rahm,et al.  Schema Matching and Mapping , 2013, Schema Matching and Mapping.

[4]  Divesh Srivastava,et al.  Global detection of complex copying relationships between sources , 2010, Proc. VLDB Endow..

[5]  Wei-Ying Ma,et al.  Simultaneous record detection and attribute labeling in web data extraction , 2006, KDD '06.

[6]  Daniel S. Weld,et al.  Automatically refining the wikipedia infobox ontology , 2008, WWW.

[7]  Alicia Ageno,et al.  Adaptive information extraction , 2006, CSUR.

[8]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[9]  Oren Etzioni,et al.  Open Information Extraction: The Second Generation , 2011, IJCAI.

[10]  Haixun Wang,et al.  Probase: a probabilistic taxonomy for text understanding , 2012, SIGMOD Conference.

[11]  Dan Roth,et al.  Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Making Better Informed Trust Decisions with Generalized Fact-Finding , 2022 .

[12]  Brad Adelberg,et al.  NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents , 1998, SIGMOD '98.

[13]  Simone Paolo Ponzetto,et al.  Deriving a Large-Scale Taxonomy from Wikipedia , 2007, AAAI.

[14]  Andrew McCallum,et al.  A joint model for discovering and linking entities , 2013, AKBC '13.

[15]  Lidong Bing,et al.  Towards a unified solution: data record region detection and segmentation , 2011, CIKM '11.

[16]  Dan Roth,et al.  Generalized fact-finding , 2011, WWW.

[17]  Ricardo Baeza-Yates,et al.  Clustering and exploring search results using timeline constructions , 2009, CIKM.

[18]  Robert L. Grossman,et al.  Mining data records in Web pages , 2003, KDD '03.

[19]  Rahul Gupta,et al.  Biperpedia: An Ontology for Search Applications , 2014, Proc. VLDB Endow..

[20]  Valter Crescenzi,et al.  RoadRunner: automatic data extraction from data-intensive web sites , 2002, SIGMOD '02.

[21]  Elena Console,et al.  Data Fusion , 2009, Encyclopedia of Database Systems.

[22]  Calton Pu,et al.  XWRAP: an XML-enabled wrapper construction system for Web information sources , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[23]  Ashwin Machanavajjhala,et al.  Entity Resolution: Theory, Practice & Open Challenges , 2012, Proc. VLDB Endow..

[24]  Divesh Srivastava,et al.  Fusing data with correlations , 2014, SIGMOD Conference.

[25]  Divesh Srivastava,et al.  Truth Finding on the Deep Web: Is the Problem Solved? , 2012, Proc. VLDB Endow..

[26]  Divesh Srivastava,et al.  Less is More: Selecting Sources Wisely for Integration , 2012, Proc. VLDB Endow..

[27]  Divesh Srivastava,et al.  Integrating Conflicting Data: The Role of Source Dependence , 2009, Proc. VLDB Endow..

[28]  Wei Zhang,et al.  From Data Fusion to Knowledge Fusion , 2014, Proc. VLDB Endow..

[29]  Torsten Suel,et al.  Interactive wrapper generation with minimal user effort , 2006, WWW '06.

[30]  Paul A. Viola,et al.  Interactive Information Extraction with Constrained Conditional Random Fields , 2004, AAAI.

[31]  Tobias Dönz Extracting Structured Data from Web Pages , 2003 .

[32]  Estevam R. Hruschka,et al.  Toward an Architecture for Never-Ending Language Learning , 2010, AAAI.

[33]  Gerhard Weikum,et al.  A Language Modeling Approach for Temporal Information Needs , 2010, ECIR.

[34]  Wei Zhang,et al.  Knowledge vault: a web-scale approach to probabilistic knowledge fusion , 2014, KDD.