Understandable Big Data: A survey

This survey presents the concept of Big Data. First, a definition and the features of Big Data are given. Second, the different steps of Big Data processing and the main problems encountered in Big Data management are described. Next, a general overview of an architecture for handling Big Data is presented. Then, the problem of integrating a Big Data architecture into an existing information system is discussed. Finally, this survey tackles semantics (reasoning, coreference resolution, entity linking, information extraction, consolidation, paraphrase resolution, ontology alignment) in the Big Data context.
