Semantic Profiling in Data Lake

In the Big Data community, data lakes have become the de facto standard for storing data. Often, these data lakes are contained within the Hadoop ecosystem, where the actual storage happens on the Hadoop Distributed File System (HDFS). The data is stored in its raw (structured or unstructured) form, and whenever an application needs the data, it interprets the raw data. This approach is known as schema-on-read: the interpretation of the data and any consistency checks happen when the data is read by an application. The biggest challenge for data lake governance is to prevent the lake from turning into a so-called data swamp. A data swamp arises for several reasons. First, there is the data quality aspect, such as noisy or incorrect data. Second, the volume of ingested data grows exponentially while no schema is enforced. Lastly, the number of schemas in use tends to grow. In this work we present a new metadata extension to data lake systems based on semantic profiling, which attempts to recognize the meaning of the data that is ingested into the data lake. The developed tool detects meaning not only at the schema level but also at the data instance level, by employing domain vocabularies and ontologies. With our tool, ingested datasets can easily be mapped to common domain concepts with unique identifiers, and the meaning of the data can be discovered by the system. The profiling tool helps to produce meaningful summaries of the ingested content and provides opportunities to link relevant datasets that were ingested using different data schemas. We evaluate our tool on two cancer genome datasets: we apply the semantic profiling tool during data ingestion and observe how the datasets are tagged and profiled. Our experiments show that Semantic Ingestion is a promising approach for enriching the datasets in a data lake.
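To make the idea of schema-level and instance-level tagging more concrete, the following is a minimal sketch, not the authors' tool. It assumes a toy in-memory vocabulary in place of a real domain ontology; the names `VOCABULARY`, `match_concepts`, and `profile`, the `EX:` identifiers, and the sample table are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy vocabulary: surface term -> concept identifier.
# The "EX:" identifiers are placeholders, not real ontology IRIs.
VOCABULARY = {
    "gene": "EX:0001",
    "tp53": "EX:0002",
    "brca1": "EX:0003",
    "tumor": "EX:0004",
    "sample": "EX:0005",
}


def normalize(text):
    """Lowercase the text and treat underscores as word separators."""
    return text.strip().lower().replace("_", " ")


def match_concepts(text):
    """Return the concept identifier of every token found in the vocabulary."""
    tokens = normalize(text).split()
    return [VOCABULARY[token] for token in tokens if token in VOCABULARY]


def profile(columns, rows):
    """Tag each column at schema level (header) and instance level (values)."""
    result = {}
    for index, name in enumerate(columns):
        instance_counts = Counter()
        for row in rows:
            instance_counts.update(match_concepts(str(row[index])))
        result[name] = {
            "schema_concepts": match_concepts(name),
            "instance_concepts": dict(instance_counts),
        }
    return result


# Made-up sample table standing in for an ingested dataset.
columns = ["gene_symbol", "tumor_type", "sample_id"]
rows = [
    ["TP53", "breast", "S1"],
    ["BRCA1", "lung", "S2"],
    ["TP53", "breast", "S3"],
]
summary = profile(columns, rows)
```

In this sketch, a column whose header carries no recognizable term could still be tagged through its values, which is the reason for looking at the instance level as well as the schema level. A real system would presumably replace the exact token lookup with queries against ontology services and use more tolerant matching.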
