Query-Driven Data Profiling with OCEANProfile

Complex data analysis scenarios often require discovering and combining multiple data sources. Data scientists usually formulate a series of SQL queries that build on each other, called a session, to iteratively derive results. However, when the data sources are unfamiliar or the query results are complex, it is hard to decide on the next query based solely on the results of the previous one. While existing approaches provide mechanisms to assess the results of a specific query, support for analyzing results in the context of the respective session remains mostly absent. Moreover, such approaches do not integrate seamlessly with established tools and workflows. To overcome these problems, we introduce OCEANProfile, a framework for session-based profiling of query results. Query results are intercepted at the driver level and streamed into our framework for automated data profiling. Result profiles can be compared with those of previous queries and visualized in a companion app compatible with existing analysis tools. Visualizations are automatically ranked according to their usefulness in the context of the respective session.
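
To make the idea of session-based profiling concrete, the following minimal sketch wraps a database connection, profiles each query result as it is fetched, and compares the latest profile with the previous one. It is an illustration under stated assumptions, not the OCEANProfile implementation: it uses Python's built-in sqlite3 module in place of a real driver-level hook, the names ColumnProfile, profile_rows, and ProfilingSession are hypothetical, and the profile is reduced to three simple statistics. Ranking and visualization are left out.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Hypothetical per-column profile; a real profiler would track far more."""
    row_count: int
    null_count: int
    distinct_count: int


def profile_rows(column_names, rows):
    """Compute a small profile for every column of one query result."""
    profiles = {}
    for index, name in enumerate(column_names):
        values = [row[index] for row in rows]
        non_null = [value for value in values if value is not None]
        profiles[name] = ColumnProfile(
            row_count=len(values),
            null_count=len(values) - len(non_null),
            distinct_count=len(set(non_null)),
        )
    return profiles


class ProfilingSession:
    """Wraps a connection so that every executed query is profiled.

    Assumption: wrapping the connection stands in for intercepting
    results at the driver level.
    """

    def __init__(self, connection):
        self.connection = connection
        self.history = []  # one (sql, profiles) entry per executed query

    def execute(self, sql, parameters=()):
        cursor = self.connection.execute(sql, parameters)
        rows = cursor.fetchall()
        # cursor.description is None for statements that return no rows.
        names = [column[0] for column in (cursor.description or [])]
        self.history.append((sql, profile_rows(names, rows)))
        return rows

    def diff_last_two(self):
        """Return per-column changes between the two most recent profiles."""
        if len(self.history) < 2:
            return {}
        previous = self.history[-2][1]
        current = self.history[-1][1]
        changes = {}
        for name in sorted(current.keys() & previous.keys()):
            changes[name] = {
                "row_count": current[name].row_count - previous[name].row_count,
                "null_count": current[name].null_count - previous[name].null_count,
                "distinct_count": current[name].distinct_count
                - previous[name].distinct_count,
            }
        return changes


# Usage: a two-query session over a small in-memory table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", 10.0), (2, "b", None), (3, "a", 5.0)],
)

session = ProfilingSession(connection)
session.execute("SELECT id, customer, amount FROM orders")
session.execute("SELECT id, customer, amount FROM orders WHERE amount IS NOT NULL")
delta = session.diff_last_two()
print(delta["amount"])
```

In this toy session the second query filters out the row with a missing amount, so the comparison reports one fewer row and one fewer null in the amount column. This is the kind of session-level signal that looking at a single result in isolation would not surface.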