Transaction System Support for Scientific Data Management

Problem: Scientific instruments and simulations create huge amounts of data, which must be processed and managed to extract the information they contain. The volume of data almost doubles every year, and the precision of the instruments improves year over year as well. This calls for more sophisticated data management tools that let scientists analyze the data to find the specific information they need, or to find trends and anomalies in the data.

While relational databases have been very successful in the commercial world, scientists still store their data in flat-file formats and process them in an ad hoc manner. Scientific data require different kinds of access patterns, such as spatial or temporal access, that are not properly supported in transaction-oriented systems. Even where such systems do support these access methods, the performance is not acceptable to scientists.

In the commercial world, column-based databases have been proposed to replace row-based relational systems for large-scale analytical databases. Column-based databases represent the data very compactly by exploiting the similarity of values within a column and compressing them. They also scale well with the number of columns in a table, and scientific tables typically have many columns. Column-based databases show great potential to solve many scientific database problems, but they have not been studied for scientific data by the database community. One important problem that column-based databases face in scientific data management is the lack of an efficient compression scheme to reduce the size of arbitrary-precision column values. With such a compression scheme, we expect to achieve a performance improvement of the same order of magnitude as that seen in commercial database systems.

Project: In this project, the student will study the implementation of an open-source column-based database system and survey different compression schemes applicable to arbitrary-precision data.
The student will then integrate the chosen compression scheme with the database code and study the performance of the system on different scientific data management problems.

Plan:
1. Study and identify compression schemes suitable for scientific database management systems.
2. Study the code of an existing column-based database management system.
3. Integrate the code for the compression scheme with the database code.
4. Evaluate the performance of the system against existing approaches and other schemes.
5. Suggest alternative data organizations for scientific data in column-oriented databases.
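
As an illustration of the kind of scheme step 1 might consider, the following is a minimal sketch of one lossless baseline: a byte shuffle followed by a general-purpose compressor. It is not the scheme this project proposes. It assumes fixed 64-bit floating-point values as a stand-in for arbitrary-precision data, and the function names shuffle_compress and shuffle_decompress are hypothetical.

```python
import math
import struct
import zlib


def shuffle_compress(values):
    """Losslessly compress a column of float64 values (byte shuffle + zlib)."""
    n = len(values)
    # Serialize the column: 8 little-endian bytes per value.
    raw = struct.pack(f"<{n}d", *values)
    # Byte shuffle: byte 0 of every value first, then byte 1 of every value, etc.
    shuffled = b"".join(raw[i::8] for i in range(8))
    # Prefix the value count so the column can be rebuilt on decompression.
    return struct.pack("<I", n) + zlib.compress(shuffled, 9)


def shuffle_decompress(blob):
    """Invert shuffle_compress and return the original list of floats."""
    (n,) = struct.unpack_from("<I", blob, 0)
    shuffled = zlib.decompress(blob[4:])
    raw = bytearray(8 * n)
    # Undo the shuffle: scatter each group of n bytes back to its byte position.
    for i in range(8):
        raw[i::8] = shuffled[i * n:(i + 1) * n]
    return list(struct.unpack(f"<{n}d", bytes(raw)))


# Assumed example column: slowly varying readings recorded to two decimals.
column = [round(20.0 + 0.25 * math.sin(i / 50.0), 2) for i in range(10_000)]
blob = shuffle_compress(column)
assert shuffle_decompress(blob) == column  # the round trip is exact
print(len(blob), "bytes compressed vs", 8 * len(column), "bytes raw")
```

The shuffle helps because neighboring values in a column usually share their sign, exponent, and leading mantissa bytes. Grouping bytes by position turns that similarity into long repetitive runs that a general-purpose compressor handles well. A scheme designed for arbitrary-precision values would need a variable-width encoding in place of the fixed 8-byte layout assumed here.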