Paper information - Logistic regression within DBMS

Logistic regression within DBMS

This paper develops an analytical query model for data categorization within a DBMS. Because the DBMS is a core asset for most organizations, classification inside it can give better insight into, and control over, the data. Conventionally, classification algorithms such as logistic regression and KNN are applied after exporting the data out of the DBMS, using non-DBMS tools such as R, matrix packages, generic data mining programs, or large-scale systems such as Hadoop and Spark. However, this incurs I/O overhead, since the data within the DBMS is updated frequently and usually does not fit in main memory. This paper proposes an alternative strategy, based on SQL and UDFs, that integrates logistic regression for data categorization and prediction query processing within the DBMS. SQL is compared with user-defined functions (UDFs) and with statistical packages such as R, through experiments on real datasets. The empirical results show the viability and validity of this approach for predicting the class of a given query.
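To make the idea of in-database logistic regression concrete, the following is a minimal sketch, not the authors' implementation. It assumes SQLite accessed through Python's standard `sqlite3` module as a stand-in DBMS, a hypothetical table `points(x1, x2, y)` with made-up rows, and plain batch gradient descent as the fitting procedure (the abstract does not state which DBMS or optimizer the paper uses). The logistic function is registered as a scalar UDF, each gradient is computed by a single SQL aggregate query so the rows never leave the database, and prediction is itself a SQL query.

```python
import math
import sqlite3


def sigmoid(z):
    """Logistic function, registered below as a scalar SQL UDF."""
    # Clamp z so math.exp cannot overflow for large |z|.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train(conn, steps=1000, lr=0.5):
    """Batch gradient descent; each gradient is one SQL aggregate query."""
    b0 = b1 = b2 = 0.0
    n = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
    for _ in range(steps):
        # err = predicted probability minus label, computed row by row in SQL.
        g0, g1, g2 = conn.execute(
            """
            SELECT SUM(err), SUM(err * x1), SUM(err * x2)
            FROM (SELECT x1, x2,
                         sigmoid(? + ? * x1 + ? * x2) - y AS err
                  FROM points)
            """,
            (b0, b1, b2),
        ).fetchone()
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
        b2 -= lr * g2 / n
    return b0, b1, b2


def demo():
    conn = sqlite3.connect(":memory:")
    conn.create_function("sigmoid", 1, sigmoid)  # expose the UDF to SQL
    conn.execute("CREATE TABLE points (x1 REAL, x2 REAL, y INTEGER)")
    # Hypothetical, linearly separable toy data (not from the paper).
    rows = [
        (0.0, 0.2, 0), (0.3, 0.1, 0), (0.2, 0.4, 0),
        (0.9, 0.8, 1), (0.7, 1.0, 1), (1.0, 0.6, 1),
    ]
    conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)
    beta = train(conn)
    # Prediction query: classify each stored row inside the database.
    preds = conn.execute(
        """
        SELECT y, sigmoid(? + ? * x1 + ? * x2) >= 0.5
        FROM points ORDER BY rowid
        """,
        beta,
    ).fetchall()
    conn.close()
    return beta, [(y, int(p)) for y, p in preds]


if __name__ == "__main__":
    coefficients, predictions = demo()
    print(coefficients)
    print(predictions)
```

Keeping the summation inside the query means only three numbers per iteration cross the database boundary, which is the kind of I/O saving the abstract argues for. A server DBMS would register the UDF through its own mechanism rather than `create_function`.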

Sandhya Harikumar | Jackson Isaac
