Joining and aggregating datasets using CouchDB

Data mining typically requires operations that cut across entity boundaries, and these are awkward to implement in document-oriented databases. CouchDB, for example, models entities as documents with highly isolated boundaries, on which joins cannot be performed directly. This project shows how joins and aggregation can be achieved across entity boundaries in such systems, as encountered, for example, in the pre-processing and exploration stages of educational data mining. A software stack is presented for this purpose: first, datasets are processed via ETL operations; then MapReduce is used to create indices of ordered and aggregated data; finally, a CouchDB list function iterates through these indices to perform joins and to compute aggregate values, such as variance and correlations, on the joined datasets. The case study shows that the proposed approach to implementing cross-document joins and aggregation is effective and scalable. In addition, it was discovered that, for the 2014-2016 UCT cohorts, NBT scores correlate better with final grades for the CSC1015F course than do Grade 12 results for English, Science and Mathematics.
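
The final stage of the stack can be illustrated with a minimal sketch. It is written in Python for readability and is not the project's code: an actual CouchDB list function would be written in JavaScript and would read rows from a view. The sketch assumes an index whose rows are already sorted by a compound key of (student id, source), so that the two datasets interleave per student and a single pass can join them while accumulating running sums. The names `ROWS`, `join_sorted_rows` and `streaming_stats`, the source labels `"nbt"` and `"grade"`, and all values are hypothetical.

```python
from math import sqrt

# Hypothetical index rows, already sorted by (student_id, source),
# standing in for the ordered output of a MapReduce view.
ROWS = [
    (("s1", "grade"), 60), (("s1", "nbt"), 50),
    (("s2", "grade"), 70), (("s2", "nbt"), 60),
    (("s3", "nbt"), 55),   # no grade row, so dropped by the inner join
    (("s4", "grade"), 80), (("s4", "nbt"), 70),
]


def join_sorted_rows(rows):
    """Inner-join 'nbt' and 'grade' rows that share a key, in one pass.

    Relies on the rows being sorted by key, so all rows for one
    student are adjacent and only one group is held in memory.
    """
    current_key, pending = None, {}
    for (key, source), value in rows:
        if key != current_key:
            if "nbt" in pending and "grade" in pending:
                yield current_key, pending["nbt"], pending["grade"]
            current_key, pending = key, {}
        pending[source] = value
    # Flush the last group once the rows are exhausted.
    if "nbt" in pending and "grade" in pending:
        yield current_key, pending["nbt"], pending["grade"]


def streaming_stats(pairs):
    """Population variance and Pearson r from running sums."""
    n = sx = sy = sxx = syy = sxy = 0
    for _, x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    if n == 0:
        return {"n": 0, "var_x": None, "var_y": None, "r": None}
    mean_x, mean_y = sx / n, sy / n
    var_x = sxx / n - mean_x ** 2
    var_y = syy / n - mean_y ** 2
    cov = sxy / n - mean_x * mean_y
    # r is undefined when either variable has zero variance.
    r = cov / sqrt(var_x * var_y) if var_x > 0 and var_y > 0 else None
    return {"n": n, "var_x": var_x, "var_y": var_y, "r": r}


if __name__ == "__main__":
    print(streaming_stats(join_sorted_rows(ROWS)))
```

Because only running sums are kept, the pass needs constant memory regardless of index size, which is the property that makes a row-by-row iteration over a sorted index attractive. The sum-of-squares form used here is the simplest to follow but can lose precision on large values; a numerically stabler update rule may be preferable in practice.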
