NoSQL vs relational database: A comparative study about the generation of the most frequent N-grams

This study aims to help data mining developers achieve better performance when obtaining the most frequent N-grams in Text Mining projects. Building new variables is one of the oldest and still most challenging problems in Data Mining, and the most frequent N-grams are commonly used as input variables in Text Mining. An N-gram is a sequence of N consecutive items in a given text, where the items can be letters or words. This paper compares the performance of the two main data storage approaches, relational and NoSQL databases, on the task of obtaining the most frequent N-grams. The study was validated on a data set from a well-known benchmark, the international competition organized by PAN@CLEF 2013. A one-tailed paired t-test showed that the NoSQL approach is statistically superior in performance to the relational approach at a 95% confidence level.
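
To make the N-gram definition concrete, the following is a minimal in-memory sketch of counting the most frequent N-grams over words or letters. It is not the relational or NoSQL procedure evaluated in the study; the function name `most_frequent_ngrams`, its parameters, and the lowercasing and whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter


def most_frequent_ngrams(text, n, k, unit="word"):
    """Return the k most frequent N-grams of `text` as (ngram, count) pairs.

    Hypothetical helper: `unit` selects whether items are whitespace-separated
    words ("word") or individual characters (any other value).
    """
    normalized = text.lower()
    # Items are either words or letters, as in the definition above.
    items = normalized.split() if unit == "word" else list(normalized)
    # Slide a window of length n over the items; each window is one N-gram.
    grams = (tuple(items[i:i + n]) for i in range(len(items) - n + 1))
    return Counter(grams).most_common(k)


if __name__ == "__main__":
    print(most_frequent_ngrams("to be or not to be", n=2, k=1))
    print(most_frequent_ngrams("banana", n=2, k=2, unit="char"))
```

In a database-backed setting, the same counting step would be expressed as a grouping and ordering query over stored N-grams, which is the kind of operation whose cost differs between storage approaches.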
