Big data and machine learning framework for clouds and its usage for text classification

Reference architectures for big data and machine learning define not only interconnected building blocks but also considerations such as scalability, manageability, and usability. Even with such reference architectures available, the automated deployment of distributed toolsets and frameworks on various clouds remains challenging because of the diversity of technologies and protocols. This paper focuses on one widely used framework, an Apache Spark cluster with Jupyter, and on the cloud-agnostic Occopus orchestrator tool, which automates the framework's deployment and maintenance stages. The approach has been demonstrated and validated with a new, promising text classification application on the Hungarian academic research infrastructure, the OpenStack-based MTA Cloud. The paper explains the concept and the applied components, and illustrates their usage with measurements from a real use case.
