SP-BRAIN: scalable and reliable implementations of a supervised relevance-based machine learning algorithm

In this work, new implementations of the U-BRAIN (Uncertainty-managing Batch Relevance-Based Artificial Intelligence) supervised machine learning algorithm are described. The implementations, referred to as SP-BRAIN (SP stands for Spark), aim to process large datasets efficiently. Given the iterative nature of the algorithm and its dependence on in-memory data, a non-standard MapReduce paradigm is applied that addresses several memory and performance issues, e.g., the granularity of the map tasks, the reduction of shuffling operations, caching, partial data recomputation, and cluster usage. The implementations benefit from the full set of Hadoop ecosystem components, such as HDFS, YARN, and streaming. Testing is performed in cloud execution environments, using different configurations with up to 128 cores. The performance of the new implementations is evaluated on three known datasets, and the findings are compared with those of a previous U-BRAIN parallel implementation. The results show a speedup of up to 20× with good scalability and reliability in cluster environments.
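As a rough illustration of two of the issues named above (map-task granularity and reduced shuffling), the following minimal sketch shows partition-level pre-aggregation in plain Python. It is not the SP-BRAIN code and does not reproduce the U-BRAIN relevance computation: the per-record work is a stand-in (counting feature/label co-occurrences), and every name in it (`partition`, `map_partition`, `merge`, `count_feature_label_pairs`) is hypothetical. The idea it demonstrates is only that each coarse-grained map task emits one small partial result instead of one item per record, so less data has to cross partitions before the reduce step.

```python
from collections import Counter
from functools import reduce


def partition(records, num_partitions):
    """Split records into coarse-grained chunks, one chunk per map task."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_partition(chunk):
    """Map task: aggregate locally so only one small Counter leaves the chunk."""
    local = Counter()
    for features, label in chunk:
        for index, value in enumerate(features):
            # Stand-in statistic (assumption): count (feature index, value, label).
            local[(index, value, label)] += 1
    return local


def merge(left, right):
    """Reduce step: add the counts of two partial results."""
    left.update(right)
    return left


def count_feature_label_pairs(records, num_partitions=4):
    """Run the map tasks, then merge their partial results into one Counter."""
    partials = [map_partition(chunk) for chunk in partition(records, num_partitions)]
    return reduce(merge, partials, Counter())


# Toy data (assumption): binary feature vectors paired with a class label.
records = [((0, 1), 1), ((0, 0), 0), ((1, 1), 1), ((0, 1), 0)]
totals = count_feature_label_pairs(records, num_partitions=2)
```

The merged result does not depend on how many partitions are used; only the number and size of the partial results exchanged between tasks changes, which is the quantity a coarser map granularity is meant to keep small.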
