Modern Data Formats for Big Bioinformatics Data Analytics

Next Generation Sequencing (NGS) technology has produced massive amounts of proteomics and genomics data. This data is of little use unless it is properly analyzed. ETL (Extraction, Transformation, Loading) is an important step in designing data analytics applications, and it requires a proper understanding of the features of the data. The data format plays a key role in how data is understood and represented, the space required to store it, the I/O incurred while processing it, the handling of intermediate results, in-memory analysis, and the overall processing time. Different data mining and machine learning algorithms require input data in specific types and formats. This paper explores the data formats used by different tools and algorithms and also presents modern data formats used on Big Data platforms. It will help researchers and developers choose an appropriate data format for a particular tool or algorithm.
