Genomic big data hitting the storage bottleneck

Over the last decades, bioinformatics has seen a vast data explosion. Big data centres are trying to cope with this data crisis by reaching ever higher storage capacities. Although several scientific giants are examining how to handle the enormous pile of information in their cupboards, the problem remains unsolved. Every day, large amounts of information are permanently lost because of infrastructure and storage space limitations. As a result, the motivation for sequencing has fallen behind. Sometimes more time is spent solving storage space problems than collecting and analysing data. To bring sequencing back to the foreground, scientists have to overcome such obstacles and find alternative ways to approach the issue of data volume. The scientific community is living through a data crisis era, in which out-of-the-box solutions may ease the typical research workflow until technological development meets the needs of bioinformatics.
