Applications in Data-Intensive Computing

Abstract: The total quantity of digital information in the world is growing at an alarming rate. Scientists and engineers contribute heavily to this data “tsunami” by gathering data with computing and instrumentation at incredible rates. As data volumes and complexity grow, it becomes increasingly arduous to extract valuable information from the data and derive knowledge from it. Meeting the demands of ever-growing data volumes and complexity requires game-changing advances in software, hardware, and algorithms. Solution technologies must also scale to handle increased data collection and processing rates and simultaneously accelerate the delivery of timely and effective analysis results. This need for ever-faster data processing and manipulation, together with algorithms that scale to high-volume data sets, has given birth to a new paradigm or discipline known as “data-intensive computing.” In this chapter, we define data-intensive computing, identify the challenges of massive data, outline solutions for hardware, software, and analytics, and discuss a number of applications in the areas of biology, cyber security, and atmospheric research.
