Proteomics is the study of the structure and behavior of proteins, and one of the primary approaches to protein identification and quantification is the analysis of mass spectrometry (MS) data. This analysis typically involves a series of computational steps, and the Purdue University Bindley Bioscience Center employs a computational workflow system, the Omics Discovery Pipeline (ODP), to assist in its analysis of MS data. One of the ODP's stages aligns the peaks in the MS data across multiple subjects. Because a study may include a large number of subjects, and each subject's MS data may contain a large number of peaks, the alignment step qualifies as a data-intensive computation. This research focuses on using Apache Hadoop MapReduce to align the processed MS data faster than the serial approach currently used in the ODP.