Clustering Distributed Short Time Series with Dense Patterns

Clustering genes with similar temporal profiles is an important task in gene expression data analysis. Current approaches to clustering sparse gene expression data with temporal information are not distributed, and their complexity is at least quadratic in the number of clusters, the number of genes, or both. In this paper, we present DTSCluster, the first distributed and density-based approach to short time series clustering, which is suitable for gene expression data. DTSCluster identifies dense patterns in the distributed datasets and uses them to generate the time series clusters. Comparative experiments show that DTSCluster scales with dataset size, with linear time and space complexity, and that it outperforms other representative approaches in cluster validation as measured by the silhouette index. The distributed setting also opens up the opportunity for collaborative data mining among different gene expression data holders.
