Identification of miRNA signature using Next-Generation Sequencing data of prostate cancer

MicroRNAs (miRNAs) are a class of ~22-nucleotide endogenous noncoding RNAs that have critical functions across various biological processes. It is well known that miRNAs play a crucial role in regulating the expression of target genes by repressing translation or promoting messenger RNA degradation. Therefore, identifying discriminative and differentially expressed miRNAs as a signature is an important task for cancer therapy. In this regard, Next-Generation Sequencing (NGS) data of miRNAs, available at The Cancer Genome Atlas (TCGA) repository, are analyzed here for prostate cancer. This cancer type is reported in the literature as a serious threat to men's health. Hence, finding a miRNA signature from NGS-based miRNA expression data for prostate cancer is an important research direction. Motivated by this fact, a new miRNA signature identification method for prostate cancer is proposed. The proposed method combines a global optimization technique, Simulated Annealing (SA), with Principal Component Analysis (PCA) and a Support Vector Machine (SVM) classifier. Here, SA encodes L features, in this case miRNAs. The same number, L, of top principal components of the original dataset is extracted using PCA. Thereafter, these components are multiplied with the reduced subset of data so that the classification task can be performed on a diverse dataset using SVM. The classification accuracy of the SVM is the objective that SA optimizes. The proposed method can thus be seen as a feature selection technique for finding a potential miRNA signature. Finally, the experimental results provide a set of miRNAs with optimal classification accuracy. However, because the algorithm is stochastic, a list of miRNAs is prepared. Among the top 15 miRNAs of that list, four, namely hsa-mir-152, hsa-mir-23a, hsa-mir-302f and hsa-mir-101-1, are associated with prostate cancer. Moreover, the performance of the proposed method has been compared with other widely used state-of-the-art techniques. Furthermore, the obtained results have been justified by means of a statistical test along with biological significance tests for the selected miRNAs.
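
The SA, PCA and SVM pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the principal components are multiplied with the reduced data, so the sketch assumes that the columns selected by SA are projected onto the PCA loadings restricted to those same columns. The function names (`fitness`, `sa_select`), the swap-one-feature neighbourhood, the geometric cooling schedule, the RBF kernel, 3-fold cross-validation and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def fitness(X, y, subset, loadings):
    """Cross-validated SVM accuracy for one candidate feature subset."""
    # Assumed reading of "components are multiplied with the reduced subset":
    # project the selected columns onto the PCA loadings restricted to them.
    X_sub = X[:, subset]           # shape (n_samples, L)
    W_sub = loadings[subset, :]    # shape (L, L)
    X_rot = X_sub @ W_sub
    clf = SVC(kernel="rbf", gamma="scale")
    return cross_val_score(clf, X_rot, y, cv=3).mean()


def sa_select(X, y, L=5, t_start=1.0, t_end=0.01, alpha=0.9,
              moves_per_temp=5, seed=0):
    """Simulated annealing over subsets of L feature indices (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Top-L principal components of the full data; loadings is (n_features, L).
    loadings = PCA(n_components=L).fit(X).components_.T
    current = rng.choice(n_features, size=L, replace=False)
    current_acc = fitness(X, y, current, loadings)
    best, best_acc = current.copy(), current_acc
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            # Neighbour move: swap one selected feature for an unselected one.
            candidate = current.copy()
            unused = np.setdiff1d(np.arange(n_features), current)
            candidate[rng.integers(L)] = rng.choice(unused)
            cand_acc = fitness(X, y, candidate, loadings)
            delta = cand_acc - current_acc
            # Metropolis rule: keep improvements, sometimes keep worse moves.
            if delta >= 0 or rng.random() < np.exp(delta / t):
                current, current_acc = candidate, cand_acc
                if current_acc > best_acc:
                    best, best_acc = current.copy(), current_acc
        t *= alpha  # geometric cooling
    return np.sort(best), best_acc


# Synthetic stand-in for an expression matrix: 60 samples, 30 features,
# with the first three features shifted in the second class.
demo_rng = np.random.default_rng(42)
X = demo_rng.normal(size=(60, 30))
y = np.repeat([0, 1], 30)
X[y == 1, :3] += 2.0

selected, acc = sa_select(X, y, L=5)
print("selected feature indices:", selected)
print("cross-validated accuracy:", round(acc, 3))
```

Because SA is stochastic, different values of `seed` can return different subsets. Repeating the run over several seeds and counting how often each feature is selected is one way to obtain a ranked list of features of the kind the abstract mentions.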
