Automated detection of cancerous genomic sequences using genomic signal processing and machine learning

Abstract Missense mutations are a primary cause of cancer, and identifying mutations in gene sequences is a preliminary step in cancer diagnosis. To identify a mutation, cancerous and non-cancerous gene sequences must first be told apart. Sequence comparison methods can identify a mutation only when a previously observed variant recurs; when no homologous variant is available, such methods struggle to distinguish cancerous from non-cancerous sequences. Here we use discrete wavelet transform (DWT) based genomic signal processing to identify patterns in the characteristics of the sequences, which can then be used with machine learning algorithms to differentiate between cancerous and non-cancerous sequences. Cancerous and non-cancerous gene sequences for lung, breast and ovarian cancer were obtained from NCBI. After numerical mapping of the sequences, a four-level DWT with the Haar wavelet is applied, and statistical features (mean, median, standard deviation, interquartile range, skewness and kurtosis) are extracted from the wavelet domain. Fed to machine learning algorithms, these features yielded 100% accuracy in classifying cancerous and non-cancerous sequences with a Support Vector Machine.
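The feature-extraction steps summarized in the abstract (numerical mapping, four-level Haar DWT, six statistics) can be illustrated with a minimal sketch in plain Python. Several choices here are assumptions and are not taken from the paper: the integer mapping A=1, C=2, G=3, T=4, the padding of odd-length signals by repeating the last sample, the computation of the six statistics separately for each of the five coefficient bands (giving 30 features), and the use of population (biased) moments with non-excess kurtosis. All function names are hypothetical. The classification step is omitted; the resulting feature vectors would be passed to a classifier such as a Support Vector Machine.

```python
import math
import statistics

# Hypothetical integer mapping; the abstract does not say which mapping was used.
BASE_TO_NUMBER = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def map_sequence(seq):
    """Convert a nucleotide string to a numeric signal, skipping unknown symbols."""
    return [BASE_TO_NUMBER[b] for b in seq.upper() if b in BASE_TO_NUMBER]


def haar_step(signal):
    """One Haar DWT level: return (approximation, detail) coefficients."""
    if len(signal) % 2:  # assumption: pad odd lengths by repeating the last sample
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels=4):
    """Multi-level Haar DWT, returned as [cA_n, cD_n, ..., cD_1]."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)  # coarsest detail band ends up first
    return [approx] + details


def band_features(coeffs):
    """Mean, median, std, IQR, skewness and kurtosis of one coefficient band."""
    n = len(coeffs)
    mean = statistics.fmean(coeffs)
    median = statistics.median(coeffs)
    std = statistics.pstdev(coeffs)  # population standard deviation
    q1, _, q3 = statistics.quantiles(coeffs, n=4)
    m3 = sum((c - mean) ** 3 for c in coeffs) / n
    m4 = sum((c - mean) ** 4 for c in coeffs) / n
    # Moment-based skewness and non-excess kurtosis; 0.0 for a constant band.
    skew = m3 / std ** 3 if std > 0 else 0.0
    kurt = m4 / std ** 4 if std > 0 else 0.0
    return [mean, median, std, q3 - q1, skew, kurt]


def extract_features(seq, levels=4):
    """Map a sequence, apply the DWT, and concatenate per-band statistics."""
    signal = map_sequence(seq)
    if len(signal) < 2 ** (levels + 1):
        raise ValueError("sequence too short for the requested number of levels")
    features = []
    for band in haar_dwt(signal, levels):
        features.extend(band_features(band))
    return features


if __name__ == "__main__":
    toy_sequence = "ATGC" * 16  # made-up input, not real NCBI data
    print(len(extract_features(toy_sequence)))
```

The Haar transform is written out by hand only to keep the sketch dependency-free. Libraries such as PyWavelets (for the DWT) and scikit-learn (for the SVM) provide equivalent building blocks and would be the more natural choice in practice.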