\(S^2FS\): Single Score Feature Selection Applied to the Problem of Distinguishing Long Non-coding RNAs from Protein Coding Transcripts

The task of distinguishing long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs) has previously been addressed with machine learning (ML) algorithms that use hundreds of features. However, a large number of features can degrade the predictive performance of these algorithms, since the curse of dimensionality makes them prone to problems such as overfitting. To deal with these problems, dimensionality reduction techniques have been proposed, among them feature selection. This work proposes and experimentally evaluates a simple and fast feature selection technique called Single Score Feature Selection (\(S^2FS\)).
