Automatic Extraction of Highly Predictive Sequence Features that Incorporate Contiguity and Mutation

This paper investigates the problem of extracting sequence features that are useful for constructing prediction models. The method introduced here generates candidate features from contiguous subsequences and their mutations, and selects those candidates that are strongly associated with the classification target according to the Gini index. Experimental results on three genetic data sets provide evidence that this method outperforms other sequence feature generation methods from the literature, especially in domains where the presence of a feature within a sequence, rather than its specific location, is pertinent for classification.
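To make the idea in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes that candidate features are fixed-length contiguous subsequences (k-mers), that a "mutation" means a substitution at up to `max_mismatches` positions, that a feature is scored by the reduction in Gini impurity obtained by splitting sequences on whether it is present or absent, and that position within the sequence is ignored. All function and parameter names (`kmers`, `occurs`, `gini_gain`, `top_features`, `max_mismatches`) are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of contiguous length-k subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def occurs(seq, motif, max_mismatches=0):
    """True if motif appears anywhere in seq with at most
    max_mismatches substitutions (assumed mutation model)."""
    k = len(motif)
    for i in range(len(seq) - k + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + k], motif))
        if mismatches <= max_mismatches:
            return True
    return False


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(seqs, labels, motif, max_mismatches=0):
    """Reduction in Gini impurity from splitting on motif presence."""
    flags = [occurs(s, motif, max_mismatches) for s in seqs]
    present = [y for f, y in zip(flags, labels) if f]
    absent = [y for f, y in zip(flags, labels) if not f]
    n = len(labels)
    weighted = (len(present) / n) * gini(present) + (len(absent) / n) * gini(absent)
    return gini(labels) - weighted


def top_features(seqs, labels, k, max_mismatches=0, top_n=5):
    """Rank every k-mer seen in the data by its Gini gain."""
    candidates = set().union(*(kmers(s, k) for s in seqs))
    scored = [(gini_gain(seqs, labels, m, max_mismatches), m) for m in candidates]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best gain first, ties by motif
    return scored[:top_n]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    seqs = ["ACGTAC", "TTACGA", "GGGTTT", "CCCTTT"]
    labels = [1, 1, 0, 0]
    for gain, motif in top_features(seqs, labels, k=3):
        print(motif, round(gain, 3))
```

Because `occurs` scans every window of the sequence, a feature is counted as present regardless of where it appears, which mirrors the location-independent setting the abstract highlights. Raising `max_mismatches` lets a single candidate cover its mutated variants, at the cost of matching more sequences by chance.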
