Feature construction for knowledge extraction from biological sequences

Abstract. In this paper we study a data preprocessing problem: constructing features that describe biological sequences. To extract knowledge from biological sequences (DNA, RNA and proteins), any data mining system has to cope with the unusual representation of this kind of data: in its primary structure, a biological sequence is represented as a character string. Constructing features that describe the sequences is therefore an unavoidable preprocessing step. We review existing methods for constructing such features, in particular those based on n-grams, generalized suffix trees and hidden Markov models. Our contribution is the discriminative descriptors method, together with an in-depth comparative study of these methods applied to typical biological problems: recognition of E. coli gene promoter sites, recognition of primate splice junction sites, and protein classification. The results of each method are also compared against the Pfam motif database.
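As an illustration of the n-gram approach mentioned above, the following minimal sketch turns a sequence given as a character string into a fixed-length vector of overlapping n-gram counts. The function name `ngram_features`, the default DNA alphabet `ACGT` and the choice of raw counts as feature values are assumptions made for this example; it is not the paper's implementation, and it does not cover the discriminative descriptors method.

```python
from collections import Counter
from itertools import product


def ngram_features(sequence, n=2, alphabet="ACGT"):
    """Return a fixed-length vector of overlapping n-gram counts.

    Hypothetical helper: one feature per possible n-gram over `alphabet`,
    listed in lexicographic order of the alphabet's symbols.
    """
    # Count every overlapping substring of length n in the sequence.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # Enumerate the full n-gram vocabulary so that all sequences
    # map to vectors of the same length (len(alphabet) ** n).
    vocabulary = ["".join(p) for p in product(alphabet, repeat=n)]
    # N-grams containing symbols outside the alphabet are ignored.
    return [counts[gram] for gram in vocabulary]


if __name__ == "__main__":
    vector = ngram_features("ACGTAC", n=2)
    print(len(vector), vector)
```

Each sequence then becomes one row of an attribute-value table that a standard classifier can consume. For proteins, the same sketch would be called with the 20-letter amino acid alphabet, at the cost of a much larger vocabulary as n grows.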
