Automated Enzyme Classification by Formal Concept Analysis

Enzymes are molecules whose catalytic activity makes them essential for any biochemical reaction. High-throughput genomic techniques give access to the protein sequences of new enzymes found in living organisms. Predicting the functional activity of an enzyme from its sequence is a crucial task that can be approached by comparing the new sequences with those of known enzymes labeled with a family class. This task is difficult because the activity depends on a combination of small sequence patterns, and because sequences have evolved greatly over time. This paper presents a classifier based on identifying subsequence blocks common to known and new enzymes and on searching, for each class, for formal concepts built on the cross product of blocks and sequences. Since new enzyme families may emerge, it is also important to propose a first classification of enzymes that cannot be assigned to a known family. Formal concept analysis (FCA) offers a convenient framework for stating the task as an optimization problem on the set of concepts. The classifier has been tested successfully on the haloacid dehalogenase superfamily, a set of enzymes present in a large variety of species.
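To make the notion of a formal concept over blocks and sequences concrete, here is a minimal sketch in Python. It is not the classifier described in the paper: the toy context, the sequence and block names, and the brute-force enumeration are all assumptions chosen for illustration. A concept is a pair (set of sequences, set of blocks) such that the blocks are exactly those shared by all the sequences, and the sequences are exactly those containing all the blocks.

```python
from itertools import combinations

# Hypothetical toy context: which subsequence blocks occur in which sequences.
# Names (seq1, b1, ...) are illustrative and do not come from the paper.
CONTEXT = {
    "seq1": {"b1", "b2"},
    "seq2": {"b1", "b2", "b3"},
    "seq3": {"b3"},
}


def common_blocks(seqs, context, all_blocks):
    """Blocks shared by every sequence in seqs (all blocks if seqs is empty)."""
    shared = set(all_blocks)
    for s in seqs:
        shared &= context[s]
    return shared


def sequences_with(blocks, context):
    """Sequences that contain every block in blocks."""
    return {s for s, present in context.items() if blocks <= present}


def formal_concepts(context):
    """Enumerate all formal concepts by closing every subset of sequences.

    Brute force (exponential in the number of sequences); only suitable
    for tiny examples such as this one.
    """
    all_blocks = set().union(*context.values())
    names = sorted(context)
    concepts = set()
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_blocks(subset, context, all_blocks)
            extent = sequences_with(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


if __name__ == "__main__":
    for extent, intent in sorted(formal_concepts(CONTEXT),
                                 key=lambda c: (len(c[0]), sorted(c[0]))):
        print(sorted(extent), sorted(intent))
```

In this toy context, the pair ({seq1, seq2}, {b1, b2}) is a concept because b1 and b2 are the only blocks shared by both sequences and no other sequence contains both. A classifier along the lines of the abstract would then select among such concepts per class; that selection step is not shown here.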
