A dissimilarity-based classifier for generalized sequences by a granular computing approach

In this paper we propose a classifier for generalized sequences conceived within the granular computing framework. The classification system processes input sequences of objects through an interplay of dissimilarity-based and clustering-based techniques. The core data mining engine retrieves information granules, which are then used to represent the input sequences as feature vectors. This representation makes it possible to address the original sequence classification problem with standard pattern recognition tools. We evaluated the generalization capability of the system in a case study concerning the protein folding problem. In the considered dataset, the entire E. coli proteome was screened to predict relative protein solubility from the amino acid sequence alone. We report the analysis of the dataset under different settings and the resulting classification accuracy on the test set. The developed system also allows knowledge to be extracted from the training set through the analysis of the retrieved information granules.
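The pipeline described above (extract recurring substructures, group them by dissimilarity, then count granule occurrences to build a feature vector) can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed-length n-grams, the Levenshtein dissimilarity, the greedy threshold clustering, and all function and parameter names (`find_granules`, `embed`, `n`, `theta`) are assumptions chosen only to make the idea concrete for plain strings.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (assumed dissimilarity measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def ngrams(seq: str, n: int) -> List[str]:
    """All contiguous subsequences of length n."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def find_granules(sequences: List[str], n: int, theta: int) -> List[str]:
    """Greedy threshold clustering (hypothetical stand-in for the
    clustering step): an n-gram founds a new cluster unless some
    existing representative lies within distance theta."""
    reps: List[str] = []
    for seq in sequences:
        for g in ngrams(seq, n):
            if not any(levenshtein(g, r) <= theta for r in reps):
                reps.append(g)
    return reps


def embed(seq: str, granules: List[str], n: int, theta: int) -> List[int]:
    """Feature vector: for each granule, the number of n-grams of seq
    that match it approximately (distance <= theta)."""
    grams = ngrams(seq, n)
    return [sum(levenshtein(g, r) <= theta for g in grams) for r in granules]


if __name__ == "__main__":
    train = ["ABCABC", "ABDABD"]
    granules = find_granules(train, n=3, theta=1)
    for s in train:
        print(s, embed(s, granules, n=3, theta=1))
```

The resulting vectors have one component per granule, so any standard classifier operating on fixed-length numeric inputs could be trained on them. Inspecting `granules` directly corresponds, loosely, to the knowledge-extraction step mentioned in the abstract.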
