IMMAN: free software for information theory-based chemometric analysis

The features and theoretical background of IMMAN (acronym for Information theory-based CheMoMetrics ANalysis), a new and free computational program for chemometric analysis, are presented. This multi-platform software is developed in the Java programming language and provides a user-friendly graphical interface for computing a collection of information-theoretic functions adapted for rank-based unsupervised and supervised feature selection tasks. A total of 20 feature selection parameters are presented, 10 for the unsupervised framework and 10 for the supervised one. Several information-theoretic parameters traditionally used as molecular descriptors (MDs) are adapted for use as unsupervised rank-based feature selection methods. For supervised feature selection, a generalization scheme for the previously defined differential Shannon’s entropy is discussed, and the Jeffreys information measure is introduced. Moreover, well-known information-theoretic feature selection parameters, such as information gain, gain ratio, and symmetrical uncertainty, are incorporated into the IMMAN software (http://mobiosd-hub.com/imman-soft/), following an equal-interval discretization approach. IMMAN offers data pre-processing functionalities, such as missing values processing, dataset partitioning, and browsing. Single-parameter and ensemble (multi-criteria) ranking options are also provided. Consequently, this software is suitable for tasks such as dimensionality reduction, feature ranking, and comparative diversity analysis of data matrices. Simple examples of applications performed with this program are presented. A comparative study between the IMMAN and WEKA feature selection tools on the Arcene dataset showed similar behavior. In addition, the use of IMMAN unsupervised feature selection methods is shown to improve the performance of both IMMAN and WEKA supervised algorithms.

Graphical abstract: Graphic representation for Shannon’s distribution of MD calculating software.
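As an illustration of the supervised parameters named in the abstract, the following is a minimal sketch of information gain, gain ratio, and symmetrical uncertainty computed after equal-interval discretization, using their standard textbook definitions. It is not IMMAN's code or API: the function names, the base-2 logarithm, and the default number of bins are assumptions made only for this example.

```python
import math
from collections import Counter


def equal_interval_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls entirely into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def conditional_entropy(labels, bins):
    """H(Y | X): entropy of the labels within each bin, weighted by bin size."""
    n = len(labels)
    groups = {}
    for b, y in zip(bins, labels):
        groups.setdefault(b, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def supervised_scores(values, labels, n_bins=2):
    """Score one feature against class labels (hypothetical helper, not IMMAN's)."""
    bins = equal_interval_bins(values, n_bins)
    h_y = entropy(labels)
    h_x = entropy(bins)
    info_gain = h_y - conditional_entropy(labels, bins)   # IG = H(Y) - H(Y|X)
    gain_ratio = info_gain / h_x if h_x > 0 else 0.0      # GR = IG / H(X)
    total = h_x + h_y
    sym_unc = 2 * info_gain / total if total > 0 else 0.0  # SU = 2 IG / (H(X) + H(Y))
    return {
        "info_gain": info_gain,
        "gain_ratio": gain_ratio,
        "symmetrical_uncertainty": sym_unc,
    }


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    features = {
        "separating": [0.1, 0.2, 0.9, 1.0],
        "uninformative": [0.1, 0.9, 0.2, 1.0],
    }
    # Rank features by information gain, highest first.
    ranked = sorted(
        features,
        key=lambda name: supervised_scores(features[name], labels)["info_gain"],
        reverse=True,
    )
    print(ranked)
```

Ranking features by any one of these scores gives a single-parameter ranking. An ensemble ranking would combine several such scores, which this sketch does not attempt.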
