A practical tool for maximal information coefficient analysis

Abstract

Background: The ability to find complex associations in large omics datasets, assess their significance, and prioritize them according to their strength can be of great help in the data exploration phase. Mutual information-based measures of association are particularly promising, especially since the recent introduction of the TICe and MICe estimators, which combine computational efficiency with superior bias/variance properties. An open-source software implementation of these two measures that provides a complete procedure for testing their significance would be extremely useful.

Findings: Here, we present MICtools, a comprehensive and effective pipeline that combines TICe and MICe into a multistep procedure for identifying relationships of various degrees of complexity. MICtools calculates the strength of these relationships and assesses their statistical significance using a permutation-based strategy. The performance of the proposed approach is assessed by an extensive investigation on synthetic datasets, and an example of a potential application to a metagenomic dataset is also illustrated.

Conclusions: We show that MICtools, by combining TICe and MICe, is able to highlight associations that would not be captured by conventional strategies.
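To make the idea of a permutation-based significance test concrete, the following is a minimal sketch in Python. It is not the MICtools implementation. The function names (`abs_pearson`, `permutation_pvalue`) are hypothetical, and the absolute Pearson correlation is used only as a stand-in statistic; in the pipeline described above that role would be played by TICe. The empirical p-value uses the common add-one form (r + 1) / (n + 1), which is an assumption here rather than something stated in the abstract.

```python
import numpy as np


def abs_pearson(x, y):
    """Stand-in association statistic (NOT TICe or MICe)."""
    return abs(np.corrcoef(x, y)[0, 1])


def permutation_pvalue(x, y, statistic=abs_pearson, n_perm=999, seed=0):
    """Return (observed statistic, empirical p-value) for the pair (x, y).

    The null distribution is built by shuffling y, which breaks any
    pairing with x while preserving both marginal distributions.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    observed = statistic(x, y)
    null = np.array([statistic(x, rng.permutation(y)) for _ in range(n_perm)])

    # Count null values at least as extreme as the observed one; the
    # add-one correction keeps the estimate strictly positive.
    r = int(np.sum(null >= observed))
    return observed, (r + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.arange(50, dtype=float)
    y = 2 * x + rng.normal(scale=0.1, size=50)  # strongly associated pair
    stat, p = permutation_pvalue(x, y)
    print(f"statistic={stat:.3f}, p={p:.4f}")
```

When many variable pairs are tested this way, the resulting p-values would normally be corrected for multiple testing before associations are ranked by strength; that step is not shown in this sketch.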
