A novel phylogeny-based pattern selection algorithm and its application to microbiomic data

Discriminative patterns describe significant differences between types of subjects and often provide insight into critical properties of the problem at hand. Pattern-based classifiers can use discriminative patterns directly to predict unseen samples through a majority-voting or aggregation mechanism. We are therefore concerned not only with finding useful individual patterns but also with the effectiveness of the pattern set as a whole, so it is imperative to ensure that the discriminative patterns are both relevant and non-redundant. Few studies have evaluated pattern redundancy by examining the samples that the patterns cover, and those that do have focused mostly on the proportion of overlapping samples, leaving much of the information in the non-overlapping samples unused. To address this issue, we present a novel pattern selection algorithm that estimates pattern redundancy from not only the proportion of overlapping samples but also the resemblance of the non-overlapping samples. We applied the proposed method to two real microbiomic datasets, with the aim of providing new insights into the interactions between microbes and their effects on the host. Compared with other robust classifiers and feature selection heuristics, our pattern selection algorithm produced diverse and compact sets of final patterns with comparable or even superior predictive capability.
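
The idea of scoring redundancy from both overlapping and non-overlapping samples can be illustrated with a minimal sketch. This is not the algorithm from the paper, which is not specified in the abstract. The function names, the Jaccard-style overlap, the mean pairwise `similarity` callback, and the way the two terms are combined are all assumptions made for illustration. In a phylogeny-based setting the `similarity` function would presumably be derived from a tree-based distance, but that is also an assumption.

```python
from typing import Callable, FrozenSet, Hashable

Sample = Hashable


def overlap_proportion(cov_a: FrozenSet[Sample], cov_b: FrozenSet[Sample]) -> float:
    """Share of covered samples that both patterns cover (Jaccard index)."""
    union = cov_a | cov_b
    if not union:
        return 0.0
    return len(cov_a & cov_b) / len(union)


def non_overlap_resemblance(
    cov_a: FrozenSet[Sample],
    cov_b: FrozenSet[Sample],
    similarity: Callable[[Sample, Sample], float],
) -> float:
    """Mean pairwise similarity (in [0, 1]) between samples covered by only one pattern."""
    only_a = cov_a - cov_b
    only_b = cov_b - cov_a
    if not only_a or not only_b:
        return 0.0
    total = sum(similarity(x, y) for x in only_a for y in only_b)
    return total / (len(only_a) * len(only_b))


def redundancy(
    cov_a: FrozenSet[Sample],
    cov_b: FrozenSet[Sample],
    similarity: Callable[[Sample, Sample], float],
) -> float:
    """Hypothetical combined score: overlap, plus resemblance scaled by the non-overlapping share."""
    overlap = overlap_proportion(cov_a, cov_b)
    resemblance = non_overlap_resemblance(cov_a, cov_b, similarity)
    return overlap + (1.0 - overlap) * resemblance


# Two patterns covering sample IDs; a constant similarity stands in for a real measure.
pattern_a = frozenset({1, 2, 3})
pattern_b = frozenset({2, 3, 4})
score = redundancy(pattern_a, pattern_b, lambda x, y: 0.5)
```

Under this hypothetical combination, two patterns with identical coverage score 1, and two patterns with disjoint coverage are still treated as redundant to the degree that the samples they cover resemble each other, which is the information an overlap-only measure would discard.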
