First International Workshop on Mining Multiple Information Sources

This paper demonstrates how methods borrowed from information fusion can improve the performance of a classifier by constructing (“fusing”) new features that are combinations of existing numeric features. This work is an example of local pattern analysis and fusion because it identifies potentially useful patterns (i.e., feature combinations) from a single data source. In our work, we fuse features by mapping the numeric values for each feature to a rank and then averaging these ranks. The quality of the fused features is measured with respect to how well they classify minority-class examples, which makes this method especially effective for dealing with data sets that exhibit class imbalance. This paper evaluates our combinatorial feature fusion method on ten data sets, using three learning methods. The results indicate that our method can be quite effective in improving classifier performance, although it seems to improve the performance of some learning methods more than others.

General Terms: Algorithms, Performance, Experimentation
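The rank-averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`to_ranks`, `fuse`, `minority_hits`), the toy data, and the tie-handling rule (tied values share their average rank) are assumptions made for illustration. The quality score in particular is a hypothetical stand-in, since the abstract says only that fused features are judged by how well they classify minority-class examples; here it is the share of minority examples among the top-k examples by fused rank, assuming that larger feature values point toward the minority class.

```python
from itertools import combinations


def to_ranks(values):
    """Map numeric values to 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def fuse(columns):
    """Fuse several feature columns by averaging each example's per-feature ranks."""
    ranked = [to_ranks(col) for col in columns]
    n = len(columns[0])
    return [sum(r[i] for r in ranked) / len(ranked) for i in range(n)]


def minority_hits(scores, labels, minority=1):
    """Hypothetical quality score (an assumption, not taken from the paper):
    fraction of minority examples among the top-k scores, where k is the
    number of minority examples."""
    k = sum(1 for y in labels if y == minority)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in top if labels[i] == minority) / k


# Toy data (invented for illustration): two numeric features, five examples,
# with examples 1 and 4 belonging to the minority class.
features = {
    "f1": [0.1, 0.9, 0.2, 0.8, 0.4],
    "f2": [60, 30, 10, 20, 70],
}
labels = [0, 1, 0, 0, 1]

# Score each single feature, then every pairwise fusion.
single_scores = {
    name: minority_hits(to_ranks(col), labels) for name, col in features.items()
}
fused_scores = {
    pair: minority_hits(fuse([features[name] for name in pair]), labels)
    for pair in combinations(features, 2)
}
```

On this toy data each feature alone places only one of the two minority examples in its top two, while the fused feature places both there. The example is constructed to show the mechanism and says nothing about how often fusion helps on real data.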
