Paper Information - First International Workshop on Mining Multiple Information Sources

This paper demonstrates how methods borrowed from information fusion can improve the performance of a classifier by constructing (“fusing”) new features that are combinations of existing numeric features. This work is an example of local pattern analysis and fusion because it identifies potentially useful patterns (i.e., feature combinations) from a single data source. In our work, we fuse features by mapping the numeric values for each feature to a rank and then averaging these ranks. The quality of the fused features is measured with respect to how well they classify minority-class examples, which makes this method especially effective for dealing with data sets that exhibit class imbalance. This paper evaluates our combinatorial feature fusion method on ten data sets, using three learning methods. The results indicate that our method can be quite effective in improving classifier performance, although it seems to improve the performance of some learning methods more than others.

General Terms: Algorithms, Performance, Experimentation
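The rank-and-average step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`to_ranks`, `fuse`, `minority_hits`, `best_pair`), the average-rank treatment of ties, the restriction to pairs of features, and the top-k scoring rule are all assumptions made for illustration. In particular, the abstract says only that fused features are judged by how well they classify minority-class examples; the score used here (share of minority examples among the k highest fused ranks, with higher rank assumed to indicate the minority class) is a hypothetical stand-in.

```python
from itertools import combinations


def to_ranks(values):
    """Map numeric values to 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def fuse(columns):
    """Fuse numeric feature columns by averaging their per-example ranks."""
    ranked = [to_ranks(col) for col in columns]
    return [sum(r) / len(ranked) for r in zip(*ranked)]


def minority_hits(fused, labels, minority=1):
    """Hypothetical score (an assumption, not from the paper): share of
    minority examples among the k highest fused ranks, where k is the
    number of minority examples."""
    k = sum(1 for y in labels if y == minority)
    if k == 0:
        return 0.0
    top = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]
    return sum(1 for i in top if labels[i] == minority) / k


def best_pair(features, labels, minority=1):
    """Hypothetical search: score every pair of named features, keep the best."""
    scored = [
        (minority_hits(fuse([features[a], features[b]]), labels, minority), (a, b))
        for a, b in combinations(sorted(features), 2)
    ]
    return max(scored)


if __name__ == "__main__":
    features = {"a": [1, 2, 3, 4], "b": [4, 3, 2, 1], "c": [1, 2, 4, 3]}
    labels = [0, 0, 1, 1]  # 1 marks the minority class in this toy data
    print(best_pair(features, labels))
```

Working on ranks rather than raw values puts features with different scales on a common footing before they are averaged, which is one plausible reason for the rank mapping the abstract describes.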
