Proportional fault-tolerant data mining with applications to bioinformatics

The mining of frequent patterns in databases has been studied for several years, but few reports have addressed fault-tolerant (FT) pattern mining. FT data mining is better suited to extracting interesting information from real-world data that may be polluted by noise. In particular, the growing volume of biological data calls for such a technique to mine important patterns, e.g., motifs. In this paper, we propose the concept of proportional FT mining of frequent patterns, in which the number of tolerable faults in a pattern is proportional to the length of the pattern. Two algorithms are designed to solve this problem. The first, FT-BottomUp, applies an FT-Apriori heuristic and finds all FT patterns with any number of faults. The second, FT-LevelWise, divides all FT patterns into several groups according to the number of tolerable faults and mines the content patterns of each group in turn. Applied to real data, our approach finds two reported epitopes of spike proteins of SARS-CoV in the resulting itemset, and proportional FT data mining performs better than fixed FT data mining for this application.
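To make the proportional idea concrete, the following minimal Python sketch counts how many transactions contain a pattern when the number of allowed faults grows with the pattern length. It is an illustration under stated assumptions, not the paper's FT-BottomUp or FT-LevelWise algorithm: the function names, the `fault_ratio` parameter, the use of `floor(fault_ratio * length)` as the fault budget, and the definition of a fault as a pattern item missing from a transaction are all assumptions made for this example.

```python
from math import floor


def tolerable_faults(pattern_length, fault_ratio):
    """Fault budget for a pattern (assumed here: floor(fault_ratio * length))."""
    return floor(fault_ratio * pattern_length)


def ft_contains(transaction, pattern, fault_ratio):
    """True if the transaction misses no more pattern items than the budget allows."""
    items = set(pattern)
    missing = len(items - set(transaction))
    return missing <= tolerable_faults(len(items), fault_ratio)


def ft_support(transactions, pattern, fault_ratio):
    """Number of transactions that contain the pattern up to the fault budget."""
    return sum(ft_contains(t, pattern, fault_ratio) for t in transactions)


# Hypothetical toy data, not taken from the paper.
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "d"},
]

# A length-4 pattern with ratio 0.25 tolerates one missing item.
print(ft_support(transactions, ("a", "b", "c", "d"), 0.25))  # 2
# A length-2 pattern with the same ratio tolerates none, so matching is exact.
print(ft_support(transactions, ("a", "b"), 0.25))  # 3
```

With a fixed fault budget, short and long patterns would be allowed the same number of faults; here the budget scales with length, so short patterns must match exactly while longer ones may miss a few items.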
