A near pattern-matching scheme based upon principal component analysis

Abstract. We present an efficient heuristic scheme for near pattern matching. Building on principal component analysis, a standard technique in multivariate statistics, we develop algorithms that generate a set of new identifying keys for a given set of patterns, reducing the number of comparisons needed during the near-matching process. After preprocessing, the near-matching operation takes O(n log m) time in the worst case, where n is the length of the text file being searched and m is the number of identifying segments extracted from the patterns.
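To make the idea of principal-component keys concrete, the following is a minimal illustrative sketch and not the algorithm developed in this paper. It assumes that all identifying segments have the same fixed length, that each segment is encoded as a vector of character codes, and that a single scalar key is obtained by projecting onto the first principal component (approximated here by power iteration). The function names, the tolerance `tol`, and the Hamming-distance verification step are all hypothetical choices made for illustration.

```python
from bisect import bisect_left, bisect_right


def first_component(vectors, iters=100):
    """Return (mean, w), where w approximates the first principal component."""
    m, k = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / m for j in range(k)]
    centered = [[v[j] - mean[j] for j in range(k)] for v in vectors]
    start = [1.0 / (j + 1) for j in range(k)]  # arbitrary non-zero start
    norm = sum(x * x for x in start) ** 0.5
    w = [x / norm for x in start]
    for _ in range(iters):
        # Power iteration on the unnormalised covariance: w <- X^T (X w).
        proj = [sum(c[j] * w[j] for j in range(k)) for c in centered]
        nxt = [sum(proj[i] * centered[i][j] for i in range(m)) for j in range(k)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:  # degenerate data: keep the current direction
            break
        w = [x / norm for x in nxt]
    return mean, w


def key(segment, mean, w):
    """Scalar key: projection of the centered segment onto w."""
    return sum((ord(ch) - mu) * wj for ch, mu, wj in zip(segment, mean, w))


def build_index(segments):
    """Preprocessing: compute one key per segment and sort by key."""
    vectors = [[ord(ch) for ch in s] for s in segments]
    mean, w = first_component(vectors)
    entries = sorted((key(s, mean, w), s) for s in segments)
    return mean, w, entries


def near_matches(text, index, tol=1.0, max_mismatches=1):
    """Slide a window over the text; binary-search the sorted keys for
    candidates within tol, then verify each by Hamming distance."""
    mean, w, entries = index
    keys = [e[0] for e in entries]
    k = len(mean)
    hits = []
    for pos in range(len(text) - k + 1):
        window = text[pos:pos + k]
        q = key(window, mean, w)
        lo = bisect_left(keys, q - tol)
        hi = bisect_right(keys, q + tol)
        for _, seg in entries[lo:hi]:
            if sum(a != b for a, b in zip(window, seg)) <= max_mismatches:
                hits.append((pos, seg))
    return hits
```

In this sketch each of the n text windows costs one binary search over the m sorted keys, which matches the O(n log m) shape quoted above only if the segment length is treated as a constant and the number of candidates per window stays bounded. Because the key is a one-dimensional projection, a small tolerance can miss some near matches, which is why the scheme is heuristic.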