Sequential and Parallel Algorithms for the Generalized Maximum Subarray Problem

The maximum subarray problem (MSP) is to select the segment of consecutive array elements whose sum is the largest among all segments of a given array. Efficient algorithms for the MSP and related problems are expected to contribute to applications in genomic sequence analysis, data mining, and computer vision. The MSP is conceptually simple, and several optimal linear-time algorithms for the 1D version are already known. For the 2D version, the currently known upper bounds are cubic or near-cubic time.

For wider applicability, it is of interest to compute multiple maximum subarrays instead of just one, which motivates the work in the first half of the thesis. The generalized K-maximum subarray problem involves finding the K segments of largest sum in sorted order. Two subcategories can be defined: the K-overlapping maximum subarray problem (K-OMSP) and the K-disjoint maximum subarray problem (K-DMSP).

Studies on the K-OMSP have not been undertaken previously, so the thesis explores various techniques to speed up the computation and presents several new algorithms. The first algorithm for the 1D problem runs in O(Kn) time, followed by increasingly efficient algorithms running in O(K^2 + n log K), O((n + K) log K), and O(n + K log min(K, n)) time. Extending these results to higher dimensions is also considered, which establishes O(n^3) time for the 2D version of the problem when K is bounded by a certain range.

Ruzzo and Tompa studied the problem of all maximal scoring subsequences, whose definition is almost identical to that of the K-DMSP apart from a few subtle differences. Despite these differences, their linear-time algorithm is readily capable of solving the 1D K-DMSP, but it is not easily extended to higher dimensions. This observation motivates a new algorithm based on the tournament data structure, which runs in O(n + K log min(K, n)) worst-case time. The extended version of the new algorithm processes a 2D problem in O(n^3 + min(K, n) · n^2 log min(K, n)) time, which is O(n^3) for K ≤ n / log n.

For the 2D MSP, cubic-time sequential computation is still expensive for practical purposes, considering potential applications in computer vision and data mining. The second half of the thesis investigates speed-up through parallel computation. Previous parallel algorithms for the 2D MSP either place a huge demand on hardware resources or target parallel computation models that are purely theoretical. A good compromise between speed and cost can be realized by using a mesh topology. Two mesh algorithms for the 2D MSP with O(n) running time, requiring a network of size O(n^2), are designed and analyzed, and various techniques are considered to maximize their practicality.
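For orientation, the two baselines that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the classical linear-time 1D scan (commonly attributed to Kadane) and the standard cubic-time 2D method built on it; it is not one of the thesis's new K-maximum or mesh algorithms. The function names `max_subarray_1d` and `max_subarray_2d` and the return conventions are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def max_subarray_1d(a: List[float]) -> Tuple[float, int, int]:
    """Classical O(n) scan: return (best_sum, start, end) for the
    non-empty segment a[start..end] (inclusive) with the largest sum."""
    if not a:
        raise ValueError("array must be non-empty")
    best_sum, best_start, best_end = a[0], 0, 0
    cur_sum, cur_start = a[0], 0
    for i in range(1, len(a)):
        if cur_sum < 0:
            # A negative running sum can only hurt; restart at i.
            cur_sum, cur_start = a[i], i
        else:
            cur_sum += a[i]
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    return best_sum, best_start, best_end


def max_subarray_2d(grid: List[List[float]]) -> Tuple[float, int, int, int, int]:
    """Baseline for an m x n array in O(m^2 * n) time (cubic when m = n):
    for every pair of rows (top, bottom), collapse the strip into
    per-column sums and run the 1D scan on it.
    Returns (best_sum, top, left, bottom, right), all indices inclusive."""
    if not grid or not grid[0]:
        raise ValueError("grid must be non-empty")
    rows, cols = len(grid), len(grid[0])
    best: Optional[Tuple[float, int, int, int, int]] = None
    for top in range(rows):
        col_sums = [0] * cols  # column sums of rows top..bottom
        for bottom in range(top, rows):
            for j in range(cols):
                col_sums[j] += grid[bottom][j]
            s, left, right = max_subarray_1d(col_sums)
            if best is None or s > best[0]:
                best = (s, top, left, bottom, right)
    assert best is not None
    return best
```

For example, `max_subarray_1d([-2, 1, -3, 4, -1, 2, 1, -5, 4])` selects the segment at indices 3 to 6 with sum 6. The nested row-pair loop in `max_subarray_2d` is what makes the sequential 2D computation cubic, which is the cost the parallel mesh algorithms aim to reduce.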

[1]  Tadao Takaoka,et al.  A New Upper Bound on the Complexity of the All Pairs Shortest Path Problem , 1991, Inf. Process. Lett..

[2]  Yaw-Ling Lin,et al.  Efficient algorithms for locating the length-constrained heaviest segments with applications to biomolecular sequence analysis , 2002, J. Comput. Syst. Sci..

[3]  P. Nordin Genetic Programming III - Darwinian Invention and Problem Solving , 1999 .

[4]  Robert E. Tarjan,et al.  Design and Analysis of a Data Structure for Representing Sorted Lists , 1978, SIAM J. Comput..

[5]  Tadao Takaoka,et al.  Algorithms for the problem of K maximum sums and a VLSI algorithm for the K maximum subarrays problem , 2004, 7th International Symposium on Parallel Architectures, Algorithms and Networks, 2004. Proceedings..

[6]  H. T. Kung,et al.  Systolic Arrays (for VLSI) , 1978 .

[7]  Kuan-Yu Chen,et al.  Improved Algorithms for the k Maximum-Sums Problems , 2005, ISAAC.

[8]  Tetsuo Asano,et al.  Polynomial-time solutions to image segmentation , 1996, SODA '96.

[9]  Ellis Horowitz,et al.  Computer Algorithms / C++ , 2007 .

[10]  Jon Louis Bentley Programming pearls: perspective on performance , 1984, CACM.

[11]  S. Kung,et al.  VLSI Array processors , 1985, IEEE ASSP Magazine.

[12]  Jingsen Chen,et al.  A note on ranking k maximum sums , 2005 .

[13]  Jan van Leeuwen,et al.  Worst-case Analysis of Set Union Algorithms , 1984, JACM.

[14]  Tadao Takaoka,et al.  An Efficient VLSI Algorithm for the All Pairs Shortest Path Problem , 1992, J. Parallel Distributed Comput..

[15]  Hsueh-I Lu,et al.  An Optimal Algorithm for Maximum-Sum Segment and Its Application in Bioinformatics (Extended Abstract) , 2003, CIAA.

[16]  Jeffrey D. Ullman,et al.  Set Merging Algorithms , 1973, SIAM J. Comput..

[17]  Denis Trystram,et al.  Parallel algorithms and architectures , 1995 .

[18]  Franco P. Preparata,et al.  Area-Time Optimal VLSI Networks for Multiplying Matrices , 1980, Inf. Process. Lett..

[19]  Michael J. Flynn,et al.  PROCESSES AND THEIR INTERACTIONS , 1976 .

[20]  Selim G. Akl,et al.  Parallel Maximum Sum Algorithms on Interconnection Networks , 1999 .

[21]  Gerth Stølting Brodal,et al.  Partially Persistent Data Structures of Bounded Degree with Constant Update Time , 1994, Nord. J. Comput..

[22]  K. Design of Special-Purpose VLSI Chips : Example and Opinions , .

[23]  Tadao Takaoka An O(n^3 log log n / log n) time algorithm for the all-pairs shortest path problem , 2005, Inf. Process. Lett..

[24]  Hisao Tamaki,et al.  Algorithms for the maximum subarray problem based on matrix multiplication , 1998, SODA '98.

[25]  Alan Bundy,et al.  Constructing Induction Rules for Deductive Synthesis Proofs , 2006, CLASE.

[26]  Robert E. Tarjan,et al.  Making Data Structures Persistent , 1989, J. Comput. Syst. Sci..

[27]  Michael J. Fischer,et al.  An improved equivalence algorithm , 1964, CACM.

[28]  F. Leighton,et al.  Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes , 1991 .

[29]  Andrew Chi-Chih Yao,et al.  Information Bounds Are Weak in the Shortest Distance Problem , 1980, JACM.

[30]  Tadao Takaoka,et al.  Algorithm for K Disjoint Maximum Subarrays , 2006, International Conference on Computational Science.

[31]  Fredrik Bengtsson,et al.  Efficient Algorithms for k Maximum Sums , 2004, ISAAC.

[32]  Narsingh Deo,et al.  Parallel Algorithms for Maximum Subsequence and Maximum Subarray , 1995, Parallel Processing Letters.

[33]  Tadao Takaoka A Faster Algorithm for the All-Pairs Shortest Path Problem and Its Application , 2004, COCOON.

[34]  Michael J. Fischer,et al.  Efficiency of Equivalence Algorithms , 1972, Complexity of Computer Computations.

[35]  Walter L. Ruzzo,et al.  A Linear Time Algorithm for Finding All Maximal Scoring Subsequences , 1999, ISMB.

[36]  Michael Q. Zhang,et al.  Computational identification of promoters and first exons in the human genome , 2001, Nature Genetics.

[37]  David Gries,et al.  A Note on a Standard Strategy for Developing Loop Invariants and Loops , 1982, Sci. Comput. Program..

[38]  Ming-Yang Kao,et al.  Linear-time algorithms for computing maximum-density sequence segments with bioinformatics applications , 2002, J. Comput. Syst. Sci..

[39]  S. Karlin,et al.  Chance and statistical significance in protein and DNA sequence analysis. , 1992, Science.

[40]  Greg N. Frederickson,et al.  An Optimal Algorithm for Selection in a Min-Heap , 1993, Inf. Comput..

[41]  Mark Allen Weiss,et al.  Data structures and algorithm analysis in C , 1991 .

[42]  A. Krogh,et al.  Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. , 2001, Journal of molecular biology.

[43]  Fredrik Bengtsson,et al.  Ranking k maximum sums , 2007, Theor. Comput. Sci..

[44]  W Miller,et al.  Locus control regions of mammalian beta-globin gene clusters: combining phylogenetic analyses and experimental results to gain functional insights. , 1997, Gene.

[45]  Tadao Takaoka,et al.  Analysis of air pollution (PM10) and respiratory morbidity rate using K-maximum sub-array (2-D) algorithm , 2007, SAC '07.

[46]  Tadao Takaoka,et al.  Efficient Algorithms for the Maximum Subarray Problem by Distance Matrix Multiplication , 2002, CATS.

[47]  Tadao Takaoka,et al.  Improved Algorithms for the K-Maximum Subarray Problem for Small K , 2005, COCOON.

[48]  Robert E. Tarjan,et al.  Efficiency of a Good But Not Linear Set Union Algorithm , 1972, JACM.

[49]  Stephen F. Altschul,et al.  Evaluating the Statistical Significance of Multiple Distinct Local Alignments , 1997 .

[50]  T. Pollard,et al.  Annual review of biophysics and biophysical chemistry , 1985 .

[51]  X. Huang,et al.  An algorithm for identifying regions of a DNA sequence that satisfy a content requirement , 1994, Comput. Appl. Biosci..

[52]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[53]  Donald B. Johnson,et al.  Selecting the Kth element in X + Y and X_1 + X_2 + ... + X_m , 1978, SIAM J. Comput..

[54]  Donald B. Johnson,et al.  The Complexity of Selection and Ranking in X+Y and Matrices with Sorted Columns , 1982, J. Comput. Syst. Sci..

[55]  M S Boguski,et al.  Analysis of conserved domains and sequence motifs in cellular regulatory proteins and locus control regions using new software tools for multiple alignment and visualization. , 1992, The New biologist.

[56]  Ronald L. Rivest,et al.  Expected time bounds for selection , 1975, Commun. ACM.

[57]  George Karypis,et al.  Introduction to Parallel Computing , 1994 .

[58]  Sridhar Hannenhalli,et al.  Promoter prediction in the human genome , 2001, ISMB.

[59]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[60]  S F Altschul,et al.  Statistical methods and insights for protein and DNA sequences. , 1991, Annual review of biophysics and biophysical chemistry.

[61]  John R. Koza,et al.  Genetic Programming III: Darwinian Invention & Problem Solving , 1999 .

[62]  Tadao Takaoka,et al.  Improved Algorithms for the K-Maximum Subarray Problem , 2006, Comput. J..

[63]  Kuan-Yu Chen,et al.  On the range maximum-sum segment query problem , 2007, Discret. Appl. Math..

[64]  Val C. Sheffield,et al.  Short tandem repeat polymorphic markers for the rat genome from marker-selected libraries , 1998, Mammalian Genome.

[65]  R. Casadio,et al.  Prediction of the transmembrane regions of β‐barrel membrane proteins with a neural network‐based predictor , 2001, Protein science : a publication of the Protein Society.

[66]  Tadao Takaoka,et al.  Algorithms for k-Disjoint Maximum Subarrays , 2007, Int. J. Found. Comput. Sci..

[67]  Yasuhiko Morimoto,et al.  Data Mining with optimized two-dimensional association rules , 2001, TODS.

[68]  Kuan-Yu Chen,et al.  Improved algorithms for the k maximum-sums problems , 2006, Theor. Comput. Sci..

[69]  Tadao Takaoka,et al.  Algorithms for data mining , 2006 .

[70]  Robert E. Tarjan,et al.  Planar Point Location Using Persistent Search Trees , 1989 .

[71]  Haim Kaplan,et al.  Purely functional representations of catenable sorted lists , 1996, STOC '96.

[72]  Mark Allen Weiss,et al.  Data structures and algorithm analysis , 1991 .

[73]  Yasuhiko Morimoto,et al.  Computing Optimized Rectilinear Regions for Association Rules , 1997, KDD.

[74]  W. Miller,et al.  Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions. , 1999, Nucleic acids research.

[75]  Pranay Chaudhuri Parallel algorithms: design and analysis , 1992 .

[76]  Frances L. Van Scoy The Parallel Recognition of Classes of Graphs , 1980, IEEE Trans. Computers.

[77]  Alfred V. Aho,et al.  The Design and Analysis of Computer Algorithms , 1974 .

[78]  R. Doolittle,et al.  A simple method for displaying the hydropathic character of a protein. , 1982, Journal of molecular biology.

[79]  Allan Grønlund Jørgensen,et al.  A Linear Time Algorithm for the k Maximal Sums Problem , 2007, MFCS.

[80]  Ramakrishnan Srikant,et al.  Mining quantitative association rules in large relational tables , 1996, SIGMOD '96.

[81]  G. Amdahl,et al.  Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[82]  G.E. Moore,et al.  Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.

[83]  Piero Fariselli,et al.  MaxSubSeq: an algorithm for segment-length optimization. The case study of the transmembrane spanning segments , 2003, Bioinform..

[84]  Bernhard Seeger,et al.  An asymptotically optimal multiversion B-tree , 1996, The VLDB Journal.

[85]  Zhaofang Wen Fast Parallel Algorithms for the Maximum Sum Problem , 1995, Parallel Comput..

[86]  David Thomas,et al.  The Art in Computer Programming , 2001 .

[87]  S Karlin,et al.  Methods and algorithms for statistical analysis of protein sequences. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[88]  Jon Bentley,et al.  Programming pearls: algorithm design techniques , 1984, CACM.

[89]  David Eppstein,et al.  Finding the k shortest paths , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[90]  D. T. Lee,et al.  Randomized Algorithm for the Sum Selection Problem , 2005, ISAAC.

[91]  M. J. Quinn,et al.  Parallel Computing: Theory and Practice , 1994 .

[92]  Manuel Blum,et al.  Time Bounds for Selection , 1973, J. Comput. Syst. Sci..

[93]  Tadao Takaoka,et al.  Ranking Cartesian Sums and K-maximum subarrays , 2006 .

[94]  Lech Banachowski,et al.  A Complement to Tarjan's Result about the Lower Bound on the Complexity of the Set Union Problem , 1980, Inf. Process. Lett..