Application of a MAX-CUT Heuristic to the Contig Orientation Problem in Genome Assembly

In genome assembly, the contig orientation problem is the problem of removing enough edges from the scaffold graph that the remaining subgraph assigns a consistent orientation to every sequence node in the graph. This problem can also be phrased as a weighted MAX-CUT problem, but the performance of MAX-CUT heuristics in this application has not been tested. We present a greedy heuristic for the contig orientation problem and compare its performance with that of a weighted MAX-CUT heuristic based on semidefinite programming on several graphs. We also note that the contig orientation problem can be used to identify inverted repeats and inverted haplotypes, since these are sequences whose orientation appears ambiguous in the conventional genome assembly framework.
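To make the formulation concrete, the following is a minimal sketch of a generic greedy orientation pass, not the heuristic evaluated in this work. It assumes a hypothetical input format in which each scaffold link is a tuple (u, v, weight, relation), where relation is +1 if the link implies that contigs u and v share an orientation and -1 if it implies opposite orientations. The function names greedy_orient, satisfied_weight and removed_links are illustrative only. When every link has relation -1, maximizing the satisfied weight is exactly weighted MAX-CUT.

```python
from collections import defaultdict


def greedy_orient(num_contigs, links):
    """Assign +1/-1 orientations to contigs 0..num_contigs-1, one at a time.

    Each contig takes the orientation that agrees with the larger total
    weight of links to contigs that have already been oriented.
    """
    adj = defaultdict(list)
    for u, v, w, rel in links:
        adj[u].append((v, w, rel))
        adj[v].append((u, w, rel))

    orient = {}
    for node in range(num_contigs):
        score = 0.0
        for nbr, w, rel in adj[node]:
            if nbr in orient:
                # Positive contribution favours orientation +1 for `node`.
                score += w * rel * orient[nbr]
        orient[node] = 1 if score >= 0 else -1
    return orient


def satisfied_weight(orient, links):
    """Total weight of links consistent with the chosen orientations."""
    return sum(w for u, v, w, rel in links if orient[u] * orient[v] == rel)


def removed_links(orient, links):
    """Links that must be dropped for the orientation to be consistent."""
    return [l for l in links if orient[l[0]] * orient[l[1]] != l[3]]


if __name__ == "__main__":
    # Toy scaffold graph: three contigs, one inconsistent cycle.
    links = [(0, 1, 5, 1), (1, 2, 3, -1), (0, 2, 1, 1)]
    orient = greedy_orient(3, links)
    print(orient)
    print(satisfied_weight(orient, links))
    print(removed_links(orient, links))
```

On the toy graph the three links cannot all be satisfied, so the pass keeps the two heavier links and drops the lightest one. The result depends on the order in which contigs are visited, which is one reason a comparison against a semidefinite programming heuristic is informative.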
