Alns: a new searchable and filterable sequence alignment format

Nucleotides and amino acids are the basic building blocks of RNA, DNA and protein. Although studies of how changes in these building blocks affect the phenotypes of these biopolymers are ever increasing, many popular alignment formats are still those generated by pairwise comparison tools such as the Basic Local Alignment Search Tool (BLAST). These alignments are easy for researchers to read but are not convenient for searching, filtering and storage, in particular when thousands of alignments are generated from highly conserved sequences. Here, we introduce a new alignment format, alns, to facilitate rapid and convenient association of genetic changes and similarity with other sources of information such as phenotype, disease state, time, geography and taxonomy via simple spreadsheet functions. The format should assist biologists from a wide range of disciplines in knowledge discovery.
