Common pitfalls in bioinformatics-based analyses: look before you leap.

With the explosion of information in the nucleotide and protein sequence databases, biologists are increasingly turning to web-based bioinformatics programs to analyze their molecules of interest. Some of these programs have an intuitive interface, whereas others appear quite complex with several parameters to choose from before analysis. In either case, it is quite easy to come to erroneous conclusions about the questions that are being asked. This can result from incorrect assumptions on the part of the biologist or because of a limitation of the program or the database being used. In this article we will discuss some of the popular programs that can be used to make predictions in a scenario such as the discovery of a novel gene or when one is working on a less-characterized molecule. We will elaborate on some of the common pitfalls that can be avoided if certain precautions are taken during such analyses.
