A Probabilistic Approach to Identifying Consensus in Molecular Sequences

Given a profile of nucleic acid bases at a specified position in an aligned set of molecular sequences, a simple rule for defining ambiguity codes is presented: every base whose frequency in the profile falls below the maximum profile frequency by no more than a specified number d is included in the ambiguity code. Ways of defining d are described that ensure this ‘containing subset’ possesses desirable properties under the assumption of a multinomial model for the frequencies of bases in the profile. The method is illustrated on two data sets, and its characteristics are discussed in terms of some possible properties for consensus methods presented by Day and McMorris (1992a).
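The inclusion rule stated above can be illustrated with a minimal sketch. It assumes the profile is given as raw base counts for A, C, G and T, that d is expressed in the same units as those counts, and that the resulting subset is reported as a standard IUPAC nucleotide ambiguity code. The function names and the example counts are hypothetical, and the sketch does not show how d would be chosen under the multinomial model.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases each one denotes.
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
IUPAC = {frozenset(bases): code for code, bases in _CODES.items()}


def containing_subset(profile, d):
    """Return the bases whose frequency is within d of the maximum frequency.

    profile: mapping from base ('A', 'C', 'G', 'T') to its frequency
             (assumed here to be a count at one alignment position).
    d:       non-negative threshold, in the same units as the frequencies.
    """
    top = max(profile.values())
    return frozenset(base for base, freq in profile.items() if top - freq <= d)


def ambiguity_code(profile, d):
    """Translate the containing subset into an IUPAC ambiguity code."""
    return IUPAC[containing_subset(profile, d)]


# Hypothetical counts at one position of an alignment of 20 sequences.
example = {"A": 10, "C": 8, "G": 1, "T": 1}
print(ambiguity_code(example, d=0))   # only the most frequent base is kept
print(ambiguity_code(example, d=3))   # C is within 3 of the maximum, so A and C
print(ambiguity_code(example, d=10))  # every base is within 10 of the maximum
```

With d = 0 the rule reduces to reporting the most frequent base (or bases, if tied), while a sufficiently large d always yields the fully ambiguous code, so d controls how readily a position is declared ambiguous.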