A computer-based coding scheme for research data 1,2

This paper addresses the need to reduce both the complexity of data encoding and the error rate in studies that use multiple data bases composed of hierarchical file structures. It describes a coding scheme for representing long alphanumeric values, indicates the efficiency of such a scheme, and discusses the reduction in error rates. Several approaches to minimizing coding errors are available. Numeric codes with embedded information allocated to positions within the value codes are widely used, but they are unacceptable for variables with many values or levels of classification. Such “smart” codes require full knowledge of the universe the variables describe, as well as of the potential classification schemes for each variable. We have found that codes without embedded information (“nonsense” codes) circumvent the problems associated with smart codes. With nonsense codes, each alphanumeric variable value is assigned a sequential numeric code as new values are encountered in the data base, irrespective of its position in the alphanumeric sequence. Nonsense codes are especially useful when the data base is first being developed, since knowledge of the number of classification levels for each variable is not necessary.
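The assignment rule for nonsense codes can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `NonsenseCoder`, its methods, and the sample values are hypothetical, and the sketch assumes only that codes start at 1 and are handed out in order of first appearance.

```python
class NonsenseCoder:
    """Hypothetical helper: assigns sequential integer codes to values
    in the order they are first encountered, with no embedded meaning."""

    def __init__(self):
        self._code_of = {}   # alphanumeric value -> numeric code
        self._value_of = []  # position (code - 1) -> alphanumeric value

    def encode(self, value):
        """Return the existing code for value, or assign the next one."""
        code = self._code_of.get(value)
        if code is None:
            self._value_of.append(value)
            code = len(self._value_of)  # assumption: codes start at 1
            self._code_of[value] = code
        return code

    def decode(self, code):
        """Return the alphanumeric value that a code stands for."""
        if not 1 <= code <= len(self._value_of):
            raise KeyError(code)
        return self._value_of[code - 1]


# Illustrative values only; they do not come from the paper.
coder = NonsenseCoder()
records = ["Zea mays", "Avena sativa", "Zea mays", "Hordeum vulgare"]
codes = [coder.encode(value) for value in records]
```

In this sketch "Zea mays" receives code 1 because it is seen first, even though it sorts last alphabetically, and no count of distinct values has to be known before coding begins.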