Using hierarchical cluster models to systematically identify groups of jobs with similar occupational questionnaire response patterns to assist rule-based expert exposure assessment in population-based studies.

OBJECTIVES Rule-based expert exposure assessment based on questionnaire response patterns in population-based studies improves the transparency of the decisions. The number of unique response patterns, however, can be nearly equal to the number of jobs. An expert may use judgment to reduce the number of patterns that need assessment, but different experts may single out different response patterns as defining an exposure scenario. Here, hierarchical clustering methods are proposed as a systematic data reduction step to reproducibly identify similar questionnaire response patterns prior to obtaining expert estimates. As a proof-of-concept, we used hierarchical clustering methods to identify groups of jobs (clusters) with similar responses to diesel exhaust-related questions and then evaluated whether the jobs within a cluster had similar (previously assessed) estimates of occupational diesel exhaust exposure.

METHODS Using the New England Bladder Cancer Study as a case study, we applied hierarchical cluster models to the diesel-related variables extracted from the occupational history and job- and industry-specific questionnaires (modules). Cluster models were developed separately for two subsets: (i) 5395 jobs with ≥1 variable extracted from the occupational history indicating a potential diesel exposure scenario, but without a module with diesel-related questions; and (ii) 5929 jobs with both occupational history and module responses to diesel-relevant questions. For each subset, we varied the number of clusters extracted from each model's cluster tree from 100 to 1000 groups of jobs. Using previously made estimates of the probability (ordinal), intensity (µg m⁻³ respirable elemental carbon), and frequency (hours per week) of occupational exposure to diesel exhaust, we examined the similarity of the exposure estimates for jobs within the same cluster in two ways. First, we examined cluster homogeneity (defined as >75% of jobs in the cluster having the same estimate) for a dichotomized probability estimate (<5 versus ≥5%; <50 versus ≥50%). Second, for the ordinal probability metric and the continuous intensity and frequency metrics, we calculated the intraclass correlation coefficients (ICCs) between each job's estimate and the mean estimate for all jobs within the cluster.

RESULTS Within-cluster homogeneity increased when more clusters were used. For example, ≥80% of the clusters were homogeneous when 500 clusters were used. Similarly, ICCs were generally above 0.7 when ≥200 clusters were used, indicating minimal within-cluster variability. The most within-cluster variability was observed for the frequency metric (ICCs from 0.4 to 0.8). We estimated that using an expert to assign exposure at the cluster level and then to review each job in non-homogeneous clusters would require ~2000 decisions per expert, in contrast to evaluating 4255 unique questionnaire patterns or 14983 individual jobs.

CONCLUSIONS This proof-of-concept shows that using cluster models as a data reduction step to identify jobs with similar response patterns prior to obtaining expert ratings has the potential to aid rule-based assessment by systematically reducing the number of exposure decisions needed. While promising, additional research is needed to quantify the actual reduction in exposure decisions and the resulting homogeneity of exposure estimates within clusters for an exposure assessment effort that obtains cluster-level expert assessments as part of the assessment process.
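The workflow summarized in the abstract (cluster the response patterns, check each cluster's homogeneity against the >75% rule, then count cluster-level decisions plus job-level reviews in non-homogeneous clusters) can be illustrated with a minimal Python sketch. Everything in it beyond the 75% threshold is an assumption made for illustration: the abstract does not state the distance measure or linkage method, so Hamming distance and average linkage are stand-ins, and the function names, the toy response patterns, and the toy exposure estimates are hypothetical rather than study data.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Fraction of questionnaire items on which two jobs differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_jobs(patterns, n_clusters):
    """Naive average-linkage agglomerative clustering; returns one label per job.

    Stopping at n_clusters corresponds to cutting the full tree at that many groups.
    """
    clusters = [[i] for i in range(len(patterns))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(hamming(patterns[i], patterns[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] += clusters[b]
        del clusters[b]

    labels = [0] * len(patterns)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def is_homogeneous(estimates, threshold=0.75):
    """True if more than `threshold` of the jobs share the same estimate."""
    most_common_count = Counter(estimates).most_common(1)[0][1]
    return most_common_count / len(estimates) > threshold


def expert_decisions(labels, estimates):
    """One decision per cluster plus one per job in each non-homogeneous cluster."""
    by_cluster = {}
    for label, estimate in zip(labels, estimates):
        by_cluster.setdefault(label, []).append(estimate)
    reviews = sum(len(e) for e in by_cluster.values() if not is_homogeneous(e))
    return len(by_cluster) + reviews


# Hypothetical toy data: six jobs, four yes/no diesel-related responses each.
patterns = [
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (1, 1, 0, 1),
    (0, 0, 1, 1),
    (0, 0, 1, 1),
    (0, 0, 1, 0),
]
# Hypothetical dichotomized probability estimates (exposed or not).
exposed = [True, True, True, False, False, True]

labels = cluster_jobs(patterns, n_clusters=2)
print(labels)                             # [0, 0, 0, 1, 1, 1]
print(expert_decisions(labels, exposed))  # 5
```

In this toy case the first cluster is homogeneous and needs a single decision, while the second (two of three jobs agreeing, below the 75% cutoff) needs a cluster-level decision plus a review of each of its three jobs, giving five decisions in total. The naive merge loop is only practical for small inputs; a dataset of the size described would call for an optimized library implementation.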
