Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies.

OBJECTIVES Computer-assisted coding of job descriptions to standardized occupational classification codes facilitates evaluating occupational risk factors in epidemiologic studies by reducing the number of jobs needing expert coding. We evaluated the accuracy of the second version of SOCcer, a computerized algorithm designed to code free-text job descriptions to the US SOC-2010 system based on free-text job titles and work tasks.

METHODS SOCcer was updated to v2 by expanding the training data to include jobs from several epidemiologic studies and by revising the algorithm to account for nonlinearity and incorporate interactions. We evaluated the agreement between codes assigned by experts and the highest-scoring code from SOCcer v1 and v2 (the score is a measure of confidence in the algorithm-predicted assignment) in 14,714 jobs from three epidemiologic studies. We also linked exposure estimates for 258 agents in the job-exposure matrix CANJEM to the expert-assigned and SOCcer v2-assigned codes and compared those estimates using kappa and intraclass correlation coefficients (ICCs). Analyses were stratified by SOCcer score, by the score distance between the top two scoring codes from SOCcer, and by features from CANJEM.

RESULTS SOCcer v2's agreement at the 6-digit level was 50%, compared to 44% in v1, and was similar for the three studies (38%-45%). Overall agreement for v2 at the 2-, 3-, and 5-digit levels was 73%, 63%, and 56%, respectively. For v2, median ICCs for the probability and intensity metrics were 0.67 (IQR 0.59-0.74) and 0.56 (IQR 0.50-0.60), respectively. The agreement between the expert-assigned and SOCcer-assigned codes increased linearly with SOCcer score. The agreement also improved when the top two scoring codes had larger differences in score.

CONCLUSIONS Overall agreement with SOCcer v2 applied to job descriptions from North American epidemiologic studies was similar to the agreement usually observed between two experts. SOCcer's score predicted agreement with experts and can be used to prioritize jobs for expert review.
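As an illustration of the agreement measure described in the methods, the sketch below computes the share of jobs on which an expert code and an algorithm code match at a given digit level, and lists low-scoring jobs that might be sent for expert review. It is not the authors' code. It assumes SOC-2010 codes written as `NN-NNNN`, assumes that "n-digit agreement" means the first n digits match, and uses hypothetical names (`soc_prefix`, `agreement`, `flag_for_review`), an arbitrary review threshold, and made-up example jobs.

```python
from typing import List, Sequence


def soc_prefix(code: str, digits: int) -> str:
    """Return the first `digits` digits of a SOC code such as '29-1141'."""
    return code.replace("-", "")[:digits]


def agreement(expert: Sequence[str], predicted: Sequence[str], digits: int) -> float:
    """Fraction of jobs whose expert and predicted codes share the first `digits` digits."""
    if len(expert) != len(predicted):
        raise ValueError("expert and predicted must have the same length")
    if not expert:
        raise ValueError("at least one job is required")
    matches = sum(
        soc_prefix(e, digits) == soc_prefix(p, digits)
        for e, p in zip(expert, predicted)
    )
    return matches / len(expert)


def flag_for_review(scores: Sequence[float], threshold: float) -> List[int]:
    """Indices of jobs whose top score falls below a (hypothetical) review threshold."""
    return [i for i, s in enumerate(scores) if s < threshold]


# Made-up example data, for illustration only.
expert_codes = ["29-1141", "47-2111", "11-1021", "53-3032"]
soccer_codes = ["29-1141", "47-2152", "11-9199", "53-3033"]
top_scores = [0.9, 0.2, 0.1, 0.5]

for level in (2, 3, 5, 6):
    print(level, agreement(expert_codes, soccer_codes, level))

print(flag_for_review(top_scores, threshold=0.3))
```

Agreement can only fall or stay level as the digit level gets finer, which matches the pattern reported in the results (73% at 2 digits down to 50% at 6 digits). Any threshold for review would need to be chosen from the study's own score-versus-agreement data.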
