Machine-Learning Algorithms to Code Public Health Spending Accounts

Objectives: Government public health expenditure data sets require time- and labor-intensive manipulation to summarize results that public health policy makers can use. Our objective was to compare the performance of machine-learning algorithms with manual classification of public health expenditures to determine whether machines could provide a faster, cheaper alternative to manual classification.

Methods: We used machine-learning algorithms to replicate the process of manually classifying state public health expenditures, using the standardized public health spending categories from the Foundational Public Health Services model and a large data set from the US Census Bureau. We obtained a data set of 1.9 million individual expenditure items from 2000 to 2013. We collapsed these data into 147 280 summary expenditure records, and we followed a standardized method of manually classifying each expenditure record as public health, maybe public health, or not public health. We then trained 9 machine-learning algorithms to replicate the manual process. We calculated recall, precision, and coverage rates to measure the performance of individual and ensembled algorithms.

Results: Compared with manual classification, the machine-learning random forests algorithm produced 84% recall and 91% precision. With algorithm ensembling, we achieved our target criterion of 90% recall by using a consensus ensemble of ≥6 algorithms while still retaining 93% coverage, leaving only 7% of the summary expenditure records unclassified.

Conclusions: Machine learning can be a time- and cost-saving tool for estimating public health spending in the United States. It can be used with standardized public health spending categories based on the Foundational Public Health Services model to help parse public health expenditure information from other types of health-related spending, provide data that are more comparable across public health organizations, and evaluate the impact of evidence-based public health resource allocation.
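To make the consensus-ensemble idea and the three reported metrics concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give exact definitions, so the sketch assumes that a record is classified only when at least `min_agree` of the algorithms vote for the same label, that coverage is the share of records that receive a label, and that recall and precision are computed for the "public health" class over the classified records only. The function names `consensus_label` and `evaluate` are hypothetical.

```python
from collections import Counter


def consensus_label(votes, min_agree=6):
    """Return the label that at least `min_agree` algorithms agree on, else None.

    With 9 voters and min_agree >= 5, at most one label can reach the
    threshold, so taking the single most common label is unambiguous.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


def evaluate(predictions, truth, positive="public health", min_agree=6):
    """Compute coverage, recall, and precision for a consensus ensemble.

    Assumption (not stated in the abstract): recall and precision are
    measured only on records the ensemble was able to classify.
    """
    ensembled = [consensus_label(votes, min_agree) for votes in predictions]
    classified = [(p, t) for p, t in zip(ensembled, truth) if p is not None]

    coverage = len(classified) / len(truth)
    tp = sum(1 for p, t in classified if p == positive and t == positive)
    fp = sum(1 for p, t in classified if p == positive and t != positive)
    fn = sum(1 for p, t in classified if p != positive and t == positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"coverage": coverage, "recall": recall, "precision": precision}


# Toy data: 4 records, each with votes from 9 hypothetical algorithms.
PH, NOT = "public health", "not public health"
toy_votes = [
    [PH] * 7 + [NOT] * 2,  # consensus: public health
    [PH] * 5 + [NOT] * 4,  # no consensus: left unclassified
    [NOT] * 6 + [PH] * 3,  # consensus: not public health
    [PH] * 9,              # consensus: public health
]
toy_truth = [PH, PH, PH, NOT]
toy_metrics = evaluate(toy_votes, toy_truth)
```

Raising `min_agree` in this sketch leaves more records unclassified, so coverage can only fall; the remaining records are those the algorithms agree on, which typically raises recall and precision among them. That is the kind of trade-off the results describe between a stricter consensus threshold and coverage.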