Machine learning for subpopulation assessment: evaluating the C-section rates of different physician practices

We apply machine learning to the problem of subpopulation assessment for Caesarean section (C-section). In subpopulation assessment, we are interested in making predictions not for a single patient, but for groups of patients. Typically, in any large population, different subpopulations will have different "outcome" rates. In our example, the C-section rate of a population of 22,176 expectant mothers is 16.8%; yet the 17 physician groups that serve this population have widely differing group C-section rates, ranging from 11% to 23%. The ultimate goal of subpopulation assessment is to determine whether these variations in the observed rates can be attributed to (a) variations in the intrinsic risk of the patient subpopulations (i.e., some groups contain more "high-risk C-section" patients), or (b) differences in physician practice (i.e., some groups perform more C-sections on patients of comparable risk). Our results indicate that although there is some variation in intrinsic risk, there is also substantial variation in physician practice.
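
The distinction between (a) and (b) can be illustrated with a minimal sketch. It is not the method of this paper, which the abstract does not specify. It assumes that some risk model has already assigned each patient a predicted C-section probability, and that a group's expected rate is the mean of those probabilities. The function name `assess_groups`, the record layout, and the toy records are hypothetical and purely synthetic; they are not drawn from the study population.

```python
from collections import defaultdict


def assess_groups(records):
    """Summarize observed vs. expected C-section rates per physician group.

    records: iterable of (group_id, predicted_risk, had_csection), where
    predicted_risk is a probability from some risk model (assumed given).
    """
    # Per group: [patient count, sum of predicted risks, observed C-sections]
    totals = defaultdict(lambda: [0, 0.0, 0])
    for group, risk, outcome in records:
        entry = totals[group]
        entry[0] += 1
        entry[1] += risk
        entry[2] += int(outcome)

    summary = {}
    for group, (n, risk_sum, observed) in totals.items():
        observed_rate = observed / n
        expected_rate = risk_sum / n  # rate implied by intrinsic risk alone
        summary[group] = {
            "n": n,
            "observed_rate": observed_rate,
            "expected_rate": expected_rate,
            # Positive excess: more C-sections than patient risk would suggest.
            "excess": observed_rate - expected_rate,
        }
    return summary


# Synthetic toy data (hypothetical): two groups with identical risk profiles.
toy_records = [
    ("A", 0.10, 0), ("A", 0.20, 0), ("A", 0.30, 1), ("A", 0.20, 0),
    ("B", 0.10, 1), ("B", 0.20, 0), ("B", 0.30, 1), ("B", 0.20, 0),
]

result = assess_groups(toy_records)
for group in sorted(result):
    print(group, result[group])
```

In this toy setting both groups have the same expected rate, so the gap between their observed rates would point to practice rather than patient mix. With real data, the comparison would also need to account for sampling variability in each group's observed rate.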