We investigate the problem of classifying individuals based on estimated density functions for each individual. Given labelled histograms characterizing red blood cells (RBCs) for different individuals, the learning problem is to build a classifier which can classify new unlabelled histograms into normal and iron deficient classes. Thus, the problem is similar to conventional classification in that there is labelled training data, but different in that the underlying measurements are not feature vectors but histograms or density estimates. We describe a general framework based on probabilistic hierarchical models for modelling such data and illustrate how the model lends itself to classification. We contrast this approach with two other alternatives: (1) directly defining distance between densities using a cross-entropy distance measure, and (2) using parameters of the estimated densities as feature vectors for a standard discriminative classification framework. We evaluate all three methods on a real-world data set consisting of 180 subjects. The hierarchical modeling and density-distance approaches are most accurate, yielding cross-validated error rates in the range of 1 to 2%. We conclude by discussing the relative merits of each approach, including the interpretability of each model from a clinical diagnostic viewpoint.
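To make the density-distance alternative more concrete, the following is a minimal sketch rather than the authors' method. It assumes the cross-entropy distance is taken to be the relative entropy (Kullback-Leibler divergence) between smoothed, normalized histograms, and that a new histogram receives the label of its nearest training histogram. The function names (`normalize`, `kl_divergence`, `nearest_label`), the smoothing constant, and the five-bin toy histograms are all hypothetical and are not taken from the paper's data.

```python
import numpy as np

def normalize(hist, eps=1e-9):
    """Turn raw bin counts into a probability vector.

    eps is an assumed smoothing constant that keeps empty bins
    from producing log(0) or division by zero.
    """
    h = np.asarray(hist, dtype=float) + eps
    return h / h.sum()

def kl_divergence(p, q):
    """Relative entropy KL(p || q) between two probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def nearest_label(query, train_hists, train_labels):
    """Label a query histogram by its closest training histogram (1-NN)."""
    q = normalize(query)
    dists = [kl_divergence(q, normalize(h)) for h in train_hists]
    return train_labels[int(np.argmin(dists))]

# Toy, made-up histograms over five volume bins (not real RBC data).
train_hists = [
    [1, 5, 10, 5, 1],
    [2, 6, 9, 4, 1],
    [8, 9, 4, 1, 0],
    [9, 8, 3, 2, 0],
]
train_labels = ["normal", "normal", "iron deficient", "iron deficient"]

predicted = nearest_label([7, 9, 5, 1, 0], train_hists, train_labels)
print(predicted)
```

Relative entropy is asymmetric, so a symmetrized variant (averaging both directions) would be an equally reasonable choice under these assumptions; the abstract does not say which form was used.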