Harnessing Unlabeled Examples through Iterative Application of Dynamic Markov Modeling
We describe the application of dynamic Markov modeling (DMC), a sequential bit-wise prediction technique, to labeling email corpora for the 2006 ECML/PKDD Discovery Challenge. Our technique involves: (1) converting the corpora's bag-of-words representation to a sequence of bits; (2) using logistic regression on the training data to induce an initial maximum-likelihood classifier; (3) combining all test sets into one; (4) ordering the combined set by decreasing magnitude of the log-likelihood ratio; (5) iteratively applying DMC to compute successive log-likelihood estimates; (6) averaging successive estimates to form an overall estimate; (7) partitioning the combined estimates into separate results for each test set. Post-hoc experiments showed that (a) the iterative process improved on the initial classifier in almost all cases, and (b) treating each test set separately yielded nearly indistinguishable results.
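The DMC predictor at the heart of this pipeline is a state machine whose edges count observed 0/1 transitions and which clones states as evidence accumulates, so that frequently traversed contexts get their own statistics. The following is a minimal Python sketch of that idea; the initial single self-looping state, the 0.2 smoothing counts, and the 2.0 cloning threshold are illustrative assumptions, not the paper's actual parameters.

```python
import math

# Minimal sketch of dynamic Markov modeling (DMC) as a bit-wise
# predictor, in the spirit of Cormack & Horspool's scheme. The
# initial model (one self-looping state), the 0.2 smoothing counts,
# and the 2.0 cloning threshold are illustrative choices.

class DMCModel:
    def __init__(self, clone_threshold=2.0):
        self.clone_threshold = clone_threshold
        root = {"count": [0.2, 0.2], "next": [None, None]}
        root["next"] = [root, root]  # start with a single self-loop state
        self.state = root

    def predict(self):
        """Probability that the next bit is 1, given the current state."""
        c0, c1 = self.state["count"]
        return c1 / (c0 + c1)

    def update(self, bit):
        """Follow the edge for `bit`, cloning the target state when the
        edge carries enough traffic to deserve its own context."""
        s = self.state
        nxt = s["next"][bit]
        edge = s["count"][bit]
        total = nxt["count"][0] + nxt["count"][1]
        if edge > self.clone_threshold and total - edge > self.clone_threshold:
            # Split the target: the clone inherits the share of the
            # target's counts that arrived via this edge.
            ratio = edge / total
            clone = {
                "count": [nxt["count"][0] * ratio, nxt["count"][1] * ratio],
                "next": list(nxt["next"]),
            }
            nxt["count"][0] -= clone["count"][0]
            nxt["count"][1] -= clone["count"][1]
            s["next"][bit] = clone
            nxt = clone
        s["count"][bit] += 1
        self.state = nxt

    def log_likelihood(self, bits):
        """Sum of log2 P(bit) over the sequence, updating as we go."""
        ll = 0.0
        for b in bits:
            p1 = self.predict()
            ll += math.log2(p1 if b == 1 else 1.0 - p1)
            self.update(b)
        return ll
```

Training one such model per class on labeled bit sequences and scoring a message by the difference of the two log-likelihoods gives the log-likelihood ratio used to order the combined test set; this is a toy sketch, not the tuned implementation evaluated in the paper.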