Conditional Random Fields for Word Hyphenation

Word hyphenation is an important problem with many practical applications. The problem is challenging because of the vast number of English words. We use linear-chain Conditional Random Fields (CRFs), which admit efficient learning and inference algorithms, to predict the hyphenation of English words that do not appear in the training dictionary. In this report, we are interested in finding 1) an efficient optimization technique for learning the linear-chain CRF model and 2) a good feature representation for word hyphenation. We compare the convergence time of three optimization techniques: 1) the Collins perceptron; 2) contrastive divergence; 3) limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS). We design two feature representations, 1) relative binary encoding (RBE) and 2) absolute binary encoding (ABE), and compare their performance. The experimental results show that the Collins perceptron is the most efficient method for training linear-chain CRFs, and that ABE is the better feature representation scheme, outperforming RBE by 7.9% in accuracy. We show that our design is reasonable by comparing it to the state of the art [2], which outperforms this work by only 4.66% in accuracy.
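To fix ideas before the details that follow, here is a minimal Python sketch of the Collins perceptron applied to a linear-chain model for hyphenation: each letter receives a binary label (a hyphen may follow it or not), Viterbi decoding finds the highest-scoring label sequence, and the perceptron update rewards features of the gold labeling and penalizes features of the current prediction. The character-window feature function, the weights, and the toy training word are illustrative assumptions only; they are not the RBE/ABE encodings studied in this report.

```python
import numpy as np

LABELS = [0, 1]  # 1 = a hyphen may legally follow this letter, 0 = no hyphen

def features(word, i, prev_label, label):
    """Hypothetical local features: a character window around position i,
    the current character, and the label transition (not the RBE/ABE
    encodings evaluated in the report)."""
    window = word[max(0, i - 2):i + 3]
    return [
        f"win={window}|y={label}",
        f"char={word[i]}|y={label}",
        f"trans={prev_label}->{label}",
    ]

def score(weights, word, i, prev_label, label):
    return sum(weights.get(f, 0.0) for f in features(word, i, prev_label, label))

def viterbi(weights, word):
    """Exact MAP decoding of the per-letter label sequence."""
    n = len(word)
    best = np.full((n, len(LABELS)), -np.inf)
    back = np.zeros((n, len(LABELS)), dtype=int)
    for y in LABELS:
        best[0, y] = score(weights, word, 0, -1, y)   # -1 marks the start state
    for i in range(1, n):
        for y in LABELS:
            for yp in LABELS:
                s = best[i - 1, yp] + score(weights, word, i, yp, y)
                if s > best[i, y]:
                    best[i, y], back[i, y] = s, yp
    y = int(np.argmax(best[-1]))
    path = [y]
    for i in range(n - 1, 0, -1):
        y = int(back[i, y])
        path.append(y)
    return path[::-1]

def collins_perceptron(data, epochs=20):
    """Structured-perceptron training: whenever the Viterbi prediction
    differs from the gold labeling, reward gold features and penalize
    predicted features."""
    weights = {}
    for _ in range(epochs):
        for word, gold in data:
            pred = viterbi(weights, word)
            if pred == gold:
                continue
            for i in range(len(word)):
                gp = gold[i - 1] if i > 0 else -1
                pp = pred[i - 1] if i > 0 else -1
                for f in features(word, i, gp, gold[i]):
                    weights[f] = weights.get(f, 0.0) + 1.0
                for f in features(word, i, pp, pred[i]):
                    weights[f] = weights.get(f, 0.0) - 1.0
    return weights

# Toy usage: "hy-phen-ation" has hyphens after the 'y' and the first 'n'.
train = [("hyphenation", [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])]
w = collins_perceptron(train)
print(viterbi(w, "hyphenation"))  # predicted 0/1 hyphen labels per letter
```

Because the update only requires a Viterbi decode rather than the marginals needed by gradient-based training, each perceptron pass is cheap, which is consistent with its favorable convergence time reported above.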