Word Embeddings vs Word Types for Sequence Labeling: the Curious Case of CV Parsing

We explore methods for improving Curriculum Vitae (CV) parsing for German documents by applying recent research on word embeddings in Natural Language Processing (NLP). Our approach integrates word embeddings as input features for a probabilistic sequence labeling model based on the Conditional Random Field (CRF) framework. The best-performing word embeddings are generated from a large sample of German CVs. The best results on the extraction task are obtained by the model that combines the word embeddings with a number of hand-crafted features. The improvements are consistent across different sections of the target documents, and the effect of the word embeddings is strongest on semi-structured, out-of-sample data.
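As a minimal sketch of the feature setup described above, the snippet below combines hand-crafted token features with the dimensions of a word-embedding vector in a single feature dictionary, the usual input format for CRF toolkits. The embedding table, feature names, and tokens here are illustrative assumptions, not the paper's actual configuration; a real system would look up vectors trained with word2vec on the CV corpus.

```python
# Hypothetical sketch: merging hand-crafted features with word-embedding
# dimensions as per-token CRF features. The toy 3-dimensional embedding
# table below stands in for vectors trained on a German CV corpus.
EMBEDDINGS = {
    "Projektleiter": [0.41, -0.12, 0.88],
    "2010": [-0.05, 0.67, 0.10],
}

def token_features(tokens, i):
    """Build a feature dict for token i: discrete hand-crafted flags
    plus one real-valued feature per embedding dimension."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "word.istitle": word.istitle(),
    }
    vec = EMBEDDINGS.get(word)
    if vec is not None:
        for d, value in enumerate(vec):
            feats[f"emb.{d}"] = value   # continuous embedding features
    else:
        feats["emb.oov"] = True         # out-of-vocabulary indicator
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()  # local context
    return feats

tokens = ["Projektleiter", "2010"]
X = [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence of such dictionaries can be passed directly to a CRF implementation such as `sklearn-crfsuite`, which accepts per-token mappings of feature names to boolean, string, or float values.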
