AL4LA: Active Learning for Text Labeling Based on Paragraph Vectors

Despite the huge amount of digitized information available today, the main obstacle to using machine learning in text mining is the lack of labeled data, mainly because labeling requires considerable user effort that is not always feasible. In this paper, with the aim of reducing the workload required to manually process the contents of large volumes of documents, we present a methodology based on probabilistic inference and active learning to label documents in Spanish using a semi-supervised approach. First, a vector representation of the documents is generated; then, an interactive learning process that applies both automatic and manual labeling is proposed. To evaluate the accuracy of the predictions and the efficiency of the methodology, different configurations of the automatic and manual labeling processes have been studied. The proposed methodology reduces the need for a large corpus of manually labeled texts by introducing a self-labeling process during training. We show that both labeling approaches can be combined, maintaining accuracy while reducing user intervention.
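
The interplay of automatic and manual labeling can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the document vectors are assumed to be precomputed (for instance by a paragraph-vector model), a nearest-centroid classifier with a softmax over distances stands in for the probabilistic model, and the names `label_pool`, `oracle`, and `auto_threshold`, as well as the threshold value, are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def predict_proba(x, centroids):
    """Pseudo-probabilities: softmax over negative Euclidean distances.

    Stand-in for a real probabilistic classifier (an assumption of this sketch).
    """
    scores = {y: math.exp(-math.dist(x, c)) for y, c in centroids.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


def label_pool(labeled, pool, oracle, auto_threshold=0.8):
    """Label every vector in `pool`, mixing self-labeling and user queries.

    labeled: list of (vector, label) seed examples
    pool: list of unlabeled vectors
    oracle: callable vector -> label, standing in for the human annotator
    auto_threshold: hypothetical confidence needed to accept a prediction
    Returns the grown labeled set and the number of manual queries made.
    """
    labeled = list(labeled)
    pool = list(pool)
    manual_queries = 0
    while pool:
        # Retrain: recompute one centroid per class from all labels so far.
        by_label = {}
        for v, y in labeled:
            by_label.setdefault(y, []).append(v)
        centroids = {y: centroid(vs) for y, vs in by_label.items()}

        # Score each unlabeled vector with its most probable class.
        scored = []
        for v in pool:
            proba = predict_proba(v, centroids)
            best = max(proba, key=proba.get)
            scored.append((proba[best], best, v))

        # Automatic labeling: accept confident predictions as labels.
        uncertain = []
        for conf, y, v in scored:
            if conf >= auto_threshold:
                labeled.append((v, y))
            else:
                uncertain.append((conf, v))

        # Manual labeling: ask the user about the least confident vector.
        if uncertain:
            uncertain.sort(key=lambda t: t[0])
            _, v = uncertain.pop(0)
            labeled.append((v, oracle(v)))
            manual_queries += 1

        # Everything still uncertain is rescored in the next round.
        pool = [v for _, v in uncertain]
    return labeled, manual_queries
```

Raising `auto_threshold` in this sketch shifts work toward the user and lowering it shifts work toward self-labeling, which is the kind of trade-off between accuracy and user intervention that the different configurations explore.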
