DeepTC - An Extension of DKPro Text Classification for Fostering Reproducibility of Deep Learning Experiments

We present a deep learning extension for the multi-purpose text classification framework DKPro Text Classification (DKPro TC). DKPro TC is a flexible framework for creating easily shareable and reproducible end-to-end NLP experiments involving machine learning. We provide an overview of the current state of DKPro TC, which does not yet support the integration of deep learning, and discuss the necessary conceptual extensions. These extensions are grounded in an analysis of deep learning setups commonly found in the literature and cover all common text classification problem types, i.e. single-outcome, multi-outcome, and sequence classification. In addition to providing an end-to-end shareable environment for deep learning experiments, we provide convenience features that take care of repetitive steps, such as pre-processing, data vectorization, and pruning of embeddings. Moving a large part of this boilerplate code into DKPro TC makes the actual deep learning framework code more readable and considerably reduces the amount of redundant source code. As a proof of concept, we integrate Keras, DyNet, and DeepLearning4J.
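
To make the convenience features concrete, the following is a minimal sketch of the kind of boilerplate the abstract says DeepTC absorbs: collecting the task vocabulary and pruning a pre-trained embedding file down to it so that only vectors for words actually occurring in the data are loaded. This is an illustrative Python sketch, not the DeepTC API; the file names and the one-vector-per-line text format (word followed by float values, as in GloVe) are assumptions.

    # Illustrative sketch of the "embedding pruning" step that DeepTC
    # automates. All names and file formats here are assumptions, not
    # part of the actual DeepTC implementation.

    def load_vocabulary(corpus_path):
        """Collect the set of whitespace-separated tokens in the corpus."""
        vocab = set()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                vocab.update(line.split())
        return vocab

    def prune_embeddings(embedding_path, vocab, output_path):
        """Copy only those embedding rows whose word occurs in the vocabulary."""
        kept = 0
        with open(embedding_path, encoding="utf-8") as src, \
             open(output_path, "w", encoding="utf-8") as dst:
            for line in src:
                word = line.split(" ", 1)[0]
                if word in vocab:
                    dst.write(line)
                    kept += 1
        return kept

    if __name__ == "__main__":
        vocab = load_vocabulary("train.txt")          # hypothetical training data
        n = prune_embeddings("glove.6B.100d.txt",     # hypothetical embedding file
                             vocab, "glove.pruned.txt")
        print(f"kept {n} vectors for {len(vocab)} vocabulary entries")

Doing this by hand for every experiment is exactly the kind of redundant source code the framework is meant to eliminate; in DeepTC it happens inside the pipeline, so the user-supplied deep learning code contains only the model definition.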
