Cross-entropy vs. squared error training: a theoretical and experimental comparison

In this paper we investigate the error criteria that are optimized during the training of artificial neural networks (ANNs). We compare the bounds of the squared error (SE) and the cross-entropy (CE) criteria, the two most popular choices in state-of-the-art implementations. The evaluation is performed on automatic speech recognition (ASR) and handwriting recognition (HWR) tasks using a hybrid HMM-ANN model. We find that with randomly initialized weights, the SE-trained ANN does not converge to a good local optimum. With a good initialization by pre-training, however, the word error rate of our best CE-trained system could be reduced from 30.9% to 30.5% on the ASR task, and from 22.7% to 21.9% on the HWR task, by performing a few additional “fine-tuning” iterations with the SE criterion.
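For reference, these are the standard per-sample definitions of the two criteria for a K-class network with outputs y_k and one-hot targets t_k (a textbook formulation, not taken from this paper's notation):

$$
F_{\mathrm{CE}} = -\sum_{k=1}^{K} t_k \log y_k,
\qquad
F_{\mathrm{SE}} = \sum_{k=1}^{K} \left( y_k - t_k \right)^2 .
$$

With softmax outputs, the CE gradient with respect to the pre-softmax activations simplifies to $y_k - t_k$, whereas the SE gradient carries an additional Jacobian factor of the output nonlinearity that can vanish when the outputs saturate, even for confidently wrong predictions; this is the usual explanation for why SE training converges poorly from random initializations.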
