Input/Output Deep Architecture for Structured Output Problems

Pre-training of input layers has been shown to be effective for learning deep architectures, mitigating the vanishing gradient problem. In this paper, we propose to extend pre-training to the output layers in order to address structured output problems, which are characterized by internal dependencies between the outputs (e.g. the classes of pixels in an image labeling problem). Whereas the output structure is generally modeled using graphical models, we propose a fully neural model called IODA (Input Output Deep Architecture) that learns both input and output dependencies. We apply IODA to facial landmark detection, where the output is a strongly structured regression vector. We evaluate IODA on two challenging public datasets, LFPW and HELEN, and show that it outperforms the traditional pre-training approach.
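To make the idea of pre-training both ends of the network concrete, the following is a minimal sketch and not the authors' implementation. It assumes single-hidden-layer tanh autoencoders, a randomly initialized link layer between the two pre-trained parts, plain full-batch gradient descent, and synthetic data standing in for images and landmark vectors; the abstract specifies none of these details. The names `train_autoencoder`, `build_ioda`, `forward`, `finetune` and `loss` are hypothetical.

```python
import numpy as np

def train_autoencoder(data, n_hidden, epochs=200, lr=0.05, seed=0):
    """Fit a one-hidden-layer tanh autoencoder by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    W_enc, b_enc = rng.normal(0, 0.1, (d, n_hidden)), np.zeros(n_hidden)
    W_dec, b_dec = rng.normal(0, 0.1, (n_hidden, d)), np.zeros(d)
    for _ in range(epochs):
        h = np.tanh(data @ W_enc + b_enc)
        err = (h @ W_dec + b_dec - data) / n        # gradient of the loss w.r.t. the reconstruction
        grad_h = (err @ W_dec.T) * (1.0 - h ** 2)   # backpropagate through tanh
        W_dec -= lr * (h.T @ err)
        b_dec -= lr * err.sum(axis=0)
        W_enc -= lr * (data.T @ grad_h)
        b_enc -= lr * grad_h.sum(axis=0)
    return W_enc, b_enc, W_dec, b_dec

def build_ioda(X, Y, n_hid_in=8, n_hid_out=4, seed=0):
    """Assemble the network from an input encoder and an output decoder."""
    rng = np.random.default_rng(seed)
    # Input pre-training: keep the ENCODER of an autoencoder trained on X.
    W_in, b_in, _, _ = train_autoencoder(X, n_hid_in, seed=seed)
    # Output pre-training: keep the DECODER of an autoencoder trained on Y,
    # so the last layer starts out knowing the dependencies between outputs.
    _, _, W_out, b_out = train_autoencoder(Y, n_hid_out, seed=seed + 1)
    # Assumption: the layer linking the two parts is randomly initialized.
    W_link, b_link = rng.normal(0, 0.1, (n_hid_in, n_hid_out)), np.zeros(n_hid_out)
    return [W_in, b_in, W_link, b_link, W_out, b_out]

def forward(params, X):
    W_in, b_in, W_link, b_link, W_out, b_out = params
    h1 = np.tanh(X @ W_in + b_in)
    h2 = np.tanh(h1 @ W_link + b_link)
    return h1, h2, h2 @ W_out + b_out

def loss(params, X, Y):
    pred = forward(params, X)[2]
    return 0.5 * np.mean(np.sum((pred - Y) ** 2, axis=1))

def finetune(params, X, Y, epochs=300, lr=0.05):
    """Supervised fine-tuning of all layers with backpropagation."""
    n = X.shape[0]
    for _ in range(epochs):
        W_in, b_in, W_link, b_link, W_out, b_out = params
        h1, h2, pred = forward(params, X)
        err = (pred - Y) / n
        g2 = (err @ W_out.T) * (1.0 - h2 ** 2)
        g1 = (g2 @ W_link.T) * (1.0 - h1 ** 2)
        grads = [X.T @ g1, g1.sum(axis=0), h1.T @ g2, g2.sum(axis=0),
                 h2.T @ err, err.sum(axis=0)]
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Synthetic stand-in data: Y lies on a 3-dimensional manifold inside a
# 6-dimensional output space, mimicking strongly dependent outputs.
rng = np.random.default_rng(42)
X = rng.normal(size=(64, 10))
A = 0.5 * rng.normal(size=(10, 3))
B = 0.5 * rng.normal(size=(3, 6))
Y = np.tanh(X @ A) @ B

params = build_ioda(X, Y)
loss_before = loss(params, X, Y)
params = finetune(params, X, Y)
loss_after = loss(params, X, Y)
print(f"loss before fine-tuning: {loss_before:.4f}, after: {loss_after:.4f}")
```

The point of the sketch is the asymmetry in `build_ioda`: the input side reuses an encoder, while the output side reuses a decoder, so the final layer maps a compact code back to a full output vector that respects the structure seen in the training targets.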
