Tree Structured Multimedia Signal Modeling

Current solutions to multimedia modeling tasks rely on sequential models and static tree-structured models. Sequential models, especially those based on Bidirectional LSTM (BLSTM) and Multilayer LSTM networks, have been widely applied to video, sound, music and text corpora. Despite achieving state-of-the-art results on several multimedia processing tasks, sequential models often fail to emphasize short-term dependency relations, which are crucial in most sequential multimedia data. Tree-structured models can overcome this defect. The static tree-structured LSTM presented by Tai et al. (Tai, Socher, and Manning 2015) forcibly breaks the dependencies between the elements in each semantic group and those outside the group, while preserving chain dependencies among semantic groups and among nodes in the same group. Although the tree-LSTM network represents the dependency structure of multimedia data better, it requires the dependency relations of the input to be known before the input is fed into the network. This is hard to achieve because, for most types of multimedia data, no parser exists that can detect the dependency structure of every input sequence accurately enough. To preserve dependency information without requiring a perfect parser, in this paper we present a novel neural network architecture that 1) is self-expandable and 2) maintains the layered dependency structure of incoming multimedia data. We call this architecture the Seq2Tree network. A Seq2Tree model is applicable to classification, prediction and generation tasks with task-specific adjustments. Our experiments show that the Seq2Tree model performs well on all three types of tasks.

Figure 1: A tree-structured model with three layers. Hidden states of lower-level nodes are inherited from parent nodes.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org).
All rights reserved.

Introduction

Multimedia signal modeling is the basis of multimedia signal processing tasks across a wide range of research fields such as Computer Vision, Natural Language Processing, and Sound Signal Processing. However, research on sequential data remains underdeveloped because sequential data is flexible in structure. In this paper, we aim to tackle the problems that static neural network solutions have in modeling sequential signals, using a dynamically self-adjustable neural network architecture.

In sequential data, each unit contributes to the prediction of all the units that follow it, so traditional sequential models such as the Hidden Markov Model (HMM) and Recurrent Neural Networks (RNN) are the first choice for processing this kind of data. Because it alleviates the vanishing and exploding gradient problems, the Long Short-Term Memory (LSTM) network has become very popular in sequential data processing tasks. The Seq2seq network (Sutskever, Vinyals, and Le 2014), one of the best-known applications of the LSTM network, has attracted much attention in machine translation research (Luong et al. 2015). Its use has also been extended to other tasks such as speech-to-text conversion (Zhang, Chan, and Jaitly 2017). Moreover, HMMs achieve high performance in music style classification, especially when differentiating composer characteristics (Buzzanca 2002; Chai and Vercoe 2001).

However, simple sequential models over single data points can misrepresent complex multimedia signals, so several variants of the sequential models have been introduced. The Bidirectional LSTM (BLSTM) network (Schuster and Paliwal 1997), for example, combines two LSTM networks, one reading the input forwards and the other reading it backwards. Thanks to the backward LSTM, a BLSTM is able to anticipate possible boundaries of upcoming unit patterns.
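The forward and backward reading described above can be illustrated with a minimal Python sketch. It is an assumption-laden toy: a scalar tanh recurrent cell stands in for a full LSTM cell, and the names `rnn_pass`, `bidirectional_pass`, `w_in` and `w_rec` are hypothetical, not taken from the cited work. It only shows how each position ends up with one state that summarizes the past and one that summarizes the future.

```python
import math

def rnn_pass(xs, w_in=0.5, w_rec=0.5):
    """Run a toy scalar tanh recurrent cell over xs and return every hidden state.

    This is a stand-in for an LSTM cell; the weights are arbitrary illustrative values.
    """
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_pass(xs):
    """Pair each position's forward state with its backward state."""
    forward = rnn_pass(xs)
    # Read the reversed input, then reverse the states so index i refers to xs[i].
    backward = rnn_pass(xs[::-1])[::-1]
    return list(zip(forward, backward))

states = bidirectional_pass([1.0, 0.0, -1.0])
```

In this sketch the backward state at the first position depends on every later input, which is the sense in which the backward network can anticipate upcoming patterns.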
The current state-of-the-art result in the speech and noise separation task, as reported for the 2nd CHiME challenge (Vincent et al. 2013), is achieved by a BLSTM-based system (Erdogan et al. 2015). Multilayer LSTM is another popular variant of the original LSTM network. This architecture allows different features to sit on different layers of the network, so that a sequence can be divided into patterns based on learnt boundary characteristics. For example, a three-layer Seq2seq model achieves over 95% accuracy in the evaluation of constituent parsers (Vinyals et al. 2015). Nevertheless, both BLSTM and Multilayer LSTM are additive combinations of multiple original LSTM networks, which easily lose the geometric information of the units. This undermines the performance of systems based on BLSTM and Multilayer LSTM.

After carefully examining multiple multimedia data samples, we found that most multimedia signals share two characteristics: 1) dependencies are stronger within each meaningful unit group than across groups, and 2) meaningful groups form a chain-structured dependency path. These characteristics of temporally successive multimedia data lead us naturally to a tree-structured representation that 1) expands along one direction, 2) branches only when a pattern starts, and 3) ends a branch and continues expanding on the higher level when it reaches the end of the current pattern. We call the meaningful unit groups segments. By placing the units of the same semantic group in the same subtree, this tree structure lets us model multimedia signals in terms of meaningful segments rather than single units. No existing model can bound segments of flexible length. To build such a tree structure from sequential input, our self-expandable tree model finds the boundaries of segments by branching at the proper positions. We call this novel neural network architecture the Seq2Tree network.
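The three structural rules above (expand along one direction, branch when a pattern starts, close the branch and continue one level up when the pattern ends) can be illustrated with a short Python sketch. It is not the Seq2Tree network itself: here the segment boundaries are supplied by hand through the hypothetical `starts` and `ends` flags, whereas the model described in this paper is meant to find them on its own. The function name `build_tree` is likewise an assumption made for illustration.

```python
def build_tree(units, starts, ends):
    """Group a flat sequence into nested lists, one sublist per segment.

    units:  the input sequence
    starts: starts[i] is True if a new segment begins at units[i] (branch down)
    ends:   ends[i] is True if the current segment finishes at units[i] (return up)
    """
    root = []
    stack = [root]  # stack[-1] is the level currently being extended
    for unit, start, end in zip(units, starts, ends):
        if start:
            branch = []
            stack[-1].append(branch)  # attach the new branch to the current level
            stack.append(branch)      # and continue expanding inside it
        stack[-1].append(unit)
        if end and len(stack) > 1:
            stack.pop()               # segment finished: resume on the higher level
    return root

units = ["a", "b", "c", "d", "e"]
starts = [False, True, False, False, True]
ends = [False, False, True, False, True]
tree = build_tree(units, starts, ends)
# tree == ["a", ["b", "c"], "d", ["e"]]
```

Units of the same segment end up in the same subtree, while the top level keeps the chain order of the segments.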
For generality, we fit our tree model to three types of tasks, namely classification, prediction and generation, with the necessary modifications to the network. We designed experiments to demonstrate the correctness of the tree-structured models built by the Seq2Tree network and their advantage over traditional LSTM-based models. Seq2Tree networks were introduced by Ma et al. (2018). The structure has also been used in AI tasks such as signal processing (Ma et al. 2017).

Multimedia Signal Modeling

Among all types of multimedia signals, we concentrate on temporally successive signals. More specifically, in this paper we focus on text, video, music and sound signals. We study three main types of tasks: classification, prediction and generation.

Classification Model

Figure 2: A tree-structured classification model. Information from every top-level node is summarized into the top node. The hidden state of the top node can be fed into a classifier.

As shown in Figure 2, the classification model has one root node above the entire tree structure. The root node incorporates the hidden states of all top-level nodes in the original tree structure. We build a softmax classifier by adding a softmax layer on top of the root node:

$$p(y \mid h_{top}) = \mathrm{softmax}\left(U^{(c)} h_{top} + b\right)$$
$$y_{top} = \arg\max_{y} \, p(y \mid h_{top})$$

where $U^{(c)}$ is the classification matrix, $b$ is the bias and $h_{top}$ is the hidden state at the top of the tree. The cost function we choose here is the cross-entropy loss of the predicted label $y$:
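As a hedged illustration of the classifier defined above, the following Python sketch computes the softmax distribution over the root hidden state, the predicted label, and a cross-entropy loss. The loss is written in its standard textbook form, the negative log-probability of the gold label, which is an assumption and not necessarily the exact cost function used in this paper. The function names and the toy values of `U`, `b` and `h_top` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(U, b, h_top):
    """Return p(y | h_top) = softmax(U h_top + b) and the argmax label."""
    scores = [sum(u * h for u, h in zip(row, h_top)) + bias
              for row, bias in zip(U, b)]
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs, label

def cross_entropy(probs, gold):
    """Standard cross-entropy for one example: -log p(gold label). Assumed form."""
    return -math.log(probs[gold])

# Toy values: three classes, a two-dimensional root hidden state.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
h_top = [2.0, 0.5]
probs, label = classify(U, b, h_top)
loss = cross_entropy(probs, gold=0)
```

With these toy values the class scores are 2.0, 0.5 and -2.5, so the first class receives the highest probability and the loss is smallest when the gold label is that class.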

[1] Barry Vercoe, et al. Folk Music Classification Using Hidden Markov Models, 2001.

[2] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[3] G. Buzzanca. A Supervised Learning Approach to Musical Style Recognition, 2002.

[4] Jonathan Le Roux, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Xiang Li, et al. Leveraging Dependency Regularization for Event Extraction, 2016, FLAIRS Conference.

[6] Yu Zhang, et al. Very deep convolutional networks for end-to-end speech recognition, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Xiang Li, et al. Improving Event Detection with Dependency Regularization, 2015, RANLP.

[8] Geoffrey E. Hinton, et al. Grammar as a Foreign Language, 2014, NIPS.

[9] Weicheng Ma, et al. Sound Signal Processing Based on Seq2Tree Network, 2017.

[10] Xiang Li, et al. Seq2Tree: A Tree-Structured Extension of LSTM Network, 2017.

[11] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[12] Christopher D. Manning, et al. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, 2015, ACL.

[13] Ralph Grishman, et al. Improving Event Detection with Abstract Meaning Representation, 2015.

[14] Jaime G. Carbonell, et al. Generation from Abstract Meaning Representation using Tree Transducers, 2016, NAACL.

[15] Kuldip K. Paliwal, et al. Bidirectional recurrent neural networks, 1997, IEEE Trans. Signal Process.

[16] Jon Barker, et al. The second ‘chime’ speech separation and recognition challenge: Datasets, tasks and baselines, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17] Quoc V. Le, et al. Multi-task Sequence to Sequence Learning, 2015, ICLR.