Supplementary: Hierarchical Video Prediction using Relational Layouts for Human-Object Interactions

Creating object maps. The output of the relational layout generation phase (i.e., stage 1) is a sequence of objects and poses. The pose outputs are 2D maps of shape 128 × 128 × N_kp at each timestep, whereas the object outputs are 1 × 4 box vectors, one per object at each timestep. After the first stage, the objects are converted into 2D maps before being used as inputs to the second stage for video generation. We do this by first initializing a tensor of shape 128 × 128 × d_o with zeros. Next, each object is mapped into this tensor by setting the channel corresponding to its class to 1 in the region occupied by its bounding box. Performing this operation for all objects yields the 2D object map for each timestep; a sketch of this conversion is given below.

Additional architecture details. The pose encoder encodes the pose before it is used as an input to the RNN. It is a convolutional encoder with 8 filters in the first layer; the number of filters doubles after every convolutional layer, and the final layer is a fully connected layer with 64 output dimensions. Similarly, the box is encoded before being used as an input to the RNN: the box encoder is a two-layer multilayer perceptron with output dimensions 8 and 16. The pose decoder is a convolutional decoder that starts with 256 filters and halves the number of filters after every convolutional layer. For the second stage of video generation, we use a pix2pixHD architecture with 24 filters. All discriminators use spectral normalization and 64 filters in the first convolutional layer. Sketches of these modules follow.
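To make the conversion concrete, here is a minimal NumPy sketch of the box-to-map rasterization described above. The function name, the normalized [x1, y1, x2, y2] box layout, and the integer class-id representation are assumptions for illustration; the text only specifies that each box is 1 × 4 and that the class channel is set to 1 inside the box region.

```python
import numpy as np

def boxes_to_map(boxes, class_ids, num_classes, size=128):
    """Rasterize one timestep's boxes into a 2D one-hot class map.

    boxes:     (N, 4) array of normalized [x1, y1, x2, y2] corners
               (assumed layout; the paper only states the boxes are 1x4).
    class_ids: (N,) integer class indices in [0, num_classes).
    Returns a (size, size, num_classes) tensor with 1s inside each box
    in the channel of its class, i.e. the 128 x 128 x d_o map above.
    """
    obj_map = np.zeros((size, size, num_classes), dtype=np.float32)
    for (x1, y1, x2, y2), cls in zip(boxes, class_ids):
        # Convert normalized corners to integer pixel bounds.
        x_lo, y_lo = int(x1 * size), int(y1 * size)
        x_hi, y_hi = int(np.ceil(x2 * size)), int(np.ceil(y2 * size))
        # Set the class channel to 1 over the box region (rows = y, cols = x).
        obj_map[y_lo:y_hi, x_lo:x_hi, cls] = 1.0
    return obj_map
```

Applying this function at every timestep and stacking the results along the time axis yields the object-map sequence consumed by the second stage.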
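The module descriptions above translate directly into standard layers. The following PyTorch sketch shows one possible realization of the pose encoder, box encoder, and pose decoder. Only the filter progressions (8 doubling, 8/16, 256 halving) and the 64-dimensional encoder output come from the text; layer counts, kernel sizes, strides, and activations are assumptions.

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Convolutional encoder: 8 filters, doubled per layer, then FC to 64-d."""
    def __init__(self, in_channels, num_layers=4):  # num_layers is assumed
        super().__init__()
        layers, ch = [], in_channels
        for i in range(num_layers):
            out_ch = 8 * (2 ** i)                 # 8 -> 16 -> 32 -> 64 filters
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            ch = out_ch
        self.conv = nn.Sequential(*layers)
        spatial = 128 // (2 ** num_layers)        # resolution left after striding
        self.fc = nn.Linear(ch * spatial * spatial, 64)

    def forward(self, pose_map):                  # (B, C, 128, 128)
        h = self.conv(pose_map)
        return self.fc(h.flatten(1))              # (B, 64)

class BoxEncoder(nn.Module):
    """Two-layer MLP with output dimensions 8 and 16."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(inplace=True),
                                 nn.Linear(8, 16))

    def forward(self, box):                       # (B, 4)
        return self.mlp(box)                      # (B, 16)

class PoseDecoder(nn.Module):
    """Deconvolutional decoder: 256 filters, halved per layer."""
    def __init__(self, in_dim, out_channels, num_layers=4):  # depth assumed
        super().__init__()
        self.spatial = 128 // (2 ** num_layers)
        self.fc = nn.Linear(in_dim, 256 * self.spatial * self.spatial)
        layers, ch = [], 256
        for _ in range(num_layers - 1):
            layers += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            ch //= 2                              # 256 -> 128 -> 64 -> 32
        layers += [nn.ConvTranspose2d(ch, out_channels, 4, stride=2, padding=1)]
        self.deconv = nn.Sequential(*layers)

    def forward(self, z):                         # (B, in_dim)
        h = self.fc(z).view(z.size(0), 256, self.spatial, self.spatial)
        return self.deconv(h)                     # (B, out_channels, 128, 128)
```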
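For the discriminators, the only stated details are spectral normalization and 64 filters in the first convolutional layer. A minimal first block using PyTorch's built-in spectral_norm could look as follows; the input channel count and the LeakyReLU activation are assumptions.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# First discriminator block: spectral normalization and 64 filters,
# as stated above. Input channels (3) and activation are assumed.
disc_head = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
)
```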
2. Datasets and Pre-processing