Paper Information - MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text

Videos for mobile devices have recently become the most popular way to share and acquire information. To make creation more convenient for users, in this paper we present MobileVidFactory, a system that automatically generates vertical mobile videos for which users mainly need to provide only simple text. Our system consists of two parts: basic generation and customized generation. In basic generation, we take advantage of a pretrained image diffusion model and adapt it into a high-quality open-domain vertical video generator for mobile devices. For audio, our system retrieves a suitable background sound for the video from our large database. Additionally, to produce customized content, our system allows users to add specified on-screen text to the video to enrich its visual expression, and to specify text to be read aloud automatically in a voice of their choice.
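The abstract does not say how the background sound is matched to the video. As a minimal sketch only, one common approach to text-to-audio retrieval is a nearest-neighbor search over embeddings using cosine similarity. Everything below is a hypothetical illustration, not the authors' method: the function names, the clip names, and the hand-made 3-dimensional vectors are assumptions, and a real system would obtain embeddings from learned text and audio encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_audio(query_vec, audio_db):
    """Return the name of the clip whose embedding is most similar to the query."""
    return max(audio_db, key=lambda name: cosine(query_vec, audio_db[name]))


# Hypothetical toy database: clip name -> hand-made embedding (assumption, not real data).
AUDIO_DB = {
    "ocean_waves.wav": [0.9, 0.1, 0.0],
    "city_traffic.wav": [0.1, 0.9, 0.1],
    "birdsong.wav": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text prompt such as "a beach at sunset".
query = [0.8, 0.2, 0.1]
best = retrieve_audio(query, AUDIO_DB)
print(best)
```

With these toy vectors the query lies closest to the first clip, so that clip is selected. A linear scan like this only suits small collections; a large database would typically rely on an approximate nearest-neighbor index instead.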
