Learning Profitable NFT Image Diffusions via Multiple Visual-Policy Guided Reinforcement Learning

We study the task of generating profitable Non-Fungible Token (NFT) images from user-input texts. Recent advances in diffusion models have shown great potential for image generation. However, existing works can fall short in generating visually pleasing and highly profitable NFT images, mainly because they lack 1) plentiful, fine-grained visual attribute prompts for an NFT image, and 2) effective optimization metrics for generating high-quality NFT images. To address these challenges, we propose a diffusion-based generation framework for NFT images that uses Multiple Visual-Policies as rewards (Diffusion-MVP). The framework consists of a large language model (LLM), a diffusion-based image generator, and a set of purpose-built visual rewards. First, the LLM expands a basic human input (such as "panda") into a more comprehensive NFT-style prompt that includes specific visual attributes, such as "panda with Ninja style and green background." Second, the diffusion-based image generator is fine-tuned on a large-scale NFT dataset to capture the fine-grained image styles and accessory compositions of popular NFT elements. Third, we use multiple visual policies as optimization goals: visual rarity levels, visual aesthetic scores, and CLIP-based text-image relevance. This design is intended to let Diffusion-MVP mint NFT images with high visual quality and market value. To facilitate this research, we have collected the largest publicly available NFT image dataset to date, consisting of 1.5 million high-quality images with corresponding texts and market values. Extensive experiments, including objective evaluations and user studies, demonstrate that our framework generates NFT images with more visually engaging elements and higher market value than SOTA approaches.
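
To make the third component concrete, the following is a minimal sketch of one way three per-image scores could be folded into a single scalar reward. It is an illustration only: the abstract does not state how the rewards are combined, so the weighted sum, the `RewardWeights` type, the `combined_reward` function, and the assumption that each score is already normalized to a comparable range are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical mixing weights; the abstract does not specify any."""
    rarity: float = 1.0
    aesthetic: float = 1.0
    relevance: float = 1.0


def combined_reward(
    rarity: float,
    aesthetic: float,
    relevance: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Return a weighted sum of three per-image scores.

    Assumes each score was computed elsewhere (a rarity level, an
    aesthetic score, and a CLIP-based text-image relevance) and
    normalized to a comparable range. This is an assumed combination
    rule, not the paper's stated objective.
    """
    return (
        weights.rarity * rarity
        + weights.aesthetic * aesthetic
        + weights.relevance * relevance
    )


if __name__ == "__main__":
    # Score one hypothetical generated image with equal weights.
    print(combined_reward(rarity=0.5, aesthetic=0.25, relevance=0.25))
```

In a reinforcement-learning setup, a scalar of this kind would serve as the reward signal for each generated sample; the relative weights would control the trade-off between rarity, aesthetics, and faithfulness to the prompt.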
