Instruct-Imagen: Image Generation with Multi-modal Instruction

(*: Authors contributed equally)
⬥ :   Google DeepMind
⬦ :   Google Research

Instruct-Imagen generalizes to heterogeneous and complex image generation tasks.

This paper presents Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format.
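To make this format concrete, the sketch below shows one plausible way to represent a multi-modal instruction in code: natural-language text that names its attached modality payloads (a style image, an edge map, etc.). The MultiModalInstruction class, its field names, and the placeholder arrays are illustrative assumptions, not the data format used in the paper.

import numpy as np
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MultiModalInstruction:
    # Natural-language instruction that refers to attached contexts by name,
    # e.g. "... in the style of [style_image], following [edge_map]".
    text: str
    # Named multi-modal contexts (reference images, edge maps, masks, ...).
    contexts: Dict[str, np.ndarray] = field(default_factory=dict)

instruction = MultiModalInstruction(
    text="Generate a cabin by a lake in the style of [style_image], "
         "with the layout given by [edge_map].",
    contexts={
        "style_image": np.zeros((512, 512, 3), dtype=np.uint8),  # placeholder pixels
        "edge_map": np.zeros((512, 512, 1), dtype=np.uint8),     # placeholder edges
    },
)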

We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training, to enhance the model's ability to ground its generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
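The sketch below lays out the control flow of this two-stage recipe. The DiffusionModel stub, the Batch container, and the data iterators are stand-ins chosen to keep the example self-contained; they are not the authors' training code, and the real objective is the usual conditional denoising loss of the diffusion backbone.

from dataclasses import dataclass
from typing import Dict, Iterable, List

@dataclass
class Batch:
    target_images: List       # images the model learns to denoise/reconstruct
    text: List[str]           # captions (stage 1) or instruction text (stage 2)
    contexts: List[Dict]      # retrieved neighbors (stage 1) or instruction payloads (stage 2)

class DiffusionModel:
    """Stand-in for the pre-trained text-to-image diffusion backbone."""
    def training_step(self, batch: Batch) -> float:
        return 0.0            # placeholder for the conditional denoising loss

def train(model: DiffusionModel, batches: Iterable[Batch]) -> None:
    for batch in batches:
        loss = model.training_step(batch)   # one gradient update per batch in practice

def finetune_instruct_imagen(
    model: DiffusionModel,
    retrieval_augmented_data: Iterable[Batch],
    instruction_tuning_data: Iterable[Batch],
) -> DiffusionModel:
    # Stage 1: retrieval-augmented training -- condition generation on retrieved
    # multi-modal neighbors so the model learns to ground output in external context.
    train(model, retrieval_augmented_data)
    # Stage 2: multi-modal instruction tuning -- a mixture of vision-language
    # generation tasks, each recast as (multi-modal instruction, target image).
    train(model, instruction_tuning_data)
    return model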

Instruct-Imagen extends a pre-trained text-to-image Imagen model with two-stage fine-tuning

With a small number of additional parameters (10%), Instruct-Imagen can be trained to understand multi-modal instructions, without any discriminative training.

Instruct-Imagen excels at various in-domain generation tasks

Unlike some SoTA methods such as DreamBooth or StyleDrop, Instruct-Imagen does not require any test-time fine-tuning and can run inference on the fly (~18.2 sec per example!)
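To make that contrast concrete, here is a stubbed sketch: subject- or style-driven methods like DreamBooth fine-tune a copy of the model per input before sampling, whereas Instruct-Imagen answers with a single conditional sampling call. The InstructImagenStub class and its sample signature are hypothetical placeholders, not a released API.

from typing import Dict, List

class InstructImagenStub:
    """Stand-in for a trained Instruct-Imagen model."""
    def sample(self, instruction: str, contexts: Dict[str, List]) -> str:
        # A real model would run diffusion sampling conditioned on the instruction
        # text and its attached multi-modal contexts (no gradient updates involved).
        return f"generated image for: {instruction}"

subject_images: List = []  # a few reference photos of the subject

# DreamBooth/StyleDrop workflow (for contrast): fine-tune on subject_images first,
# then sample -- minutes of per-subject optimization before any image appears.
# Instruct-Imagen workflow: no test-time fine-tuning; one sampling call that
# conditions directly on the multi-modal instruction.
model = InstructImagenStub()
image = model.sample(
    instruction="Generate a photo of the dog in [subject_images] at the beach.",
    contexts={"subject_images": subject_images},
)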

More importantly, Instruct-Imagen zero-shot generalizes to compositional and more complex tasks

Without training on any data from the tasks below, Instruct-Imagen can produce very reasonable images with respect to the complex multi-modal instructions.
          

Cite us using the following BibTex:

@article{hu2024instruct,
  title={Instruct-Imagen: Image Generation with Multi-modal Instruction},
  author={Hu, Hexiang and Chan, Kelvin and Su, Yu-Chuan and Chen, Wenhu and Li, Yandong and Yang, Zhao and Ben, Xue and Gong, Boqing and Cohen, William W and Chang, Ming-Wei and Jia, Xuhui},
  journal={arXiv preprint arXiv:2401.01952},
  year={2024}
}

Broader Impact

Text-to-image generation models like Imagen and Stable Diffusion present ethical concerns, including social bias. Instruct-Imagen, which uses similar Web-scale datasets, faces these same issues. Its retrieval-augmented training and multi-modal instruction tuning notably enhance image controllability and attribution. This control can be beneficial or harmful. One risk is the use of Instruct-Imagen for malicious activities, such as creating misleading images of people. Conversely, it offers advantages, like reducing image hallucination and improving relevance to user intent. It also benefits minority communities by effectively generating images of less-known landmarks, foods, and cultural artifacts, helping to address bias in AI systems. To mitigate public risks, we will be cautious with code and API releases. Future work will focus on a responsible-use framework, weighing the benefits of research transparency against the dangers of open access, to ensure safe and beneficial usage.

Special Thanks

We thank Zhiwei Deng, Jason Baldridge, and Nando de Freitas for their in-depth review of an early version of this paper and for their valuable comments and suggestions. Special thanks to Han Zhang for discussing the project idea in the early stage of this project. We also thank Irina Blok for providing a style image used in our evaluation.