Title: Efficient Text-to-Image and Text-to-3D modeling

Organizers: Shobhita Sundaram, Sadeep Jayasumana, Varun Jampani, Dilip Krishnan, Srikumar Ramalingam

Tutorial Schedule

Time | Instructor | Title
9:00 AM | Organizers | Overview of the Tutorial
9:15 AM | Srikumar Ramalingam | Introduction to text-to-image models
9:45 AM | Dilip Krishnan | Parallel Decoding for Image Generation
10:15 AM | | Break
10:45 AM | Sadeep Jayasumana | Fast Image Generation Algorithms
11:15 AM | Shobhita Sundaram | Image Evaluation Methods
11:45 AM | Varun Jampani | Efficient Text-to-3D modeling
12:15 PM | Organizers | Open Problems

Tutorial Contents

Introduction to text-to-image models

We will cover the fundamentals of diffusion-based and token-based image generation models such as Stable Diffusion, Imagen, DALL-E, and Parti, looking at both pixel-based and token-based diffusion models. We will also discuss evaluation metrics for text-to-image models, such as FID and CMMD.
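As a rough illustration of the kind of metric mentioned here, the Fréchet distance that underlies FID can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions, not a reference implementation: in actual FID the features come from an Inception network, whereas `features_real` and `features_fake` below are hypothetical random arrays used only to exercise the formula.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Hypothetical stand-in features: the "fake" set is the "real" set shifted by 3
# along one axis, so the covariances match and only the mean term contributes.
rng = np.random.default_rng(0)
features_real = rng.normal(size=(500, 4))
features_fake = features_real + np.array([3.0, 0.0, 0.0, 0.0])

print(frechet_distance(features_real, features_real))
print(frechet_distance(features_real, features_fake))
```

In this toy setup the distance between a set and itself is numerically close to zero, and the shifted set differs only through the squared mean offset.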

Parallel decoding and image generation

We will focus on existing methods for token-based image generation that exploit non-autoregressive techniques to achieve speedups. In particular, we will take a closer look at techniques such as MaskGIT and Muse, which use progressive parallel decoding to produce high-quality images more efficiently than autoregressive methods such as Parti.
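The progressive parallel decoding idea can be sketched roughly as follows. This is a toy NumPy illustration in the spirit of MaskGIT, not the authors' implementation: `toy_predictor` is a hypothetical stand-in that returns random probabilities in place of a trained transformer, `MASK` is an assumed sentinel id, and greedy argmax replaces the sampling used in practice. Each step predicts all masked tokens at once, keeps the most confident predictions, and re-masks the rest according to a cosine schedule.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id marking a still-masked position

def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model: per-position probabilities over the vocabulary."""
    logits = rng.normal(size=(tokens.shape[0], vocab_size))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def parallel_decode(num_tokens, vocab_size, num_steps, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(num_steps):
        masked = tokens == MASK
        probs = toy_predictor(tokens, vocab_size, rng)
        pred = probs.argmax(axis=1)   # greedy choice for every position
        conf = probs.max(axis=1)      # confidence of that choice
        # Already-decoded positions get infinite confidence so they are never re-masked.
        conf = np.where(masked, conf, np.inf)
        tokens = np.where(masked, pred, tokens)
        # Cosine schedule: number of tokens that remain masked after this step.
        ratio = np.cos(np.pi / 2 * (step + 1) / num_steps)
        num_remask = int(np.floor(ratio * num_tokens))
        if num_remask > 0:
            tokens[np.argsort(conf)[:num_remask]] = MASK
    return tokens

tokens = parallel_decode(num_tokens=16, vocab_size=8, num_steps=4)
print(tokens)
```

With `num_steps=4` and 16 tokens, the loop calls the predictor 4 times, whereas token-by-token autoregressive decoding would need 16 calls.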

Image Evaluation Methods

We will cover current metrics for image generation (such as FID) and popular image similarity and quality assessment metrics, such as LPIPS [16]. We will also discuss recent, efficient image similarity metrics, such as DreamSim, that are trained using synthetic data and parameter-efficient fine-tuning of large ViT backbones.
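To make the notion of a learned similarity metric concrete, here is a minimal sketch of an embedding-based distance, assuming the metric is the cosine distance between image embeddings. The embeddings below are hypothetical random vectors standing in for the output of a fine-tuned backbone, and `two_afc_choice` is an illustrative helper that mimics a two-alternative forced-choice judgment (which of two candidates is closer to a reference).

```python
import numpy as np

def cosine_distance(emb_a, emb_b):
    """1 - cosine similarity between two embedding vectors."""
    denom = np.linalg.norm(emb_a) * np.linalg.norm(emb_b)
    return float(1.0 - emb_a @ emb_b / denom)

def two_afc_choice(ref, cand_0, cand_1):
    """Return 0 or 1: the index of the candidate closer to the reference."""
    return 0 if cosine_distance(ref, cand_0) <= cosine_distance(ref, cand_1) else 1

# Hypothetical embeddings: `near` is a small perturbation of `ref`, `far` is unrelated.
rng = np.random.default_rng(0)
ref = rng.normal(size=64)
near = ref + 0.1 * rng.normal(size=64)
far = rng.normal(size=64)

print(two_afc_choice(ref, near, far))
```

A metric of this form is only as good as its embedding function; the sketch shows the comparison step, not how the backbone is trained.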

Efficient Text-to-3D Algorithms

Beyond text-to-image, several newer methods target other modalities such as text-to-3D. We will discuss the basics of text-to-3D algorithms, such as repurposing text-to-image models for multi-view generation, as well as efficient ways to directly predict 3D models from a single image within a few seconds.


1. Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K., Freeman, W.T., Rubinstein, M., Li, Y., Krishnan, D.: Muse: Text-to-image generation via masked generative transformers. In: ICML (2023)

2. Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: MaskGIT: Masked generative image transformer. In: CVPR (2022)

3. Fu*, S., Tamir*, N., Sundaram*, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: DreamSim: Learning new dimensions of human visual similarity using synthetic data. In: NeurIPS (2023)

4. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (2018)

5. Jayasumana, S., Glasner, D., Ramalingam, S., Veit, A., Chakrabarti, A., Kumar, S.: MarkovGen: Structured prediction for efficient text-to-image generation (2023)

6. Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking FID: Towards a better evaluation metric for image generation (2024)

7. Midjourney: (2022), https://www.midjourney.com

8. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: Text-to-3D using 2D diffusion (2022)

9. Raj, A., Kaza, S., Poole, B., Niemeyer, M., Mildenhall, B., Ruiz, N., Zada, S., Aberman, K., Rubinstein, M., Barron, J., Li, Y., Jampani, V.: DreamBooth3D: Subject-driven text-to-3D generation. In: ICCV (2023)

10. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. preprint (2022), [arXiv:2204.06125]

11. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

12. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation (2022)

13. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. preprint (2022), [arXiv:2205.11487]

14. Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: ICLR (2022)

15. Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H., Baldridge, J., Wu, Y.: Scaling autoregressive models for content-rich text-to-image generation. In: ICML (2022)

16. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

Code Repository References