Tutorial T05: Efficient Text-to-Image and Text-to-3D modeling
ECCV 2024 Tutorial, September 29, 2024, Milan, Italy

Overview

We are witnessing groundbreaking results in text-to-image and text-to-3D models. However, generation with these models is iterative and computationally expensive, requiring multiple sampling steps through large models. There is a growing need to make these algorithms faster so that they can serve millions of users without requiring an excessive number of GPUs/TPUs. In this course, we will focus on techniques such as progressive parallel decoding, distillation methods, and Markov Random Fields to speed up generative models. The course will also highlight the limitations of popular image evaluation techniques such as FID and present efficient alternatives such as CMMD and DreamSim.

Speakers

Shobhita Sundaram
Massachusetts Institute of Technology (MIT)

Sadeep Jayasumana
Google Research

Varun Jampani
Stability AI

Dilip Krishnan
Google DeepMind

Srikumar Ramalingam
Google Research

Schedule

Date: September 29, 2024
Time: 9:10 AM - 1:00 PM
Location: Amber 3

Time	Instructor	Title
9:10 AM	Srikumar Ramalingam	Cornerstones of the Text-To-Pixels Journey
9:50 AM	Shobhita Sundaram	Image Evaluation Methods
10:30 AM	Break
11:00 AM	Varun Jampani	Thinking Slow and Fast: Recent Trends in 3D Generative Models
11:30 AM	Dilip Krishnan	Parallel Decoding and Image Generation
12:00 PM	Sadeep Jayasumana	Structured Prediction Algorithms for Fast Image Generation

Tutorial Contents

Cornerstones of the Text-to-Pixels Journey

This session will introduce the fundamentals of text-to-image algorithms, such as codebook learning, along with diffusion and token-based image generation models such as Stable Diffusion, Imagen, DALL-E, and Parti.

Image evaluation methods

We will discuss evaluation metrics for text-to-image models, such as FID, CMMD, and DreamSim.
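
As a rough illustration of the kind of distance behind CMMD, the sketch below estimates a squared maximum mean discrepancy (MMD) with a Gaussian RBF kernel between two sets of embedding vectors. CMMD [6] applies such a distance to CLIP embeddings of images; here random NumPy arrays stand in for real embeddings, and the function name `mmd2_rbf`, the bandwidth `sigma`, and the array sizes are assumptions chosen for illustration, not the reference implementation.

```python
import numpy as np

def mmd2_rbf(x, y, sigma=10.0):
    """Biased (V-statistic) estimate of squared MMD with a Gaussian RBF kernel.

    x, y: arrays of shape (n, d) and (m, d) holding embedding vectors.
    sigma: kernel bandwidth (an assumed value, not tuned).
    """
    def kernel(a, b):
        # Pairwise squared Euclidean distances between rows of a and b.
        sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
        # Clip tiny negative values caused by floating-point rounding.
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

# Hypothetical stand-ins for embeddings of real and generated images.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 16))
same = rng.normal(0.0, 1.0, size=(200, 16))     # drawn from the same distribution
shifted = rng.normal(1.0, 1.0, size=(200, 16))  # drawn from a shifted distribution

d_same = mmd2_rbf(real, same)
d_shifted = mmd2_rbf(real, shifted)
print(f"same: {d_same:.4f}  shifted: {d_shifted:.4f}")
```

In this toy setup the shifted set should score a noticeably larger distance than the set drawn from the same distribution, which is the behavior an evaluation metric of this kind relies on.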

Parallel decoding and image generation

We will focus on existing methods for token-based image generation that exploit non-autoregressive techniques to achieve speedup. In particular, we will take a closer look at techniques such as MaskGIT and Muse, which use progressive parallel decoding to produce high-quality images more efficiently than autoregressive methods such as Parti.
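
To make the idea of progressive parallel decoding concrete, here is a minimal sketch in the spirit of MaskGIT [2]: all tokens start masked, every position is predicted in parallel at each step, the most confident predictions are kept, and the rest are re-masked according to a cosine schedule. The names `parallel_decode` and `toy_predict` are hypothetical, the "model" returns pseudo-random logits rather than the output of a trained transformer, and details of the published method such as noise added to the confidence scores are omitted.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position (an assumption of this sketch)

def parallel_decode(predict_logits, num_tokens, vocab_size, steps=8, seed=0):
    """Iteratively fill in a masked token sequence, many tokens per step."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for t in range(steps):
        logits = predict_logits(tokens)  # shape (num_tokens, vocab_size)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        # Sample a candidate token for every position in parallel.
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        conf = probs[np.arange(num_tokens), sampled]
        masked = tokens == MASK
        # Tokens fixed in earlier steps get infinite confidence so they stay fixed.
        conf = np.where(masked, conf, np.inf)
        candidate = np.where(masked, sampled, tokens)
        # Cosine schedule: number of positions that remain masked after this step.
        keep_masked = int(np.floor(num_tokens * np.cos(np.pi / 2 * (t + 1) / steps)))
        # Re-mask the least confident predictions.
        remask = np.argsort(conf)[:keep_masked]
        candidate[remask] = MASK
        tokens = candidate
    return tokens

def toy_predict(tokens, vocab_size=32):
    # Stand-in for a masked transformer: deterministic pseudo-random logits.
    local = np.random.default_rng(int((tokens != MASK).sum()))
    return local.normal(size=(tokens.size, vocab_size))

out = parallel_decode(toy_predict, num_tokens=16, vocab_size=32, steps=8)
print(out)
```

With 16 tokens and 8 steps, the sequence is completed in 8 model calls instead of the 16 that one-token-at-a-time autoregressive decoding would need; the gap widens as the number of tokens grows while the step count stays fixed.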

Fast Image Generation Techniques

We will cover current metrics for image generation (such as FID) and popular image similarity and quality assessment metrics, such as LPIPS [16]. We will also discuss recent, efficient image similarity metrics, such as DreamSim, that are trained using synthetic data and parameter-efficient fine-tuning of large ViT backbones.

Thinking Slow and Fast: Recent Trends in 3D Generative Models

Beyond text-to-image, several newer methods target other modalities, such as text-to-3D. We will discuss the basics of text-to-3D algorithms, such as repurposing text-to-image models for multi-view generation, as well as efficient ways to directly predict 3D models from a single image within a few seconds.

References

1. Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K., Freeman, W.T., Rubinstein, M., Li, Y., Krishnan, D.: Muse: Text-to-image generation via masked generative transformers. ICML (2023)

2. Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. In: CVPR (2022)

3. Fu*, S., Tamir*, N., Sundaram*, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In: NeurIPS (2023)

4. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (2018)

5. Jayasumana, S., Glasner, D., Ramalingam, S., Veit, A., Chakrabarti, A., Kumar, S.: Markovgen: Structured prediction for efficient text-to-image generation (2023)

6. Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking fid: Towards a better evaluation metric for image generation (2024)

7. Midjourney: (2022), https://www.midjourney.com

8. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion (2022)

9. Raj, A., Kaza, S., Poole, B., Niemeyer, M., Mildenhall, B., Ruiz, N., Zada, S., Aberman, K., Rubinstein, M., Barron, J., Li, Y., Jampani, V.: Dreambooth3d: Subject-driven text-to-3d generation. ICCV (2023)

10. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. preprint (2022), [arXiv:2204.06125]

11. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

12. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation (2022)

13. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. preprint (2022), [arXiv:2205.11487]

14. Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: ICLR (2022)

15. Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H., Baldridge, J., Wu, Y.: Scaling autoregressive models for content-rich text-to-image generation. In: ICML (2022)

16. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
