Schedule
- Date: September 29, 2024
- Time: 9:10 AM - 1:00 PM
- Location: Amber 3
| Time | Instructor | Title |
|---|---|---|
| 9:10 AM | Srikumar Ramalingam | Cornerstones of the Text-To-Pixels Journey |
| 9:50 AM | Shobhita Sundaram | Image Evaluation Methods |
| 10:30 AM | Break | |
| 11:00 AM | Varun Jampani | Thinking Slow and Fast: Recent Trends in 3D Generative Models |
| 11:30 AM | Dilip Krishnan | Parallel Decoding and Image Generation |
| 12:00 PM | Sadeep Jayasumana | Structured Prediction Algorithms for Fast Image Generation |
Tutorial Contents
Cornerstones of the Text-To-Pixels Journey
This part introduces the fundamentals of text-to-image algorithms, including codebook learning, as well as diffusion-based and token-based image generation models such as Stable Diffusion, Imagen, DALL-E, and Parti.
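To make the codebook idea concrete, here is a minimal numpy sketch of the nearest-neighbor quantization step used by VQ-style image tokenizers, the discrete representation underlying token-based models such as Parti [15] and Muse [1]. The shapes, codebook size, and function name are illustrative, not any particular model's API.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  (N, D) array of encoder outputs.
    codebook: (K, D) array of learned code vectors.
    Returns (N,) integer token ids and the (N, D) quantized vectors.
    """
    # Squared distances via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
    dists = (np.sum(latents ** 2, axis=1, keepdims=True)
             - 2.0 * latents @ codebook.T
             + np.sum(codebook ** 2, axis=1))
    ids = np.argmin(dists, axis=1)        # discrete "image tokens"
    return ids, codebook[ids]

rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 8))        # e.g. a 4x4 grid of 8-d latents
codebook = rng.normal(size=(1024, 8))     # K = 1024 learned codes
ids, quantized = quantize(latents, codebook)
print(ids[:8])                            # token ids fed to the generator
```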
Image Evaluation Methods
We will discuss evaluation metrics for text-to-image models, such as FID, CMMD, and DreamSim.
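As a reference point for the metrics discussion, the sketch below computes FID [4] from two sets of image embeddings: the Frechet distance between Gaussians fitted to real and generated features. In practice the features come from an Inception network; the random inputs here are placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_*: (N, D) arrays of image embeddings.
    Returns ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s1 @ s2).real        # drop tiny imaginary parts
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(512, 64))          # placeholder "real" embeddings
gen = rng.normal(loc=0.2, size=(512, 64))  # placeholder "generated" ones
print(fid(real, gen))
```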
Parallel Decoding and Image Generation
We will focus on existing methods for token-based image generation that exploit non-autoregressive techniques to achieve speedups. In particular, we will take a closer look at techniques such as MaskGIT and Muse, which use progressive parallel decoding to produce high-quality images more efficiently than autoregressive methods such as Parti.
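The following is a simplified sketch of MaskGIT-style progressive parallel decoding [2]: start from a fully masked token grid, predict all positions at once, commit the most confident predictions according to a cosine schedule, and re-mask the rest. Here predict_fn stands in for a trained masked image transformer; the schedule and sampling details are illustrative approximations, not the exact published procedure.

```python
import numpy as np

def parallel_decode(predict_fn, seq_len, vocab, steps=8, mask_id=-1, seed=0):
    """MaskGIT-style progressive parallel decoding (illustrative sketch).

    predict_fn(tokens) -> (seq_len, vocab) per-position probabilities;
    it stands in for a trained masked image transformer.
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, mask_id)
    for step in range(steps):
        probs = predict_fn(tokens)
        sampled = np.array([rng.choice(vocab, p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        conf[tokens != mask_id] = np.inf   # never revisit committed tokens
        # Cosine schedule: fraction of positions committed after this step.
        keep = int(np.ceil(seq_len * (1 - np.cos(np.pi / 2 * (step + 1) / steps))))
        commit = np.argsort(-conf)[:keep]
        tokens[commit] = np.where(tokens[commit] == mask_id,
                                  sampled[commit], tokens[commit])
    return tokens

# Toy usage with a uniform "model"; a real predictor would condition on text.
uniform = lambda toks: np.full((len(toks), 16), 1.0 / 16)
print(parallel_decode(uniform, seq_len=64, vocab=16))
```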
Fast Image Generation Techniques
We will cover current metrics for image generation (such as FID) and popular image similarity and quality assessment metrics, such as LPIPS [16]. We will also discuss recent, efficient image similarity metrics, such as DreamSim, which are trained using synthetic data and parameter-efficient fine-tuning of large ViT backbones.
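To illustrate the structure of learned perceptual metrics such as LPIPS [16], the sketch below computes a weighted distance between unit-normalized deep features of two images. The backbone activations and per-channel weights are assumed given, and all names are illustrative; DreamSim [3] follows a related recipe on ViT features.

```python
import numpy as np

def perceptual_distance(feats_a, feats_b, layer_weights, eps=1e-10):
    """LPIPS-style distance between two images' deep features.

    feats_a, feats_b: lists of (H, W, C) backbone activations, one per layer.
    layer_weights:    list of (C,) non-negative learned channel weights.
    Unit-normalizes channels at each spatial location, then averages the
    weighted squared differences over space and sums over layers.
    """
    total = 0.0
    for fa, fb, w in zip(feats_a, feats_b, layer_weights):
        fa = fa / (np.linalg.norm(fa, axis=-1, keepdims=True) + eps)
        fb = fb / (np.linalg.norm(fb, axis=-1, keepdims=True) + eps)
        total += np.mean(np.sum(w * (fa - fb) ** 2, axis=-1))
    return total

rng = np.random.default_rng(0)
fa = [rng.normal(size=(32, 32, 64)), rng.normal(size=(16, 16, 128))]
fb = [rng.normal(size=(32, 32, 64)), rng.normal(size=(16, 16, 128))]
ws = [np.ones(64), np.ones(128)]
print(perceptual_distance(fa, fb, ws))
```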
Thinking Slow and Fast: Recent Trends in 3D Generative Models
Extending beyond text-to-image, several newer methods focus on other modalities such as text-to-3D. We will discuss the basics of text-to-3D algorithms, such as repurposing text-to-image models for multi-view generation, as well as efficient ways to directly predict 3D models from a single image within a few seconds.
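A central tool for repurposing text-to-image models for 3D is score distillation sampling from DreamFusion [8]. The sketch below computes the SDS residual for one rendered view; eps_pred_fn stands in for a frozen, text-conditioned diffusion denoiser, and the weighting is one common choice rather than the only option.

```python
import numpy as np

rng = np.random.default_rng(0)

def sds_residual(rendered, eps_pred_fn, t, alpha_bar):
    """Score Distillation Sampling residual for one view (sketch).

    rendered:    image rendered from the current 3D representation.
    eps_pred_fn: frozen text-to-image denoiser, eps_hat = f(x_t, t);
                 a stand-in here for a real, text-conditioned model.
    Forward-diffuse the render, ask the denoiser for its noise estimate,
    and return w(t) * (eps_hat - eps). In an autodiff framework this
    residual is treated as a constant and backpropagated through the
    renderer to update the 3D parameters.
    """
    eps = rng.normal(size=rendered.shape)
    x_t = np.sqrt(alpha_bar[t]) * rendered + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = eps_pred_fn(x_t, t)
    w = 1.0 - alpha_bar[t]               # one common weighting choice
    return w * (eps_hat - eps)

alpha_bar = np.linspace(0.999, 0.01, 1000)   # toy noise schedule
rendered = rng.normal(size=(64, 64, 3))      # stand-in rendered view
res = sds_residual(rendered, lambda x, t: np.zeros_like(x), t=500,
                   alpha_bar=alpha_bar)
print(res.shape)
```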
References
1. Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K., Freeman, W.T., Rubinstein, M., Li, Y., Krishnan, D.: Muse: Text-to-image generation via masked generative transformers. In: ICML (2023)
2. Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. In: CVPR (2022)
3. Fu*, S., Tamir*, N., Sundaram*, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In: NeurIPS (2023)
4. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium (2018)
5. Jayasumana, S., Glasner, D., Ramalingam, S., Veit, A., Chakrabarti, A., Kumar, S.: Markovgen: Structured prediction for efficient text-to-image generation (2023)
6. Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking FID: Towards a better evaluation metric for image generation (2024)
7. Midjourney (2022), https://www.midjourney.com
8. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion (2022)
9. Raj, A., Kaza, S., Poole, B., Niemeyer, M., Mildenhall, B., Ruiz, N., Zada, S., Aberman, K., Rubinstein, M., Barron, J., Li, Y., Jampani, V.: Dreambooth3d: Subject-driven text-to-3d generation. In: ICCV (2023)
10. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. preprint (2022), arXiv:2204.06125
11. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
12. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation (2022)
13. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. preprint (2022), arXiv:2205.11487
14. Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: ICLR (2022)
15. Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H., Baldridge, J., Wu, Y.: Scaling autoregressive models for content-rich text-to-image generation. In: ICML (2022)
16. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)