Schedule
- Date: September 29, 2024
- Time: 9:10 AM - 1:00 PM
- Location: Amber 3
| Time | Instructor | Title |
|---|---|---|
| 9:10 AM | Srikumar Ramalingam | Cornerstones of the Text-To-Pixels Journey |
| 9:50 AM | Shobhita Sundaram | Image Evaluation Methods |
| 10:30 AM | Break | |
| 11:00 AM | Varun Jampani | Thinking Slow and Fast: Recent Trends in 3D Generative Models |
| 11:30 AM | Dilip Krishnan | Parallel Decoding and Image Generation |
| 12:00 PM | Sadeep Jayasumana | Structured Prediction Algorithms for Fast Image Generation |
Tutorial Contents
Cornerstones of the Text-To-Pixels Journey
This part introduces the fundamentals of text-to-image algorithms, including codebook learning, as well as diffusion-based and token-based image generation models such as Stable Diffusion, Imagen, DALL-E, and Parti.
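To make the codebook idea concrete, here is a minimal numpy sketch of the nearest-neighbor quantization step used by VQ-style image tokenizers, the discrete representation underlying token-based models such as Parti [15] and Muse [1]. The shapes, codebook size, and function name are illustrative, not any particular model's API.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  (N, D) array of encoder outputs.
    codebook: (K, D) array of learned code vectors.
    Returns (N,) integer token ids and the (N, D) quantized vectors.
    """
    # Squared distances via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
    dists = (np.sum(latents ** 2, axis=1, keepdims=True)
             - 2.0 * latents @ codebook.T
             + np.sum(codebook ** 2, axis=1))
    ids = np.argmin(dists, axis=1)        # discrete "image tokens"
    return ids, codebook[ids]

rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 8))        # e.g. a 4x4 grid of 8-d latents
codebook = rng.normal(size=(1024, 8))     # K = 1024 learned codes
ids, quantized = quantize(latents, codebook)
print(ids[:8])                            # token ids fed to the generator
```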
Image Evaluation Methods
We will discuss evaluation metrics for text-to-image models, such as FID, CMMD, and DreamSim.
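As a reference point for the metrics discussion, the sketch below computes FID [4] from two sets of image embeddings: the Frechet distance between Gaussians fitted to real and generated features. In practice the features come from an Inception network; the random inputs here are placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_*: (N, D) arrays of image embeddings.
    Returns ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s1 @ s2).real        # drop tiny imaginary parts
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(512, 64))          # placeholder "real" embeddings
gen = rng.normal(loc=0.2, size=(512, 64))  # placeholder "generated" ones
print(fid(real, gen))
```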
Parallel Decoding and Image Generation
We will focus on existing methods for token-based image generation that exploit non-autoregressive techniques to achieve speedups. In particular, we will take a closer look at techniques such as MaskGIT and Muse, which use progressive parallel decoding to produce high-quality images more efficiently than autoregressive methods such as Parti.
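The following is a simplified sketch of MaskGIT-style progressive parallel decoding [2]: start from a fully masked token grid, predict all positions at once, commit the most confident predictions according to a cosine schedule, and re-mask the rest. Here predict_fn stands in for a trained masked image transformer; the schedule and sampling details are illustrative approximations, not the exact published procedure.

```python
import numpy as np

def parallel_decode(predict_fn, seq_len, vocab, steps=8, mask_id=-1, seed=0):
    """MaskGIT-style progressive parallel decoding (illustrative sketch).

    predict_fn(tokens) -> (seq_len, vocab) per-position probabilities;
    it stands in for a trained masked image transformer.
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, mask_id)
    for step in range(steps):
        probs = predict_fn(tokens)
        sampled = np.array([rng.choice(vocab, p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        conf[tokens != mask_id] = np.inf   # never revisit committed tokens
        # Cosine schedule: fraction of positions committed after this step.
        keep = int(np.ceil(seq_len * (1 - np.cos(np.pi / 2 * (step + 1) / steps))))
        commit = np.argsort(-conf)[:keep]
        tokens[commit] = np.where(tokens[commit] == mask_id,
                                  sampled[commit], tokens[commit])
    return tokens

# Toy usage with a uniform "model"; a real predictor would condition on text.
uniform = lambda toks: np.full((len(toks), 16), 1.0 / 16)
print(parallel_decode(uniform, seq_len=64, vocab=16))
```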
Fast Image Generation Techniques
We will cover current metrics for image generation (such as FID) and popular image similarity and quality assessment metrics, such as LPIPS [16]. We will also discuss recent, efficient image similarity metrics, such as DreamSim, which are trained using synthetic data and parameter-efficient fine-tuning of large ViT backbones.
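To illustrate the structure of learned perceptual metrics such as LPIPS [16], the sketch below computes a weighted distance between unit-normalized deep features of two images. The backbone activations and per-channel weights are assumed given, and all names are illustrative; DreamSim [3] follows a related recipe on ViT features.

```python
import numpy as np

def perceptual_distance(feats_a, feats_b, layer_weights, eps=1e-10):
    """LPIPS-style distance between two images' deep features.

    feats_a, feats_b: lists of (H, W, C) backbone activations, one per layer.
    layer_weights:    list of (C,) non-negative learned channel weights.
    Unit-normalizes channels at each spatial location, then averages the
    weighted squared differences over space and sums over layers.
    """
    total = 0.0
    for fa, fb, w in zip(feats_a, feats_b, layer_weights):
        fa = fa / (np.linalg.norm(fa, axis=-1, keepdims=True) + eps)
        fb = fb / (np.linalg.norm(fb, axis=-1, keepdims=True) + eps)
        total += np.mean(np.sum(w * (fa - fb) ** 2, axis=-1))
    return total

rng = np.random.default_rng(0)
fa = [rng.normal(size=(32, 32, 64)), rng.normal(size=(16, 16, 128))]
fb = [rng.normal(size=(32, 32, 64)), rng.normal(size=(16, 16, 128))]
ws = [np.ones(64), np.ones(128)]
print(perceptual_distance(fa, fb, ws))
```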
Thinking Slow and Fast: Recent Trends in 3D Generative Models
Extending beyond text-to-image, several newer methods focus on other modalities such as text-to-3D. We will discuss the basics of text-to-3D algorithms, such as repurposing text-to-image models for multi-view generation, as well as efficient ways to directly predict 3D models from a single image within a few seconds.
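A central tool for repurposing text-to-image models for 3D is score distillation sampling from DreamFusion [8]. The sketch below computes the SDS residual for one rendered view; eps_pred_fn stands in for a frozen, text-conditioned diffusion denoiser, and the weighting is one common choice rather than the only option.

```python
import numpy as np

rng = np.random.default_rng(0)

def sds_residual(rendered, eps_pred_fn, t, alpha_bar):
    """Score Distillation Sampling residual for one view (sketch).

    rendered:    image rendered from the current 3D representation.
    eps_pred_fn: frozen text-to-image denoiser, eps_hat = f(x_t, t);
                 a stand-in here for a real, text-conditioned model.
    Forward-diffuse the render, ask the denoiser for its noise estimate,
    and return w(t) * (eps_hat - eps). In an autodiff framework this
    residual is treated as a constant and backpropagated through the
    renderer to update the 3D parameters.
    """
    eps = rng.normal(size=rendered.shape)
    x_t = np.sqrt(alpha_bar[t]) * rendered + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = eps_pred_fn(x_t, t)
    w = 1.0 - alpha_bar[t]               # one common weighting choice
    return w * (eps_hat - eps)

alpha_bar = np.linspace(0.999, 0.01, 1000)   # toy noise schedule
rendered = rng.normal(size=(64, 64, 3))      # stand-in rendered view
res = sds_residual(rendered, lambda x, t: np.zeros_like(x), t=500,
                   alpha_bar=alpha_bar)
print(res.shape)
```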
References
1. Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K., Freeman, W.T., Rubinstein, M., Li, Y., Krishnan, D.: Muse: Text-to-image generation via masked generative transformers. In: ICML (2023)
2. Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. In: CVPR (2022)
3. Fu*, S., Tamir*, N., Sundaram*, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In: NeurIPS (2023)
4. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium (2018)
5. Jayasumana, S., Glasner, D., Ramalingam, S., Veit, A., Chakrabarti, A., Kumar, S.: Markovgen: Structured prediction for efficient text-to-image generation (2023)
6. Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking FID: Towards a better evaluation metric for image generation (2024)
7. Midjourney (2022), https://www.midjourney.com
8. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion (2022)
9. Raj, A., Kaza, S., Poole, B., Niemeyer, M., Mildenhall, B., Ruiz, N., Zada, S., Aberman, K., Rubinstein, M., Barron, J., Li, Y., Jampani, V.: Dreambooth3d: Subject-driven text-to-3d generation. In: ICCV (2023)
10. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. preprint (2022), arXiv:2204.06125
11. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
12. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation (2022)
13. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. preprint (2022), arXiv:2205.11487
14. Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: ICLR (2022)
15. Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H., Baldridge, J., Wu, Y.: Scaling autoregressive models for content-rich text-to-image generation. In: ICML (2022)
16. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)