
Taming Transformers

By CompVis
Released: December 18, 2020

Overview

Taming Transformers is a method for high-resolution image synthesis that compresses images into discrete tokens with a VQ-style autoencoder, then trains a Transformer to model those tokens. This makes generating large images practical while retaining fidelity and control.
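
For a concrete sense of how images become tokens, here is a minimal sketch of the vector-quantization step, assuming PyTorch. The function name, shapes, and codebook size are illustrative, not the authors' code; it shows how a continuous feature grid becomes a grid of discrete token indices.

    import torch

    def quantize(z, codebook):
        """Map encoder features z (B, C, H, W) to nearest-codebook indices.

        codebook: (K, C) learned embedding vectors.
        Returns the (B, H, W) token grid and the quantized features.
        """
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)  # one row per spatial position
        dists = torch.cdist(flat, codebook)          # distances to all K codes
        indices = dists.argmin(dim=1)                # nearest code per position
        z_q = codebook[indices].view(B, H, W, C).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                 # straight-through gradient trick
        return indices.view(B, H, W), z_q

    # A 256x256 image at 16x downsampling yields a 16x16 grid of tokens.
    codebook = torch.randn(1024, 256)                # K=1024 codes, 256-dim each
    z = torch.randn(2, 256, 16, 16)                  # encoder output for 2 images
    tokens, z_q = quantize(z, codebook)
    print(tokens.shape)                              # torch.Size([2, 16, 16])

The straight-through step matters because argmin is non-differentiable: copying gradients from the quantized output back onto the encoder features lets the encoder train end to end.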

Description

The approach has two stages. First, a convolutional encoder-decoder with a learned codebook (often called VQGAN) maps images to a grid of discrete indices and reconstructs them, trained with perceptual and adversarial losses so the latent space stays compact but detailed. Second, an autoregressive Transformer learns the distribution over these latent tokens, optionally conditioned on class labels, semantic layouts, or other guiding inputs. By modeling a sequence of discrete latents far shorter than the raw pixel grid, the Transformer becomes fast and data-efficient enough to handle high resolutions while keeping textures and global structure intact. The result is a flexible recipe for controllable image generation and editing, one that influenced many later latent- and token-based generative models used in creative and design workflows.
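
Below is a hedged sketch of stage two, again in PyTorch: a tiny causal Transformer stands in for the paper's GPT-style model and samples a grid of latent token indices one position at a time. The layer sizes, start token, and 16x16 grid are illustrative assumptions, and the VQGAN decoder that turns the sampled tokens back into pixels is not shown.

    import torch
    import torch.nn as nn

    class TokenTransformer(nn.Module):
        def __init__(self, vocab=1024, dim=256, heads=8, layers=4, seq_len=256):
            super().__init__()
            self.tok = nn.Embedding(vocab, dim)
            self.pos = nn.Embedding(seq_len, dim)
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, layers)
            self.head = nn.Linear(dim, vocab)
            self.seq_len = seq_len

        def forward(self, idx):
            T = idx.size(1)
            x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
            causal = torch.full((T, T), float('-inf'), device=idx.device).triu(1)
            x = self.blocks(x, mask=causal)  # each position attends only to its past
            return self.head(x)              # logits over the K codebook entries

    @torch.no_grad()
    def sample(model, batch=1, temperature=1.0):
        """Autoregressively draw a full grid of codebook indices."""
        idx = torch.zeros(batch, 1, dtype=torch.long)  # start token (an assumption)
        for _ in range(model.seq_len - 1):
            logits = model(idx)[:, -1] / temperature
            nxt = torch.multinomial(logits.softmax(-1), 1)
            idx = torch.cat([idx, nxt], dim=1)
        return idx.view(batch, 16, 16)       # reshape to the 16x16 latent grid

    model = TokenTransformer().eval()
    tokens = sample(model)                   # these indices would feed the decoder
    print(tokens.shape)                      # torch.Size([1, 16, 16])

Because each step predicts one index from the codebook rather than a pixel value, a 256x256 image at this compression rate costs only 256 sampling steps, which is what makes high resolutions tractable.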

About CompVis

CompVis is a research group focusing on computer vision and deep learning.

Industry: Artificial Intelligence
Company Size: N/A
Location: Heidelberg, DE
Website: compvis

Last updated: October 15, 2025