
Taming Transformers

By CompVis
Released: December 18, 2020

Overview

Taming Transformers is a method for high-resolution image synthesis that compresses images into discrete tokens with a VQ-style autoencoder, then trains a Transformer to model those tokens. This makes generating large images practical while retaining fidelity and control.
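
For a concrete sense of how images become tokens, here is a minimal sketch of the vector-quantization step, assuming PyTorch. The function name, shapes, and codebook size are illustrative, not the authors' code; it shows how a continuous feature grid becomes a grid of discrete token indices.

    import torch

    def quantize(z, codebook):
        """Map encoder features z (B, C, H, W) to nearest-codebook indices.

        codebook: (K, C) learned embedding vectors.
        Returns the (B, H, W) token grid and the quantized features.
        """
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)  # one row per spatial position
        dists = torch.cdist(flat, codebook)          # distances to all K codes
        indices = dists.argmin(dim=1)                # nearest code per position
        z_q = codebook[indices].view(B, H, W, C).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                 # straight-through gradient trick
        return indices.view(B, H, W), z_q

    # A 256x256 image at 16x downsampling yields a 16x16 grid of tokens.
    codebook = torch.randn(1024, 256)                # K=1024 codes, 256-dim each
    z = torch.randn(2, 256, 16, 16)                  # encoder output for 2 images
    tokens, z_q = quantize(z, codebook)
    print(tokens.shape)                              # torch.Size([2, 16, 16])

The straight-through step matters because argmin is non-differentiable: copying gradients from the quantized output back onto the encoder features lets the encoder train end to end.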

Description

The approach has two stages. First, a convolutional encoder-decoder with a learned codebook (often called VQGAN) maps images to a grid of discrete indices and reconstructs them, trained with perceptual and adversarial losses so the latent space stays compact but detailed. Second, an autoregressive Transformer learns the distribution over these latent tokens, optionally conditioned on class labels, semantic layouts, or other guiding inputs. By modeling a sequence of discrete latents far shorter than the raw pixel grid, the Transformer becomes fast and data-efficient enough to handle high resolutions while keeping textures and global structure intact. The result is a flexible recipe for controllable image generation and editing, one that influenced many later latent- and token-based generative models used in creative and design workflows.
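
Below is a hedged sketch of stage two, again in PyTorch: a tiny causal Transformer stands in for the paper's GPT-style model and samples a grid of latent token indices one position at a time. The layer sizes, start token, and 16x16 grid are illustrative assumptions, and the VQGAN decoder that turns the sampled tokens back into pixels is not shown.

    import torch
    import torch.nn as nn

    class TokenTransformer(nn.Module):
        def __init__(self, vocab=1024, dim=256, heads=8, layers=4, seq_len=256):
            super().__init__()
            self.tok = nn.Embedding(vocab, dim)
            self.pos = nn.Embedding(seq_len, dim)
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, layers)
            self.head = nn.Linear(dim, vocab)
            self.seq_len = seq_len

        def forward(self, idx):
            T = idx.size(1)
            x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
            causal = torch.full((T, T), float('-inf'), device=idx.device).triu(1)
            x = self.blocks(x, mask=causal)  # each position attends only to its past
            return self.head(x)              # logits over the K codebook entries

    @torch.no_grad()
    def sample(model, batch=1, temperature=1.0):
        """Autoregressively draw a full grid of codebook indices."""
        idx = torch.zeros(batch, 1, dtype=torch.long)  # start token (an assumption)
        for _ in range(model.seq_len - 1):
            logits = model(idx)[:, -1] / temperature
            nxt = torch.multinomial(logits.softmax(-1), 1)
            idx = torch.cat([idx, nxt], dim=1)
        return idx.view(batch, 16, 16)       # reshape to the 16x16 latent grid

    model = TokenTransformer().eval()
    tokens = sample(model)                   # these indices would feed the decoder
    print(tokens.shape)                      # torch.Size([1, 16, 16])

Because each step predicts one index from the codebook rather than a pixel value, a 256x256 image at this compression rate costs only 256 sampling steps, which is what makes high resolutions tractable.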

About CompVis

CompVis is a research group focusing on computer vision and deep learning.

Industry: Artificial Intelligence
Company Size: N/A
Location: Heidelberg, DE
Website: compvis

Last updated: October 15, 2025