
PaliGemma

By Google
Model family: Gemma
PaliGemma pairs a compact Gemma language decoder with a high-quality vision encoder to natively “look and read.” It ingests one or more images alongside a prompt and produces grounded, step-by-step text responses—captions, answers, summaries, or structured outputs (Markdown/JSON). It’s instruction-tuned for practical tasks like document OCR and extraction, table/chart interpretation, form understanding, diagram reasoning, and screenshot/UX analysis.
Designed for real apps, PaliGemma is easy to adapt with LoRA or full fine-tuning, integrates cleanly into RAG and agent pipelines (e.g., crop → read → reason), and performs well on a single modern GPU with 8/4-bit quantization options for smaller footprints. Typical uses include enterprise document automation, analytics over dashboards, accessibility (image descriptions), and developer assistants that reason directly from screenshots—bringing reliable visual understanding to the Gemma ecosystem without heavy infrastructure.
Released: May 14, 2024

Overview

PaliGemma is Google’s open-weight vision-language model in the Gemma family. It takes images (or screenshots, documents, charts) plus text and answers in text—great for OCR, captioning, VQA, and UI/doc understanding. Lightweight and fine-tunable, it runs on a single GPU and supports quantization for edge deployment.
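As a concrete illustration of the image-plus-text interface described above, here is a minimal sketch of querying PaliGemma through the Hugging Face transformers library. The checkpoint name (`google/paligemma-3b-mix-224`) and the task-prefix strings (`caption en`, `answer en …`, `ocr`) are assumptions based on the publicly documented mix checkpoints; adapt them to your deployment.

```python
def build_prompt(task: str, question: str = "", lang: str = "en") -> str:
    """Compose a PaliGemma task prefix, e.g. 'answer en What is shown?'.

    Task prefixes here follow the publicly documented convention for the
    mix checkpoints; verify against the checkpoint card you actually use.
    """
    if task == "caption":
        return f"caption {lang}"
    if task == "answer":
        return f"answer {lang} {question}"
    return task  # e.g. "ocr" passes through unchanged


if __name__ == "__main__":
    # Heavy dependencies are imported lazily so the prompt helper above
    # can be reused without torch/transformers installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    MODEL_ID = "google/paligemma-3b-mix-224"  # assumed checkpoint name

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # One image plus a text prompt in, grounded text out.
    image = Image.open("screenshot.png")
    prompt = build_prompt("answer", "What does this chart show?")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(processor.decode(out[0], skip_special_tokens=True))
```

Running the `__main__` section requires accepting the model license on Hugging Face and a GPU with enough memory (or a quantized load) for the 3B checkpoint.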

About Google

At Google, we think that AI can meaningfully improve people's lives and that the biggest impact will come when everyone can access it.

Industry: Research
Company Size: 182,000-190,000
Location: Mountain View, CA, US
Website: ai.google

Tools using PaliGemma

Last updated: February 18, 2026