TAAFT

Qwen 2.5-VL-72B

By Alibaba
Category: Multimodal generation
Released: January 28, 2025

Overview

Qwen 2.5-VL-72B is Alibaba’s flagship open-weight vision-language model. It takes images (documents, charts, screenshots, photos) plus text and answers in text, with strong OCR, layout understanding, and multi-image reasoning. It supports long context, function/tool calling, and reliable JSON outputs, making it well suited to multimodal RAG, agents, and enterprise workflows.
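As a minimal sketch of the function/tool-calling support mentioned above, the snippet below builds a chat-completions payload for an OpenAI-compatible endpoint (such as a self-hosted vLLM server or a cloud API serving the model). The model id string and the `get_invoice_total` tool are illustrative assumptions, not part of any official API surface.

```python
def build_tool_call_request(question: str,
                            model: str = "qwen2.5-vl-72b-instruct") -> dict:
    """Build a chat-completions payload that lets the model call a tool.

    The tool schema follows the common OpenAI-style "tools" format that
    Qwen-serving runtimes typically accept; the tool itself is hypothetical.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_invoice_total",  # hypothetical example tool
                    "description": "Return the total amount on an invoice by id.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "invoice_id": {"type": "string"},
                        },
                        "required": ["invoice_id"],
                    },
                },
            }
        ],
        # Let the model decide whether to answer directly or call the tool.
        "tool_choice": "auto",
    }

request = build_tool_call_request("What is the total on invoice INV-1042?")
```

When the model decides to use the tool, the response contains a structured tool call (function name plus JSON arguments) instead of free text, which is what makes the "reliable JSON outputs" property useful for agents.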

Description

Qwen 2.5-VL-72B pairs a large 72B-parameter language model with a high-quality vision encoder so it can “look, read, and reason” in one pass. It handles everything from dense documents and tables to diagrams, dashboards, and natural images, keeping track of small text and layout while following detailed instructions. The model is instruction-tuned to produce grounded explanations and structured outputs, and it can reference specific regions when you ask it to point out where an answer comes from.

Long-context prompting lets it work across multi-page PDFs or image sequences, and native function calling makes it easy to plug into tool-using agents and retrieval pipelines. In practice, teams use it for document automation, chart and UI understanding, multimodal search and RAG, and developer assistants that reason directly from screenshots.

Open weights and common runtimes make deployment straightforward; quantization and multi-GPU parallelism help keep latency and cost in check without giving up the accuracy you want from a flagship VLM.
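To make the image-plus-text input described above concrete, here is a small sketch of building an OpenAI-style multimodal message, embedding the image as a base64 data URI alongside a text prompt. This is the common payload shape accepted by OpenAI-compatible runtimes that serve Qwen 2.5-VL; the helper name and prompt are assumptions for illustration.

```python
import base64

def image_message(image_bytes: bytes, prompt: str,
                  mime: str = "image/png") -> list:
    """Build a multimodal chat message: one inline image plus a text prompt.

    The image is embedded as a data URI so no external hosting is needed.
    """
    data_uri = "data:{};base64,{}".format(
        mime, base64.b64encode(image_bytes).decode("ascii")
    )
    return [
        {
            "role": "user",
            "content": [
                # Image part first, then the instruction referring to it.
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": prompt},
            ],
        }
    ]

# Usage: read a screenshot or scanned page and ask for structured output.
messages = image_message(b"\x89PNG...", "Extract every table on this page as JSON.")
```

For multi-page PDFs or image sequences, you would append one image part per page to the same `content` list before the text instruction, relying on the model’s long-context support.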

About Alibaba

Chinese e-commerce and cloud leader behind Taobao, Tmall, and Alipay.

Website: alibaba.com


Last updated: September 22, 2025