Baidu Models
- Qianfan-OCR (Multimodal; released 12 days ago): a 4B end-to-end document intelligence vision-language model that performs direct image-to-Markdown conversion and supports prompt-driven document tasks such as table extraction, chart understanding, document QA, and key information extraction.
- (Text; released 1 month ago): a production-ready OCR and document AI toolkit that turns images and PDFs into structured data, with multilingual OCR, layout analysis, and VLM-based document parsing.
- (Text; released 4 months ago): a multimodal MoE model that "looks, reads, and reasons" across images, video, and text. It adds tool use and a "Thinking with Images" mode, supports long context, and activates about 3B parameters per token for flagship-level VLM quality at practical latency.
- ERNIE 5 (Text; released 4 months ago): Baidu's next-generation general model for reasoning, coding, and multimodal understanding. It supports long context, tool and function calling, reliable JSON output, streaming, and enterprise guardrails, making it a strong default for RAG, agents, and document or chart analysis.
- PaddleOCR-VL (Multimodal; released 5 months ago): a vision-language model built around PaddleOCR that reads documents, forms, tables, charts, and screenshots. It combines strong OCR with reasoning over layout and content, then answers in text or structured JSON for multimodal RAG and automation.
- Qianfan-VL-3B (Text; released 6 months ago): Baidu's lightweight VLM for cost-sensitive, real-time multimodal apps. It processes images plus text and returns grounded answers with basic OCR and layout understanding, long context, tool/function calling, and JSON outputs, optimized for speed and efficiency.
- Qianfan-VL-8B (Text; released 6 months ago): Baidu's mid-size vision-language model. It reads images (docs, charts, screenshots, photos) alongside text and returns grounded answers with solid OCR, layout understanding, multi-image reasoning, long context, tool/function calling, and reliable JSON outputs, balanced for quality and latency.
- Qianfan-VL-70B (Text; released 6 months ago): Baidu's large vision-language model on the Qianfan platform. It ingests images (docs, charts, screenshots, photos) with text and produces grounded answers, featuring strong OCR and layout understanding, long context, tool/function calling, streaming, and reliable JSON outputs for multimodal RAG and enterprise apps.
- ERNIE 4.5 Turbo (Text; released 6 months ago): Baidu's high-throughput, cost-optimized variant of ERNIE 4.5. It delivers strong reasoning and coding with long-context options, tool/function calling, JSON outputs, and streaming, and is production-ready via ERNIE Bot and the Qianfan API.
- ERNIE X1.1 (Text; released 6 months ago): Baidu's upgraded "deep-thinking" reasoning model, unveiled on September 9, 2025 at Wave Summit. Versus ERNIE X1, it improves factuality (+34.8%), instruction following (+12.5%), and agentic skills (+9.6%). It is available in ERNIE Bot/Wenxiaoyan and via the Qianfan API.
- ERNIE 4.5-21B-A3B (Text; released 8 months ago): Baidu's efficient MoE variant of ERNIE 4.5, with about 21B total parameters and roughly 3B active per token, built to balance strong reasoning and coding accuracy with low latency. It supports long context, tool/function calling, structured JSON output, and streaming via ERNIE Bot and the Qianfan API.
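Several of the entries above (Qianfan-OCR, PaddleOCR-VL, the Qianfan-VL family) take an image and return Markdown or structured text. As a minimal sketch, assuming the Qianfan platform exposes an OpenAI-compatible chat-completions interface and that `qianfan-ocr` is a valid model identifier (both are assumptions, not confirmed by this page), an image-to-Markdown request payload could be assembled like this:

```python
import base64
import json

# Assumed endpoint -- a placeholder; check the Qianfan docs for the real URL.
QIANFAN_URL = "https://qianfan.baidubce.com/v2/chat/completions"

def build_ocr_request(image_bytes: bytes,
                      prompt: str = "Convert this document to Markdown.") -> dict:
    """Build an OpenAI-style multimodal chat payload: one text prompt plus one
    inline image encoded as a base64 data URL."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "qianfan-ocr",  # hypothetical model id, not taken from this page
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }

# Stand-in bytes; a real call would read an actual PNG/JPEG from disk.
payload = build_ocr_request(b"\x89PNG...")
print(json.dumps(payload)[:80])
```

Building the payload separately from sending it keeps the sketch testable offline; the same dict would then be POSTed to the endpoint with an API key in the `Authorization` header.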
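Most of the ERNIE entries advertise tool/function calling and reliable JSON output. A hedged sketch of what such a request could look like, again assuming an OpenAI-style schema; the model id `ernie-x1.1`, the tool name `extract_table`, and the `response_format` field are all illustrative assumptions, not details from this page:

```python
import json

def build_tool_call_request(question: str) -> dict:
    """Build an OpenAI-style chat payload that declares one callable tool and
    asks for JSON-only output."""
    return {
        "model": "ernie-x1.1",  # placeholder id; consult Qianfan docs for real names
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "extract_table",  # hypothetical tool
                    "description": "Extract a table from a document as JSON rows.",
                    "parameters": {
                        "type": "object",
                        "properties": {"page": {"type": "integer"}},
                        "required": ["page"],
                    },
                },
            }
        ],
        # 'Reliable JSON' mode, if the endpoint supports an OpenAI-style flag:
        "response_format": {"type": "json_object"},
    }

req = build_tool_call_request("Extract the table on page 2.")
print(req["tools"][0]["function"]["name"])  # prints extract_table
```

If the model decides to call the tool, the response would carry the function name and JSON-encoded arguments for the client to execute and feed back in a follow-up message.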
