OCR

Models 25

Gen 3 Mistral

Mistral OCR 4

By Mistral AI

Mistral OCR 4 extracts and structures content from PDF, DOC, PPT, and OpenDocument files, returning text alongside bounding boxes, typed block classification (titles, tables, equations, signatures), and inline confidence scores. Supports 170 languages across 10 language groups. Deployable via API or self-hosted in a single container for data-sovereignty compliance.

🔍Data extraction 📄Document processing 📜OCR

New Multimodal

Released 13d ago
Gen 3

PP OCRv6

By Baidu

PP-OCRv6 is PaddlePaddle/Baidu’s lightweight universal OCR system for multilingual text detection and recognition across edge, mobile, desktop, and server deployments.

📜OCR 🔍Data extraction 📄Document analysis

New Multimodal

Released 21d ago
Gen 3

LFM2.5 VL 1.6B Extract

By Liquid AI

LFM2.5-VL-1.6B-Extract is Liquid AI’s larger vision-language extraction model for image-to-JSON structured field extraction.

🔍Data extraction 🔍Image interpretation 📜OCR

New Multimodal

Released 28d ago
Gen 3

Zamba2 VL 7B

By Zyphra AI

Zamba2-VL-7B is Zyphra’s open 7B-class vision-language model for single-image and multi-image understanding, visual grounding, OCR, charts, documents, and on-device multimodal applications.

🔍Image interpretation 📄Document analysis 📜OCR

New Multimodal

Released 1mo ago
Gen 3

MiniCPM V4.6

By OpenBMB

MiniCPM-V-4.6 is OpenBMB’s open-source lightweight multimodal model for efficient image, multi-image, and video understanding on mobile and edge devices.

💬Chatting 🎥Video analysis 🔍Image interpretation 📜OCR

New Multimodal

Released 1mo ago
Gen 3

LFM2.5 VL 450M Extract

By Liquid AI

LFM2.5-VL-450M-Extract is Liquid AI’s 450M vision-language extraction model for image-to-JSON structured field extraction.

🔍Data extraction 🔍Image interpretation 📜OCR

New Multimodal

Released 2mo ago
Gen 4

LFM2.5 VL 450M

By Liquid AI

LFM2.5-VL-450M is Liquid AI’s compact vision-language model for structured visual intelligence from edge to cloud. It is built to turn image streams into grounded, actionable outputs in real time, adding object grounding, better instruction following, multilingual image understanding, and function calling support while staying efficient enough for edge hardware.

🔍Image interpretation 🔍Image recognition 📜OCR

New Image

Released 2mo ago
Gen 3

Chandra OCR 2

By Datalab

Chandra is an OCR model for difficult document extraction tasks. Its GitHub description says it handles complex tables, forms, and handwriting while preserving full layout structure, making it more document-understanding focused than plain text O

📜OCR 🔍Text extraction 📄Document analysis 🔢Mathematical formula transcription 🔍Handwriting analysis

Multimodal

Released 3mo ago
Gen 3

LiteParse

By LlamaIndex

LiteParse is an open-source document parser focused on fast, lightweight parsing of PDFs into structured outputs.

📄Document processing 📄Document data extraction 🔍Text extraction 📜OCR

Text

Released 3mo ago
Gen 3

Penguin VL 2B

By Tencent

Penguin-VL-2B is a compact vision-language model that uses an LLM-based vision encoder to push efficiency limits in multimodal reasoning.

🔍Image interpretation 📷Image text extraction 🔍Image and document analysis 📜OCR

Multimodal

Released 3mo ago
Gen 3

Qianfan-OCR

By Baidu

Qianfan-OCR is a 4B end-to-end document intelligence vision-language model that performs direct image-to-Markdown conversion and supports prompt-driven document tasks like table extraction, chart understanding, document QA, and key information extraction.

📜OCR 📄Document data extraction 📷Image text extraction 🖼️Image to markdown

Multimodal

Released 3mo ago
Gen 3

Penguin VL 8B

By Tencent

Penguin-VL-8B is a vision-language model that uses an LLM-based vision encoder to push efficiency limits in multimodal reasoning.

🔍Image interpretation 📷Image text extraction 🔍Image and document analysis 📜OCR

Multimodal

Released 3mo ago
Gen 7

LightOnOCR 1B

By LightOn

LightOnOCR-1B is a compact vision-language model for OCR that converts document images into clean text and is designed for fast, large-scale document processing.

📜OCR 📷Image text extraction 🔍Image and document analysis 📄PDF analysis

Text

Released 4mo ago
Gen 7 DeepSeek

DeepSeek OCR 2

By DeepSeek

Second-generation DeepSeek OCR model, “Visual Causal Flow,” aimed at more human-like visual encoding, with dynamic resolution support and strong document-to-Markdown and layout-aware OCR for images and PDFs.

📜OCR

Text

Released 5mo ago
Gen 7

NuMarkdown 8B Thinking

By NuMind

NuMarkdown-8B-Thinking is a reasoning OCR vision-language model fine-tuned from Qwen2.5-VL to convert complex document images into clean Markdown, using intermediate “thinking” tokens to infer layout and tables before generating the final text.

📜OCR 🖼️Image to text

Text

Released 6mo ago
Gen 7 LFM

LFM2 VL 3B

By Liquid AI

LFM2-VL-3B is a 3B vision-language model that reads images with text and answers in natural language or structured JSON. It handles OCR, charts, tables, and screenshots with long context and low-latency streaming, making it practical for multimodal RAG and assistants.

📜OCR 🖼️Image to text 🗒Transcription 🖼️Logos

Text

Released 8mo ago
Gen 7 DeepSeek

DeepSeek OCR

By DeepSeek

LLM-centric OCR model using “Contexts Optical Compression” to explore visual-text compression and provide fast streaming and batch OCR for images and PDFs via vLLM and Transformers runtimes.

📜OCR

Text

Released 8mo ago
Gen 3 Qianfan

Qianfan VL 70B

By Baidu

Qianfan-VL 70B is Baidu’s large vision-language model on the Qianfan platform. It ingests images (docs, charts, screenshots, photos) with text and produces grounded answers, featuring strong OCR and layout understanding, long context, tool/function calling, streaming, and reliable JSON outputs for multimodal RAG and enterprise apps.

📜OCR 🖼️3D image generation 🎬Video dubbing 🔍Image recognition

Text

Released 9mo ago
Gen 3

Moondream 3 Preview

By Moondream

Moondream 3 Preview is a compact frontier-oriented vision-language model built for fast visual reasoning, grounding, OCR, object detection, pointing, and structured output. It uses a 9B MoE architecture with 2B active parameters and extends context length to 32K, aiming to deliver strong real-world vision performance while staying efficient and inexpensive to run.

🔍Image interpretation 🔍Image recognition 🖼️Image segmentation 📜OCR

Multimodal

Released 9mo ago
Gen 3 Command

Command A Vision

By Cohere

Command A Vision is Cohere’s multimodal instruction model that pairs text and image understanding. It accepts images plus text prompts and outputs structured, step-by-step text answers. It’s tuned for enterprise workflows like document OCR, chart/diagram reasoning, screenshot/UI analysis, and tool or function calling.

📜OCR 🖼️Image to text 🔍Image recognition

Text

Released 11mo ago
Gen 3 Kanana

Kanana 1.5

By Kakao

Kanana-1.5-v-3B is a 3B-parameter vision–language model in Kakao’s Kanana line. It can process both images and text prompts, outputting grounded answers in natural language or structured JSON. It’s optimized for lightweight multimodal assistants and enterprise applications that need efficiency with visual reasoning.

📜OCR 🔍Image recognition

Text

Released 11mo ago
Gen 3

FastVLM

By Apple

FastVLM is Apple’s lightweight vision-language model built for real-time multimodal apps. It ingests images alongside text and returns grounded answers fast—OCR, charts/diagrams, screenshots, and general visual QA—while supporting long context, tool/function calling, and structured JSON outputs.

📜OCR 🔍SEO content 📞Customer support 📊Data analysis

Text

Released 11mo ago
Gen 3

Aya Vision

By Cohere

Aya Vision is the multimodal sibling of the Aya family. It processes images alongside text prompts and produces grounded text answers, designed for tasks like document OCR, chart/diagram analysis, UI/screenshot reasoning, and visual Q&A across multiple languages.

📜OCR 🎮Game creation 🖼️Logos 🔍Content optimization

Text

Released 1y ago
Gen 3 Mistral

Pixtral Large

By Mistral AI

Pixtral Large is Mistral’s flagship vision-language model. It takes images plus text and returns grounded, step-by-step answers—great for document OCR, charts/diagrams, UI screenshots, and general visual QA—with long-context support, tool/function calling, and reliable JSON outputs.

📜OCR 🖼️Image to text 🗒Transcription 🌐SEO

Text

Released 1y ago
Gen 7 Gemma

PaliGemma

By Google DeepMind

PaliGemma is Google’s open vision-language model that accepts images plus text and outputs text for captioning, visual question answering, OCR-style tasks, and detection.

📜OCR

Text

Released 2y ago

Discussion(2)

📜 OCR

We are a team of founders who worked before in data extraction, with traditional manual review of each document to check accuracy / errors. We are so excited about AI and the new LLM models because we have created DeepRead using those and the accuracy is 95% and we flag uncertain fields so no one has to manually review entire documents (whew), only the exceptions! We hope you try it out and please share your review, comments and feedback. Looking forward to hearing from users!


GNN Murthy

1y ago

Languages supported for handwriting transcription?
