Image interpretation

By Liquid AI

LFM2.5-VL-1.6B-Extract is Liquid AI’s larger vision-language extraction model for image-to-JSON structured field extraction.

🔍Data extraction 🔍Image interpretation 📜OCR

New Multimodal

Released 1mo ago

Gen 3

Zamba2 VL 7B

By Zyphra AI

Zamba2-VL-7B is Zyphra’s open 7B-class vision-language model for single-image and multi-image understanding, visual grounding, OCR, charts, documents, and on-device multimodal applications.

🔍Image interpretation 📄Document analysis 📜OCR

New Multimodal

Released 1mo ago

Gen 3 MiniMax

MiniMax M3

By MiniMax

MiniMax M3 is MiniMax’s open-weight multimodal model for agentic coding, tool use, long-context tasks, and native text-visual reasoning.

💬Chatting 🤖Agents 🔍Image interpretation 💻Conversational coding 💻Vibe coding

New Multimodal

Released 1mo ago

Gen 3

Qwen3.7-Plus

By Alibaba

Qwen3.7-Plus is Alibaba Qwen’s multimodal agent model that unifies vision and language for agentic vision-language workflows.

💬Chatting 🤖Agents 🔍Image interpretation 💻Conversational coding

New Multimodal

Released 1mo ago

Gen 3 Claude

Claude Opus 4.8

By Anthropic

Claude Opus 4.8 is Anthropic's new flagship model, released May 28, 2026. It improves on Opus 4.7 with stronger coding, more honest self-assessment, and a faster, cheaper fast mode, at the same standard pricing. New features include user-controlled effort levels and Dynamic Workflows for parallel subagents.

🚀Productivity 💵Financial analysis 💻Coding 🤖Agents 🔍Image interpretation

New Multimodal

Released 1mo ago

Gen 3 Command

Command A+ W4A4

By Cohere

Command A+ 05-2026 W4A4 is Cohere’s open-source quantized vision-language reasoning model for agentic, multilingual, tool-use, and enterprise tasks.

💬Chatting 🔍Image interpretation 💻Conversational coding

New Multimodal

Released 2mo ago

Gen 3

Lance

By ByteDance

Lance is ByteDance’s open-source 3B active-parameter unified multimodal model for image and video understanding, generation, and editing.

📷Images 🖌️Image editing 🎥Videos 🔍Image interpretation

New Multimodal

Released 2mo ago

Gen 3

MiniCPM V4.6

By OpenBMB

MiniCPM-V-4.6 is OpenBMB’s open-source lightweight multimodal model for efficient image, multi-image, and video understanding on mobile and edge devices.

💬Chatting 🎥Video analysis 🔍Image interpretation 📜OCR

New Multimodal

Released 2mo ago

Gen 3 GPT

GPT 5.5 Instant

By OpenAI

GPT-5.5 Instant is OpenAI’s updated default ChatGPT model for fast everyday use. It is built for clearer, more concise, and more personalized responses, with better factual accuracy, stronger image understanding, improved STEM performance, and better judgment about when to use web search.

💬Chatting 📊Data analysis 🔍Web search summaries 🔍Image interpretation

New Multimodal

Released 2mo ago

Gen 3

Uni 1.1

By Luma AI

Uni-1.1 API is Luma’s closed-source REST API for image generation and natural-language image editing using its Unified Intelligence model.

📷Images 🖌️Image editing 🔍Image interpretation

New Multimodal

Released 2mo ago

Gen 3 Nemotron

Nemotron 3 Nano Omni

By NVIDIA

Nemotron 3 Nano Omni is NVIDIA’s open multimodal reasoning model for agentic systems. It unifies text, image, video, and audio in a single efficient 30B-A3B hybrid MoE model, built to replace fragmented vision-language-audio stacks with one shared perception-and-context model for multimodal agents.

🔍Image interpretation 🎥Video analysis 🤖Agents 🗒Transcription 📄Document analysis 🤔Logical reasoning

New Multimodal

Released 2mo ago

Gen 3

Ornstein Hermes 3.6 27b MLX 8bit

By GestaltLabs

Ornstein-Hermes-3.6-27b-MLX-8bit is Gestalt Labs’ 8-bit MLX quantization of Ornstein-Hermes-3.6-27b, a Hermes-format function-calling fine-tune of Qwen 3.6 27B multimodal. It is optimized for Apple Silicon, supports image-text-to-text use, and targets agentic tool use with near-lossless 8-bit compression.

💬Chatting 🤖Agents 🔍Image interpretation

New Multimodal

Released 2mo ago

Gen 3 Mistral

Mistral Medium 3.5

By Mistral AI

Mistral Medium 3.5 is Mistral’s new flagship merged model for instruction following, reasoning, and coding. It is a dense 128B model with a 256K context window, built for long-horizon productivity and agentic work, with configurable reasoning effort and strong self-hosted efficiency.

💬Chatting 💻Coding 🤖Agents 🤔Logical reasoning 🔍Image interpretation

New Multimodal

Released 2mo ago

Gen 3

SenseNova U1 8B MoT

By SenseTime

SenseNova-U1-8B-MoT is SenseTime’s open native multimodal model for unified image understanding, reasoning, generation, and editing. It is built on the NEO-Unify architecture, uses an 8B dense MoT backbone, and supports text-to-image, image-to-text, image editing, and interleaved image-text generation in one model.

📷Images 🖌️Image editing 📊Infographics 🤖Agents 🔍Image interpretation

New Multimodal

Released 2mo ago

Gen 7

Carnice V2 27B

By Kai Stephens

Carnice-V2-27B is a BF16 supervised fine-tune of Qwen3.6-27B for Hermes-style agent traces. It is built for agentic conversational use, instruction following, and tool-oriented workflows, and is released as a fully merged standalone checkpoint rather than only a LoRA adapter.

🤖Agents 🐞Debugging 💻Coding 🔍Image interpretation

New Text

Released 2mo ago

Gen 7

LLaDA2.0 Uni

By AntGroup

LLaDA2.0-Uni is Inclusion AI’s unified multimodal diffusion MoE model for both image understanding and image generation. It is built on a dLLM backbone and supports text-to-image, image understanding, image editing, interleaved reasoning, and “thinking mode” image generation in one system.

📷Images 🖌️Image editing 🔍Image interpretation

Text

Released 3mo ago

Gen 3 Qwen

Qwen 3.6 27B

By Alibaba

Qwen3.6-27B is Qwen’s open-weight multimodal model for coding, agent workflows, long-context reasoning, and vision-language tasks. It combines a 27B causal language model with a vision encoder, supports image-text-to-text use, and offers a native 262,144-token context window extendable to about 1.01M tokens.

💬Chatting 💻Coding 🤖Agents 🔍Image interpretation 🎩Sophisticated reasoning 💻Vibe coding

Multimodal

Released 3mo ago

Gen 3 Kimi

Kimi K2.6

By Moonshot AI

Kimi-K2.6 is Moonshot AI’s open-source native multimodal agentic model, built for long-horizon coding, coding-driven design, proactive autonomous execution, and large-scale multi-agent orchestration. It uses a MoE architecture with 1T total parameters, 32B active parameters, a 256K context window, and a MoonViT vision encoder.

💬Chatting 💻Coding 🤖Agents 🔍Image interpretation 🎩Sophisticated reasoning 💻Vibe coding

Multimodal

Released 3mo ago

Gen 3 Claude

Claude Opus 4.7

By Anthropic

Claude Opus 4.7 is Anthropic’s latest generally available frontier model, tuned for advanced software engineering, long-running autonomous tasks, stronger instruction following, and better high-resolution vision. It is positioned as a clear upgrade over Opus 4.6, especially for difficult coding work, while keeping the same pricing.

💬Chatting 💵Financial analysis 💻Coding 🤖Agents 🔍Image interpretation

Multimodal

Released 3mo ago

Gen 3 Qwen

Qwen 3.6 35B A3B

By Alibaba

Qwen3.6-35B-A3B is Qwen’s open-weight multimodal MoE model for coding, agentic workflows, long-context reasoning, and vision-language tasks. It has 35B total parameters with 3B activated, supports image-text-to-text use, preserves reasoning context across turns, and natively handles 262,144 tokens with extension up to about 1.01M.

💻Coding 🤖Agents 🔍Image interpretation 🔢Math 💻Conversational coding

Multimodal

Released 3mo ago

Gen 3 LFM

LFM2.5 VL 450M Extract

By Liquid AI

LFM2.5-VL-450M-Extract is Liquid AI’s 450M vision-language extraction model for image-to-JSON structured field extraction.

🔍Data extraction 🔍Image interpretation 📜OCR

Multimodal

Released 3mo ago

Gen 4

LFM2.5 VL 450M

By Liquid AI

LFM2.5-VL-450M is Liquid AI’s compact vision-language model for structured visual intelligence from edge to cloud. It is built to turn image streams into grounded, actionable outputs in real time, adding object grounding, better instruction following, multilingual image understanding, and function calling support while staying efficient enough for edge hardware.

🔍Image interpretation 🔍Image recognition 📜OCR

Image

Released 3mo ago

Gen 7

Muse Spark

By Meta Platforms

Muse Spark is Meta Superintelligence Labs’ first model, built as a fast multimodal assistant for everyday use across Meta’s apps and devices. It currently powers the Meta AI app and website, with rollout planned for WhatsApp, Instagram, Facebook, Messenger, and AI glasses, and is positioned as Meta’s most powerful assistant model so far.

💬Chatting 🏥Health 🤖Agents 🤖AI research assistance 🔍Image interpretation

Text

Released 3mo ago

Gen 3

AURA

By Huawei

AURA is a real-time multimodal streaming system for continuous video understanding with speech interaction. It is built as an always-on assistant over live video streams and is released as an Apache 2.0 project built on top of Qwen3-VL-8B-Instruct.

🎥Video analysis 🔍Image interpretation

Multimodal

Released 3mo ago

Gen 3 Gemma

Gemma 4 31B IT NVFP4

By NVIDIA

Gemma-4-31B-IT-NVFP4 is NVIDIA’s inference-optimized NVFP4 quantized version of Gemma 4 31B IT. It is a commercial-ready multimodal model for text, image, and video understanding with text output, built for reasoning, coding, chat, and agentic workflows while preserving the original model’s long 256K context window.

🔄Language model optimization 🔍Image interpretation 🧠AI inference

Multimodal

Released 3mo ago

Gen 3 Gemma

Gemma 4 12B

By Google DeepMind

Gemma 4 is Google DeepMind’s open-weight model family built from Gemini 3 research, focused on high intelligence-per-parameter, agentic workflows, multimodal reasoning, multilingual use, coding, and efficient local deployment.

💬Chatting 🔎Problem solving 🤖Agents 🤔Logical reasoning 🔍Image interpretation 💻Conversational coding

Multimodal

Released 3mo ago

Gen 3 LongCat

LongCat Next

By Meituan

LongCat Next is a multimodal LongCat model focused on compact yet capable visual and speech understanding. The official intro highlights strong performance despite a 28x compression ratio, with particular strength in text rendering, speech comprehension, low-latency voice conversation, and customizable voice cloning.

🖼️Image generation 🗣️Voice cloning 🔍Image interpretation 🔊Audio

Multimodal

Released 3mo ago

Gen 3

Photon

By Moondream

Photon is Moondream’s real-time vision-language model aimed at production video and image analysis. It is designed to deliver VLM-style visual reasoning fast enough for live use cases such as manufacturing inspection, broadcast moderation, retail monitoring, and security feeds.

🔍Image interpretation 👓Visual assistance 🔍Object identification 👁️Computer vision assistance

Multimodal

Released 3mo ago

Gen 3

Alpamayo 1.5 10B

By NVIDIA

Alpamayo 1.5-10B is NVIDIA’s open 10B vision-language-action model for autonomous driving. It is built as a steerable reasoning engine for AV research, combining multi-camera visual input, text, and egomotion history to produce both chain-of-causation reasoning and future driving trajectories.

🚗Autonomous driving 🔍Advanced reasoning 🔍Image interpretation 👁️Computer vision assistance

Multimodal

Released 4mo ago

Gen 3 MiMo

Xiaomi MiMo V2 Omni

By Xiaomi

MiMo-V2-Omni is an omni foundation model that unifies multimodal understanding with agentic capability, built to see, hear, and act.

📚Large Language Models 🎥Video summaries 📚Audio summaries 🔍Image interpretation

Multimodal

Released 4mo ago

Gen 3 GPT

GPT 5.4 Nano

By OpenAI

GPT-5.4 nano is the smallest, lowest-cost GPT-5.4-family model, optimized for speed and high-throughput tasks.

🔍Data extraction 🎯Code autocompletion 🔍Image interpretation 🔍Data classification

Multimodal

Released 4mo ago

Gen 3 GPT

GPT 5.4 Mini

By OpenAI

GPT-5.4 mini is a fast, efficient GPT-5.4-family model optimized for high-volume coding and agent workloads, while keeping strong reasoning, multimodal understanding, and tool use.

💬Chatting 🎯Code autocompletion 🔍Image interpretation 💻Conversational coding

Multimodal

Released 4mo ago

Gen 3 Mistral

Mistral Small 4

By Mistral AI

Mistral Small 4 is an open hybrid model that unifies instruct, reasoning, and coding in a single multimodal model with a 256k context window.

💬Chatting 🔍Advanced reasoning 🔍Image interpretation 🔢Math 💻Conversational coding

Multimodal

Released 4mo ago

Gen 3

Penguin VL 2B

By Tencent

Penguin-VL-2B is a compact vision-language model that uses an LLM-based vision encoder to push efficiency limits in multimodal reasoning.

🔍Image interpretation 📷Image text extraction 🔍Image and document analysis 📜OCR

Multimodal

Released 4mo ago

Gen 3

Penguin VL 8B

By Tencent

Penguin-VL-8B is a compact vision-language model that uses an LLM-based vision encoder to push efficiency limits in multimodal reasoning.

🔍Image interpretation 📷Image text extraction 🔍Image and document analysis 📜OCR

Multimodal

Released 4mo ago

Gen 3 Qwen

Qwen 3.5 9B

By Alibaba

Qwen3.5-9B is a larger dense vision-language causal model with a vision encoder, targeting stronger capability for multimodal reasoning and agentic use.

💬Chatting 🌐Text translation 🔍Image interpretation 💻Conversational coding

Multimodal

Released 4mo ago

Gen 3 Qwen

Qwen 3.5 4B

By Alibaba

Qwen3.5-4B is a mid-size vision-language causal model with a vision encoder, designed for multimodal reasoning, coding, and agent workflows with very long context.

💬Chatting 🌐Text translation 🔍Image interpretation 💻Conversational coding

Multimodal

Released 4mo ago

Gen 3 Qwen

Qwen 3.5 2B

By Alibaba

Qwen3.5-2B is a small vision-language causal model with a vision encoder, aimed at strong multimodal capability with efficient compute.

💬Chatting 🌐Text translation 🔍Image interpretation 💻Conversational coding

Multimodal

Released 4mo ago

Gen 3 Qwen

Qwen 3.5 0.8B

By Alibaba

Qwen3.5-0.8B is a compact vision-language causal model with a vision encoder, built for multimodal understanding and agentic tool use at small scale.

💬Chatting 🌐Text translation 🔍Image interpretation 💻Conversational coding

Multimodal

Released 4mo ago

Gen 3 Mistral

Ministral 3 14B Instruct

By Mistral AI

Ministral 3 14B Instruct 2512 is Mistral’s largest Ministral 3 model, built as an efficient instruction model with vision capabilities. Mistral positions it as delivering frontier-level capability while staying compact enough for local or edge deployment.

💬Chatting 🌐Text translation 🔍Image interpretation 💻Conversational coding

Multimodal

Released 7mo ago

Gen 3 Moondream

Moondream 3 Preview

By Moondream

Moondream 3 Preview is a compact frontier-oriented vision-language model built for fast visual reasoning, grounding, OCR, object detection, pointing, and structured output. It uses a 9B MoE architecture with 2B active parameters and extends context length to 32K, aiming to deliver strong real-world vision performance while staying efficient and inexpensive to run.

🔍Image interpretation 🔍Image recognition 🖼️Image segmentation 📜OCR

Multimodal

Released 10mo ago

Gen 3

BitVLA

By ustcwhy

BitVLA is a 1-bit vision-language-action model for robotic manipulation designed to run efficiently on memory-constrained edge platforms.

🤖Robotics 📊Robotics data analysis 🔍Image interpretation 🚀Performance optimization

Multimodal

Released 1y ago

Gen 4 Moondream

Moondream 0.5B

By Moondream

Moondream 0.5B is a tiny open-source vision-language model built for edge devices and mobile platforms. With only 0.5B parameters, it is positioned as the world’s smallest VLM, designed for fast lightweight deployment on constrained hardware while still supporting practical real-world visual tasks.

🔍Image interpretation 🔍Image recognition

Image

Released 1y ago

Gen 3 Moondream

Moondream 1

By Moondream

Small, efficient open-source vision-language model designed to run broadly on many devices.

🔍Image interpretation 🖼️Image descriptions

Multimodal

Released 1y ago