🎲 Random tasks:
- 🎲 Storytelling game (72)
- 💬 Philosophical conversations (62)
- 🎮 Game strategies (50)
- 🗣️ English communication improvement (47)
- 🎮 Gaming coach (36)
- 🎨 Artistic guidance (35)
- 🗣 Conversational management (35)
- 🧘 Stoic advice (28)
- 🔍 Tech insights (26)
- 💡 Coding help (25)
- 💬 Conversation support (25)
- 🔧 Vehicle diagnosis (25)
- 🌱 Gardening (23)
- 🏋️ Workout planning (22)
- 🛠 DIY (21)
- 🌍 Immigration advice (21)
- ❓ Questions generation (21)
- 🎯 Strategic advice (21)
- 🎤 Speeches (20)
- 😱 Horror images (20)
Image recognition
taaft.com/image-recognition

There are 2 GPTs for Image recognition.
Specialized tools (2)
- Identify objects in images; simply upload a pic.
- Expert at identifying objects from images, providing insightful information.
Related tasks:
- Image analysis (45)
- Image organization (16)
- Image recreation (15)
- Image search (13)
- Facial recognition (4)
- Image querying (4)
- Image authenticity analysis (4)
- Face shape recognition (3)
- Image reinterpretation (2)
- Image recognition game (1)
- Intent recognition (1)
- Gesture recognition (1)
- Image interpretation (1)
- Image segmentation (1)
- Image data extraction (1)
Models (30)
- SAM 3.1 (Meta; Multimodal; released 3d ago): Meta's improved promptable segmentation model for images and video. It supports points, boxes, masks, text, and exemplar prompts, and is designed to segment and track objects more accurately than earlier SAM 3 releases, including open-vocabulary concepts across frames. (See the promptable-segmentation sketch after this list.)
- Wholembed v3 (Mixedbread; Multimodal; released 18d ago): Mixedbread's unified omnimodal, multilingual late-interaction retrieval model built for state-of-the-art search across languages and modalities.
- OmniScient Model (OSM) (ByteDance; Text; released 1mo ago): An open-ended visual recognition approach that predicts free-form class labels for visual entities without requiring a predefined vocabulary at test time.
- Moondream 2B (Moondream; Multimodal; released 1mo ago): A compact vision-language model variant designed for efficient image understanding and instruction following with reduced memory usage.
- Large Plant Model (LPM) (Carbon Robotics; Multimodal; released 1mo ago): A vision model for agriculture, trained on 150M labeled plants to recognize crops and weeds in many climates; it powers Carbon AI, LaserWeeder, and AutoTractor for precise autonomous weed control.
- VIGA (Fugtemypt123; Image; released 2mo ago): A vision-as-inverse-graphics agent that rebuilds a single image as an editable 3D Blender scene, alternating generator and verifier roles with interleaved multimodal reasoning to capture objects, layout, physics, and interactions.
- SHARP (Apple; Image; released 3mo ago): Apple's monocular view-synthesis model that regresses a 3D Gaussian scene from one photo in under a second on a standard GPU, enabling real-time, photorealistic nearby views with metric camera motion.
- Precision V2 (Image; released 5mo ago): Refines V1 with cleaner micro-texture and steadier small-text legibility at similar speed.
- HunyuanWorld Mirror (Tencent; Image; released 5mo ago): A scene-reconstruction and world-modeling system that turns photos and videos into a consistent digital twin you can explore, edit, and render, with export to common 3D formats for simulation, virtual production, and design.
- Crystal Upscaler (ClarityAI; Image; released 5mo ago): An image super-resolution and enhancement model that enlarges 2x to 8x while restoring detail, reducing noise, and fixing compression artifacts. It works on photos, renders, anime, and UI art with controllable sharpness and texture preservation.
- FlowRVS (xmz111; Multimodal; released 5mo ago): A referring video object segmentation method that learns a text-conditioned continuous flow to deform a video's spatiotemporal representation into the target object mask.
- (Google; Multimodal; released 5mo ago): Google-provided AI models for classifying wildlife species in camera-trap images.
- Pixel-Perfect Depth (Xiaomi; Image; released 5mo ago): A monocular depth estimation model that uses pixel-space diffusion transformers to predict high-quality, flying-pixel-free depth maps for dense point clouds; accepted at NeurIPS 2025.
- HunyuanImage 3.0 (Tencent; Image; released 6mo ago): Tencent's next-gen text-to-image model. It delivers sharper detail, stronger style and identity consistency, improved typography, and precise, in-place editing, built for fast iteration from concept to production-ready visuals.
- Qianfan-VL-3B (Baidu; Text; released 6mo ago): Baidu's lightweight VLM for cost-sensitive, real-time multimodal apps. It processes images plus text and returns grounded answers with basic OCR and layout understanding, long context, tool/function calling, and JSON outputs, optimized for speed and efficiency.
- Qianfan-VL 70B (Baidu; Text; released 6mo ago): Baidu's large vision-language model on the Qianfan platform. It ingests images (docs, charts, screenshots, photos) with text and produces grounded answers, featuring strong OCR and layout understanding, long context, tool/function calling, streaming, and reliable JSON outputs for multimodal RAG and enterprise apps.
- Command A Vision (Cohere; Text; released 7mo ago): Cohere's multimodal instruction model that pairs text and image understanding. It accepts images plus text prompts and outputs structured, step-by-step text answers, tuned for enterprise workflows like document OCR, chart/diagram reasoning, screenshot/UI analysis, and tool or function calling.
- Kanana-1.5-v-3B (Kakao; Text; released 8mo ago): A 3B-parameter vision-language model in Kakao's Kanana line. It processes both images and text prompts, outputting grounded answers in natural language or structured JSON; optimized for lightweight multimodal assistants and enterprise applications that need efficiency with visual reasoning.
- Earth-2 FourCastNet 3 (NVIDIA; Image; released 8mo ago): A geometric ML global ensemble model that respects spherical Earth geometry to deliver fast, probabilistic medium- to subseasonal forecasts, outperforming leading numerical ensembles at much lower cost.
- Hunyuan T1 (Tencent; Image; released 1y ago): Tencent's deep reasoning model, positioned for stronger structured reasoning and long-context analysis.
- Qwen 2.5-VL-72B (Alibaba; Text; released 1y ago): Alibaba's flagship open-weight vision-language model. It takes images (docs, charts, screenshots, photos) plus text and answers in text, with strong OCR, layout understanding, and multi-image reasoning. It supports long context, function/tool calling, and reliable JSON outputs; ideal for multimodal RAG, agents, and enterprise workflows.
- PaliGemma 2 (Google; Text; released 1y ago): Google's next-gen open-weight vision-language model in the Gemma family. It takes images (docs, charts, screenshots, photos) plus text and answers in text, with stronger OCR, grounded visual reasoning, multi-image understanding, and easy fine-tuning for real apps on a single GPU or edge devices.
- Zen (Image; released 1y ago): Aims for minimalist, calm compositions with natural lighting and restrained color.
- NV-CLIP (NVIDIA; Text; released 1y ago): NVIDIA's CLIP-style vision-language encoder that maps images and text into a shared embedding space for visual search, cross-modal retrieval, and zero-shot classification; optimized for NVIDIA GPUs and easy to deploy at scale. (See the zero-shot classification sketch after this list.)
- (Text; released 1y ago): Includes lightweight text models (1B, 3B for edge/mobile, 128k context) and vision models (11B, 9...
- OmniParser (Microsoft; Image; released 1y ago): A comprehensive method for parsing user-interface screenshots into structured, easy-to-understand elements, significantly enhancing GPT-4V's ability to generate actions accurately grounded in the corresponding regions of the interface.
- PaliGemma (Google; Text; released 1y ago): Google's open-weight vision-language model in the Gemma family. It takes images (or screenshots, documents, charts) plus text and answers in text; great for OCR, captioning, VQA, and UI/doc understanding. Lightweight and fine-tunable, it runs on a single GPU and supports quantization for edge deployment. (See the captioning sketch after this list.)
- Palmyra Vision (Writer; Text; released 2y ago): Writer's multimodal LLM that takes images as input and generates text output. It can extract text from images (including handwriting), interpret charts/graphs/diagrams, classify objects, and answer questions about visual content, all aimed at enterprise workflows.
- Lightning XL (Image; released 2y ago): A speed-optimized SDXL checkpoint that produces strong images in very few steps, ideal for rapid iteration.
- Ultralytics YOLO (Ultralytics; Image; released 3y ago): A family of real-time computer-vision models for detection, segmentation, classification, pose, and tracking, designed to be fast, accurate, and easy to deploy across edge and cloud. (See the detection sketch after this list.)
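
The SAM 3.1 entry describes point and box prompting, but its own interface isn't documented here. As a minimal sketch of the promptable-segmentation pattern under that caveat, the snippet below uses the original open-source segment-anything API; the checkpoint path and image file are placeholders, and SAM 3.1's actual API may differ.

```python
# Illustrative only: the original segment-anything API, standing in for
# the point/box prompting described in the SAM 3.1 entry above.
# Requires: pip install segment-anything opencv-python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Placeholder checkpoint path; SAM 3.1 weights and APIs may differ.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # one-time image embedding

# A single foreground point prompt at pixel (x, y); label 1 = foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks
)
best = masks[np.argmax(scores)]  # boolean HxW mask for the best candidate
print(best.shape, scores.max())
```

The image is embedded once, so additional point or box prompts against the same image are cheap; that is the core of the "promptable" workflow.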
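The NV-CLIP entry describes a shared image-text embedding space used for zero-shot classification. NV-CLIP's NVIDIA-hosted API isn't shown here, so as a stand-in, this sketch performs the same zero-shot pattern with OpenAI's public CLIP weights via Hugging Face transformers; the label strings and image path are placeholders.

```python
# Illustrative CLIP-style zero-shot classification. NV-CLIP itself is
# served through NVIDIA's stack; public OpenAI CLIP weights are used
# here to show the same shared-embedding pattern.
# Requires: pip install transformers pillow torch
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=Image.open("photo.jpg"),
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image
# embedding and each text embedding in the shared space.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

The same embeddings support retrieval: index image vectors and rank them by similarity to a text query instead of softmaxing over labels.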
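The PaliGemma entries describe image-plus-text prompting with text answers. A minimal captioning sketch using the first-generation weights via transformers follows; the image path is a placeholder, and the google/paligemma weights require accepting Google's license on Hugging Face before download.

```python
# Minimal PaliGemma captioning sketch via Hugging Face transformers.
# Requires: pip install transformers pillow torch (and license acceptance
# for the gated google/paligemma weights on Hugging Face).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# PaliGemma uses short task prefixes; "caption en" asks for an
# English caption of the image.
inputs = processor(text="caption en", images=Image.open("photo.jpg"),
                   return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```

Swapping the prefix (for example a VQA-style question) changes the task without changing the code, which is what makes the model easy to wire into doc/UI understanding pipelines.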
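Ultralytics ships a compact Python API for the YOLO family. A minimal detection example, assuming the small pretrained yolov8n checkpoint (auto-downloaded on first use) and a placeholder image path:

```python
# Minimal Ultralytics YOLO detection example.
# Requires: pip install ultralytics
from ultralytics import YOLO

model = YOLO("yolov8n.pt")    # small pretrained detection checkpoint
results = model("photo.jpg")  # run inference on one image

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]   # class label
        conf = float(box.conf)                 # confidence score
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates
        print(f"{cls_name} {conf:.2f} at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")
```

The same `YOLO(...)` entry point loads segmentation, classification, and pose checkpoints, so the deployment code stays identical across tasks.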