Vision Language Models Overview

zli12321 / Vision-Language-Models-Overview

A frontier collection and survey of vision-language model papers and model GitHub repositories. Continuously updated.

Benchmark and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models

Below we compile papers, models, and GitHub repositories covering:

  • State-of-the-Art VLMs: a collection of VLMs, newest to oldest (we will keep adding new models and benchmarks).
  • Evaluation: VLM benchmarks with links to the corresponding works.
  • Post-training/Alignment: the newest related work on VLM alignment, including RL and SFT.
  • Applications: uses of VLMs in embodied AI, robotics, etc.
  • Contributions: surveys, perspectives, and datasets on the above topics.

Welcome to contribute and discuss!


🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.



0. <a name='Citations'></a>Citation

@InProceedings{Li_2025_CVPR,
    author    = {Li, Zongxia and Wu, Xiyang and Du, Hongyang and Liu, Fuxiao and Nghiem, Huy and Shi, Guangyao},
    title     = {A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {1587-1606}
}

1. <a name='vlms'></a>📚 SoTA VLMs

| Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
| --- | --- | --- | --- | --- | --- | --- |
| ERNIE 5.0 (Baidu) | 02/05/2026 | Unified Model (Visual, Text, Audio) | Unified Modality Dataset | - | CNN–ViT (Understanding)/Next-Frame-and-Scale Prediction (Generation) | Unified Autoregressive Transformer |
| Gemini 3 | 11/18/2025 | Unified Model | Undisclosed | - | - | - |
| Emu3.5 | 10/30/2025 | Decoder-only | Unified Modality Dataset | - | SigLIP | Qwen3 |
| DeepSeek-OCR | 10/20/2025 | Encoder-Decoder | 70% OCR, 20% general vision, 10% text-only | 3B | DeepEncoder | DeepSeek-3B |
| Qwen3-VL | 10/11/2025 | Decoder-only | - | 8B/4B | ViT | Qwen3 |
| Qwen3-VL-MoE | 09/25/2025 | Decoder-only | - | 235B-A22B | ViT | Qwen3 |
| Qwen3-Omni (Visual/Audio/Text) | 09/21/2025 | - | Video/Audio/Image | 30B | ViT | Qwen3-Omni-MoE-Thinker |
| LLaVA-Onevision-1.5 | 09/15/2025 | - | Mid-Training-85M & SFT | 8B | Qwen2VLImageProcessor | Qwen3 |
| InternVL3.5 | 08/25/2025 | Decoder-only | multimodal & text-only | 30B/38B/241B | InternViT-300M/6B | Qwen3 / GPT-OSS |
| SkyWork-Unipic-1.5B | 07/29/2025 | - | image/video.. | - | - | - |
| Grok 4 | 07/09/2025 | - | image/video.. | 1-2 Trillion | - | - |
| Kwai Keye-VL (Kuaishou) | 07/02/2025 | Decoder-only | image/video.. | 8B | ViT | QWen-3-8B |
| OmniGen2 | 06/23/2025 | Decoder-only & VAE | LLaVA-OneVision/ SAM-LLaVA.. | - | ViT | QWen-2.5-VL |
| Gemini-2.5-Pro | 06/17/2025 | - | - | - | - | - |
| GPT-o3/o4-mini | 06/10/2025 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Mimo-VL (Xiaomi) | 06/04/2025 | Decoder-only | 24 Trillion MLLM tokens | 7B | Qwen2.5-ViT | Mimo-7B-base |
| BAGEL (Bytedance) | 05/20/2025 | Unified Model | Video/Image/Text | 7B | [SigLIP2-so400m/14](https://arxiv.org/abs/2502.14786) | Qwen2.5 |
| BLIP3-o | 05/14/2025 | Decoder-only | (BLIP3-o 60K) GPT-4o Generated Image Generation Data | 4/8B | ViT | QWen-2.5-VL |
| InternVL-3 | 04/14/2025 | Decoder-only | 200 Billion Tokens | 1/2/8/9/14/38/78B | ViT-300M/6B | InternLM2.5/QWen2.5 |
| LLaMA4-Scout/Maverick | 04/04/2025 | Decoder-only | 40/20 Trillion Tokens | 17B | MetaClip | LLaMA4 |
| Qwen2.5-Omni | 03/26/2025 | Decoder-only | Video/Audio/Image/Text | 7B | Qwen2-Audio/Qwen2.5-VL ViT | End-to-End Mini-Omni |
| QWen2.5-VL | 01/28/2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
| Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
| Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
| SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
| DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-Janus-Pro |
| Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
| DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
| xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
| TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
| Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
| LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
| Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
| Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
| NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
| Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
| Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | Eva CLIP ViT-g | QLLaMA |
| InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
| CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
| PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
| LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
| Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
| Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
| BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
| CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |

2. <a name='Dataset'></a>🗂️ Benchmarks and Evaluation

2.1. <a name='TrainingDatasetforVLM'></a> Datasets for Training VLMs

| Dataset | Task | Size |
| --- | --- | --- |
| MMFineReason (01/30/2026) | Reasoning | 1.8M |
| FineVision (09/04/2025) | Mixed Domain | 24.3M / 4.48TB |

2.2. <a name='DatasetforVLM'></a> Datasets and Evaluation for VLM

🧮 Visual Math (+ Visual Math Reasoning)

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| MathVision | Visual Math | MC / Answer Match | Human | 3.04 | Repo |
| MathVista | Visual Math | MC / Answer Match | Human | 6 | Repo |
| MathVerse | Visual Math | MC | Human | 4.6 | Repo |
| VisNumBench | Visual Number Reasoning | MC | Python program generated / Web collection / Real-life photos | 1.91 | Repo |

💬 Benchmark for Unified Models

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| RealUnify | Math, World knowledge, Image Gen | Direct & StepWise Eval (Sec 3.3) | Script & Human verification | 1.0 | Repo |
| Uni-MMMU | Science, Code, Image Gen | DreamSim (Image Gen Eval) & String Matching (Understanding Eval) | - | 1.0 | Repo |

🎞️ Video Understanding

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| VideoHallu | Video Understanding | LLM Eval | Human | 3.2 | Repo |
| Video SimpleQA | Video Understanding | LLM Eval | Human | 2.03 | Repo |
| MovieChat | Video Understanding | LLM Eval | Human | 1 | Repo |
| Perception-Test | Video Understanding | MC | Crowd | 11.6 | Repo |
| VideoMME | Video Understanding | MC | Experts | 2.7 | Site |
| EgoSchema | Video Understanding | MC | Synth / Human | 5 | Site |
| Inst-IT-Bench | Fine-grained Image & Video | MC & LLM | Human / Synth | 2 | Repo |

💬 Multimodal Conversation

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| VisionArena | Multimodal Conversation | Pairwise Pref | Human | 23 | Repo |

🧠 Multimodal General Intelligence

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| MMLU | General MM | MC | Human | 15.9 | Repo |
| MMStar | General MM | MC | Human | 1.5 | Site |
| NaturalBench | General MM | Yes/No, MC | Human | 10 | HF |
| PHYSBENCH | Visual Math Reasoning | MC | Grad STEM | 0.10 | Repo |

🔎 Visual Reasoning / VQA (+ Multilingual & OCR)

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| EMMA | Visual Reasoning | MC | Human + Synth | 2.8 | Repo |
| MMTBENCH | Visual Reasoning & QA | MC | AI Experts | 30.1 | Repo |
| MM-Vet | OCR / Visual Reasoning | LLM Eval | Human | 0.2 | Repo |
| MM-En/CN | Multilingual MM Understanding | MC | Human | 3.2 | Repo |
| GQA | Visual Reasoning & QA | Answer Match | Seed + Synth | 22 | Site |
| VCR | Visual Reasoning & QA | MC | MTurks | 290 | Site |
| VQAv2 | Visual Reasoning & QA | Yes/No, Ans Match | MTurks | 1100 | Repo |
| MMMU | Visual Reasoning & QA | Ans Match, MC | College | 11.5 | Site |
| MMMU-Pro | Visual Reasoning & QA | Ans Match, MC | College | 5.19 | Site |
| R1-Onevision | Visual Reasoning & QA | MC | Human | 155 | Repo |
| VLM²-Bench | Visual Reasoning & QA | Ans Match, MC | Human | 3 | Site |
| VisualWebInstruct | Visual Reasoning & QA | LLM Eval | Web | 0.9 | Site |

📝 Visual Text / Document Understanding (+ Charts)

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| TextVQA | Visual Text Understanding | Ans Match | Expert | 28.6 | Repo |
| DocVQA | Document VQA | Ans Match | Crowd | 50 | Site |
| ChartQA | Chart Graphic Understanding | Ans Match | Crowd / Synth | 32.7 | Repo |

🌄 Text-to-Image Generation

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| MSCOCO-30K | Text-to-Image | BLEU, ROUGE, Sim | MTurks | 30 | Site |
| GenAI-Bench | Text-to-Image | Human Rating | Human | 80 | HF |

🚨 Hallucination Detection / Control

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
| --- | --- | --- | --- | --- | --- |
| HallusionBench | Hallucination | Yes/No | Human | 1.13 | Repo |
| POPE | Hallucination | Yes/No | Human | 9 | Repo |
| CHAIR | Hallucination | Yes/No | Human | 124 | Repo |
| MHalDetect | Hallucination | Ans Match | Human | 4 | Repo |
| Hallu-Pi | Hallucination | Ans Match | Human | 1.26 | Repo |
| HallE-Control | Hallucination | Yes/No | Human | 108 | Repo |
| AutoHallusion | Hallucination | Ans Match | Synth | 3.129 | Repo |
| BEAF | Hallucination | Yes/No | Human | 26 | Site |
| GAIVE | Hallucination | Ans Match | Synth | 320 | Repo |
| HalEval | Hallucination | Yes/No | Crowd / Synth | 2 | Repo |
| AMBER | Hallucination | Ans Match | Human | 15.22 | Repo |

2.3. <a name='DatasetforEmbodiedVLM'></a> Benchmark Datasets, Simulators, and Generative Models for Embodied VLM

| Benchmark | Domain | Type | Project |
| --- | --- | --- | --- |
| Drive-Bench | Embodied AI | Autonomous Driving | Website |
| Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
| Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
| Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
| Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
| AI2THOR | Robotics (Navigation) | Simulator | Website, Github Repo |
| ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| VirtualHome | Robotics (Navigation) | Simulator | Website, Github Repo |
| ThreeDWorld | Robotics (Navigation) | Simulator | Website, Github Repo |
| VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
| CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
| GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| WebArena | Web Agent | Simulator | Website, Github Repo |
| UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
| GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
| LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
| Genesis | Embodied AI | Generative Model, World Model | Github Repo |
| EMMOE | Embodied AI | Generative Model, World Model | Paper |
| RoboGen | Embodied AI | Generative Model, World Model | Website |
| UnrealZoo | Embodied AI (Tracking, Navigation, Multi Agent) | Simulator | Website |

3. <a name='posttraining'></a>⚒️ Post-Training

3.1. <a name='alignment'></a>RL Alignment for VLM

| Title | Year | Paper | RL | Code |
| --- | --- | --- | --- | --- |
| Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning | 10/12/2025 | Paper | GRPO | - |
| Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | 09/29/2025 | Paper | GRPO | - |
| Vision-SR1: Self-rewarding vision-language model via reasoning decomposition | 08/26/2025 | Paper | GRPO | - |
| Group Sequence Policy Optimization | 06/24/2025 | Paper | GSPO | - |
| Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | 05/20/2025 | Paper | GRPO | - |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | 2025/04/10 | Paper | GRPO | Code |
| OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement | 2025/03/21 | Paper | GRPO | Code |
| Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | 2025/03/10 | Paper | GRPO | Code |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | 2025 | Paper | DPO | Code |
| Multimodal Open R1/R1-Multimodal-Journey | 2025 | - | GRPO | Code |
| R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | 2025 | Paper | GRPO | Code |
| Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning | 2025 | - | PPO/REINFORCE++/GRPO | Code |
| MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
| LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
| Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
| Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |
| All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning | 2025 | Paper | Online RL | - |
| Video-R1: Reinforcing Video Reasoning in MLLMs | 2025 | Paper | GRPO | Code |

3.2. <a name='sft'></a>Finetuning for VLM

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | 2025/04/21 | Paper | Website | Code |
| OMNICAPTIONER: One Captioner to Rule Them All | 2025/04/09 | Paper | Website | Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
| LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
| Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |
| VisionArena: 230K Real World User-VLM Conversations with Preference Labels | 2024 | Paper | - | Code |

3.3. <a name='vlm_github'></a>VLM Alignment github

| Project | Repository Link |
| --- | --- |
| Verl | 🔗 GitHub |
| EasyR1 | 🔗 GitHub |
| OpenR1 | 🔗 GitHub |
| LLaMAFactory | 🔗 GitHub |
| MM-Eureka-Zero | 🔗 GitHub |
| MM-RLHF | 🔗 GitHub |
| LMM-R1 | 🔗 GitHub |

3.4. <a name='vlm_prompt_engineering'></a>Prompt Optimization

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | 2025/04/30 | Paper | Website | Code |

4. <a name='Toolenhancement'></a> ⚒️ Applications

4.1 Embodied VLM Agents

| Title | Year | Paper Link |
| --- | --- | --- |
| Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | 📄 Paper |
| Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
| ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
| Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |
| VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method | 2025 | 📄 Paper |
| MP-GUI: Modality Perception with MLLMs for GUI Understanding | 2025 | 📄 Paper |

4.2. <a name='GenerativeVisualMediaApplications'></a>Generative Visual Media Applications

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | 📄 Paper | 🌐 Website | 💾 Code |
| Spurious Correlation in Multimodal LLMs | 2025 | 📄 Paper | - | - |
| WeGen: A Unified Model for Interactive Multimodal Generation as We Chat | 2025 | 📄 Paper | - | 💾 Code |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | 2025 | 📄 Paper | 🌐 Website | 💾 Code |

4.3. <a name='RoboticsandEmbodiedAI'></a>Robotics and Embodied AI

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | 📄 Paper | 🌐 Website | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌐 Website | - |
| Vision-language model-driven scene understanding and robotic object manipulation | 2024 | 📄 Paper | - | - |
| Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | 📄 Paper | 🌐 Website | - |
| AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | 📄 Paper | 🌐 Website | - |
| VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | 📄 Paper | - | - |
| Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | 📄 Paper | 🌐 Website | - |
| DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | 📄 Paper | 🌐 Website | - |
| MotionGPT: Human Motion as a Foreign Language | 2023 | 📄 Paper | - | 💾 Code |
| Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | 📄 Paper | - | - |
| Language to Rewards for Robotic Skill Synthesis | 2023 | 📄 Paper | 🌐 Website | - |
| Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | 📄 Paper | 🌐 Website | - |
| Integrated Task and Motion Planning | 2020 | 📄 Paper | - | - |
| Jailbreaking LLM-Controlled Robots | 2024 | 📄 Paper | 🌐 Website | - |
| Robots Enact Malignant Stereotypes | 2022 | 📄 Paper | 🌐 Website | - |
| LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | 📄 Paper | - | - |
| Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | 📄 Paper | 🌐 Website | - |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | 📄 Paper | 🌐 Website | 💾 Code & Dataset |
| Gemini Robotics: Bringing AI into the Physical World | 2025 | 📄 Technical Report | 🌐 Website | - |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | 📄 Paper | 🌐 Website | - |
| Magma: A Foundation Model for Multimodal AI Agents | 2025 | 📄 Paper | 🌐 Website | 💾 Code |
| DayDreamer: World Models for Physical Robot Learning | 2022 | 📄 Paper | 🌐 Website | 💾 Code |
| Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | 📄 Paper | - | - |
| RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | 📄 Paper | 🌐 Website | 💾 Code |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | 🌐 Website | 💾 Code |
| Unified Video Action Model | 2025 | 📄 Paper | 🌐 Website | 💾 Code |
| HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | 2025 | 📄 Paper | 🌐 Website | 💾 Code |

4.3.1. <a name='Manipulation'></a>Manipulation

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | 📄 Paper | 🌐 Website | |
| Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | 📄 Paper | - | - |
| Creative Robot Tool Use with Large Language Models | 2023 | 📄 Paper | 🌐 Website | - |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | 📄 Paper | - | - |
| RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | 📄 Paper | 🌐 Website | - |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | 📄 Paper | 🌐 Website | - |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | 📄 Paper | 🌐 Website | - |
| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | 📄 Paper | 🌐 Website | - |
| AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | 📄 Paper | 🌐 Website | 💾 Code |
| Masked World Models for Visual Control | 2022 | 📄 Paper | 🌐 Website | 💾 Code |
| Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | 📄 Paper | 🌐 Website | 💾 Code |

4.3.2. <a name='Navigation'></a>Navigation

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | 📄 Paper | - | - |
| LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | 📄 Paper | - | - |
| LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | 📄 Paper | 🌐 Website | - |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2022 | 📄 Paper | 🌐 Website | - |
| VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | 📄 Paper | - | - |
| Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | 📄 Paper | 🌐 Website | - |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | 📄 Paper | - | - |
| Navigation World Models | 2024 | 📄 Paper | 🌐 Website | - |

4.3.3. <a name='HumanRobotInteraction'></a>Human-robot Interaction

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | 📄 Paper | 🌐 Website | - |
| LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | 📄 Paper | 🌐 Website | - |
| VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | 📄 Paper | - | - |

4.3.4. <a name='AutonomousDriving'></a>Autonomous Driving

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | 01/07/2025 | 📄 Paper | 🌐 Website | - |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | - |
| GPT-Driver: Learning to Drive with GPT | 2023 | 📄 Paper | - | - |
| LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | 📄 Paper | 🌐 Website | - |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | 📄 Paper | - | - |
| Referring Multi-Object Tracking | 2023 | 📄 Paper | - | 💾 Code |
| VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | 📄 Paper | - | 💾 Code |
| MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | 📄 Paper | - | - |
| DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | 📄 Paper | 🌐 Website | - |
| VLP: Vision Language Planning for Autonomous Driving | 2024 | 📄 Paper | - | - |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | 📄 Paper | - | - |

4.4. <a name='Human-CenteredAI'></a>Human-Centered AI

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | 📄 Paper | - | 💾 Code |
| LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application | 2024 | 📄 Paper | - | - |
| Pretrained Language Models as Visual Planners for Human Assistance | 2023 | 📄 Paper | - | - |
| Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | 📄 Paper | - | - |
| Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | 📄 Paper | - | - |

4.4.1. <a name='WebAgent'></a>Web Agent

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | 📄 Paper | - | - |
| CogAgent: A Visual Language Model for GUI Agents | 2023 | 📄 Paper | - | 💾 Code |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | 📄 Paper | - | 💾 Code |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | 📄 Paper | - | 💾 Code |
| ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | 📄 Paper | - | 💾 Code |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | 💾 Code |

4.4.2. <a name='Accessibility'></a>Accessibility

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| X-World: Accessibility, Vision, and Autonomy Meet | 2021 | 📄 Paper | - | - |
| Context-Aware Image Descriptions for Web Accessibility | 2024 | 📄 Paper | - | - |
| Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models | 2024 | 📄 Paper | - | - |

4.4.3. <a name='Medical and Healthcare'></a>Healthcare

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning | 12/2025 | 📄 Paper | - | 💾 Code |
| Frontiers in Intelligent Colonoscopy | 02/2025 | 📄 Paper | - | 💾 Code |
| VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | 📄 Paper | - | 💾 Code |
| Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | 📄 Paper | - | - |
| M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | 📄 Paper | - | - |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | 📄 Paper | - | 💾 Code |
| Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | 📄 Paper | - | 💾 Code |

4.4.4. <a name='SocialGoodness'></a>Social Goodness

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | 📄 Paper | - | - |
| Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | 📄 Paper | - | - |
| Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | 📄 Paper | - | - |
| A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | 📄 Paper | - | - |
| Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | 📄 Paper | - | 💾 Code |
| DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images | 2024 | 📄 Paper | - | - |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | 📄 Paper | - | 💾 Code |
| He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | 📄 Paper | - | - |
| UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | 📄 Paper | - | - |

5. <a name='Challenges'></a>Challenges

5.1 <a name='Hallucination'></a>Hallucination

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Object Hallucination in Image Captioning | 2018 | 📄 Paper | - | - |
| Evaluating Object Hallucination in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | 📄 Paper | - | - |
| HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | 📄 Paper | - | 💾 Code |
| Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | 📄 Paper | - | 💾 Code |
| BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | - |
| HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | 📄 Paper | - | 💾 Code |
| Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | 📄 Paper | - | 💾 Code |
| AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | 📄 Paper | - | 💾 Code |

5.2 <a name='Safety'></a>Safety

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | 📄 Paper | 🌐 Website | - |
| Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | 📄 Paper | - | - |
| SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | 📄 Paper | - | - |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| Jailbreaking Attack against Multimodal Large Language Model | 2024 | 📄 Paper | - | - |
| Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | 📄 Paper | 🌐 Website | 💾 Code |
| Safety Guardrails for LLM-Enabled Robots | 2025 | 📄 Paper | - | - |

5.3 <a name='Fairness'></a>Fairness

| Title | Year | Paper | Website | Code |
| --- | --- | --- | --- | --- |
| Hallucination of Multimodal Large Language Models: A Survey | | | | |