Vision Language Models Overview

zli12321 / Vision-Language-Models-Overview

A frontier collection and survey of vision-language model papers, models, and GitHub repositories. Continuously updated.


blip2 claude clip deepseek finevision-pretrain-dataset gemini-pro gpt-4v llama-vision-model llava multimodal-benchmarks multimodal-models qwen-vl reinforcement-learning sota-model vision-language-model-applications vision-language-models world-models



Benchmark and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models


Below we compile awesome papers, models, and GitHub repositories that cover:

- State-of-the-Art VLMs: a collection of VLMs from newest to oldest (we'll keep adding new models and benchmarks).
- Evaluation: VLM benchmarks with links to the corresponding works.
- Post-training/Alignment: the newest related work on VLM alignment, including RL and SFT.
- Applications: applications of VLMs in embodied AI, robotics, etc.
- Contribute: surveys, perspectives, and datasets on the above topics.

Welcome to contribute and discuss!

🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.

0. 📄 Paper Link/⛑️ Citation
1. 📚 SoTA VLMs
2. 🗂️ Dataset and Evaluation
3. 🔥 Post-Training/Alignment/Prompt Engineering 🔥
- 3.1. RL Alignment for VLM
- 3.2. Regular finetuning (SFT)
- 3.3. VLM Alignment Github
- 3.4. Prompt Engineering
4. ⚒️ Applications
- 4.1. Embodied VLM agents
- 4.2. Generative Visual Media Applications
- 4.3. Robotics and Embodied AI
  - 4.3.1. Manipulation
  - 4.3.2. Navigation
  - 4.3.3. Human-robot Interaction
  - 4.3.4. Autonomous Driving
- 4.4. Human-Centered AI
  - 4.4.1. Web Agent
  - 4.4.2. Accessibility
  - 4.4.3. Medical and Healthcare
  - 4.4.4. Social Goodness
5. ⛑️ Challenges
- 5.1. Hallucination
- 5.2. Safety
- 5.3. Fairness
- 5.4. Alignment
  - 5.4.1. Multi-modality Alignment
  - 5.4.2. Commonsense and Physics Alignment
- 5.5. Efficient Training and Fine-Tuning
- 5.6. Scarcity of High-quality Datasets

0. <a name='Citations'></a>Citation

@InProceedings{Li_2025_CVPR,
    author    = {Li, Zongxia and Wu, Xiyang and Du, Hongyang and Liu, Fuxiao and Nghiem, Huy and Shi, Guangyao},
    title     = {A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {1587-1606}
}

1. <a name='vlms'></a>📚 SoTA VLMs

Model	Year	Architecture	Training Data	Parameters	Vision Encoder/Tokenizer	Pretrained Backbone Model
ERNIE 5.0 (Baidu)	02/05/2026	Unified Model (Visual, Text, Audio)	Unified Modality Dataset	-	CNN–ViT (Understanding)/Next-Frame-and-Scale Prediction (Generation)	Unified Autoregressive Transformer
Gemini 3	11/18/2025	Unified Model	Undisclosed	-	-	-
Emu3.5	10/30/2025	Decoder-only	Unified Modality Dataset	-	SigLIP	Qwen3
DeepSeek-OCR	10/20/2025	Encoder-Decoder	70% OCR, 20% general vision, 10% text-only	3B	DeepEncoder	DeepSeek-3B
Qwen3-VL	10/11/2025	Decoder-Only	-	8B/4B	ViT	Qwen3
Qwen3-VL-MoE	09/25/2025	Decoder-Only	-	235B-A22B	ViT	Qwen3
Qwen3-Omni (Visual/Audio/Text)	09/21/2025	-	Video/Audio/Image	30B	ViT	Qwen3-Omni-MoE-Thinker
LLaVA-Onevision-1.5	09/15/2025	-	Mid-Training-85M & SFT	8B	Qwen2VLImageProcessor	Qwen3
InternVL3.5	08/25/2025	Decoder-Only	multimodal & text-only	30B/38B/241B	InternViT-300M/6B	Qwen3 / GPT-OSS
SkyWork-Unipic-1.5B	07/29/2025	-	image/video..	-	-	-
Grok 4	07/09/2025	-	image/video..	1-2 Trillion	-	-
Kwai Keye-VL (Kuaishou)	07/02/2025	Decoder-only	image/video..	8B	ViT	Qwen3-8B
OmniGen2	06/23/2025	Decoder-only & VAE	LLaVA-OneVision/ SAM-LLaVA..	-	ViT	Qwen2.5-VL
Gemini-2.5-Pro	06/17/2025	-	-	-	-	-
GPT-o3/o4-mini	06/10/2025	Decoder-only	Undisclosed	Undisclosed	Undisclosed	Undisclosed
Mimo-VL (Xiaomi)	06/04/2025	Decoder-only	24 Trillion MLLM tokens	7B	Qwen2.5-ViT	Mimo-7B-base
BAGEL (Bytedance)	05/20/2025	Unified Model	Video/Image/Text	7B	SigLIP2-so400m/14 (https://arxiv.org/abs/2502.14786)	Qwen2.5
BLIP3-o	05/14/2025	Decoder-only	(BLIP3-o 60K) GPT-4o Generated Image Generation Data	4/8B	ViT	Qwen2.5-VL
InternVL-3	04/14/2025	Decoder-only	200 Billion Tokens	1/2/8/9/14/38/78B	ViT-300M/6B	InternLM2.5/Qwen2.5
LLaMA4-Scout/Maverick	04/04/2025	Decoder-only	40/20 Trillion Tokens	17B	MetaClip	LLaMA4
Qwen2.5-Omni	03/26/2025	Decoder-only	Video/Audio/Image/Text	7B	Qwen2-Audio/Qwen2.5-VL ViT	End-to-End Mini-Omni
Qwen2.5-VL	01/28/2025	Decoder-only	Image caption, VQA, grounding agent, long video	3B/7B/72B	Redesigned ViT	Qwen2.5
Ola	2025	Decoder-only	Image/Video/Audio/Text	7B	OryxViT	Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2)
Ocean-OCR	2025	Decoder-only	Pure Text, Caption, Interleaved, OCR	3B	NaViT	Pretrained from scratch
SmolVLM	2025	Decoder-only	SmolVLM-Instruct	250M & 500M	SigLIP	SmolLM
DeepSeek-Janus-Pro	2025	Decoder-only	Undisclosed	7B	SigLIP	DeepSeek-Janus-Pro
Inst-IT	2024	Decoder-only	Inst-IT Dataset, LLaVA-NeXT-Data	7B	CLIP/Vicuna, SigLIP/Qwen2	LLaVA-NeXT
DeepSeek-VL2	2024	Decoder-only	WiT, WikiHow	4.5B x 74	SigLIP/SAMB	DeepSeekMoE
xGen-MM (BLIP-3)	2024	Decoder-only	MINT-1T, OBELICS, Caption	4B	ViT + Perceiver Resampler	Phi-3-mini
TransFusion	2024	Encoder-decoder	Undisclosed	7B	VAE Encoder	Pretrained from scratch on transformer architecture
Baichuan Ocean Mini	2024	Decoder-only	Image/Video/Audio/Text	7B	CLIP ViT-L/14	Baichuan
LLaMA 3.2-vision	2024	Decoder-only	Undisclosed	11B-90B	CLIP	LLaMA-3.1
Pixtral	2024	Decoder-only	Undisclosed	12B	CLIP ViT-L/14	Mistral Large 2
Qwen2-VL	2024	Decoder-only	Undisclosed	7B-14B	EVA-CLIP ViT-L	Qwen-2
NVLM	2024	Encoder-decoder	LAION-115M	8B-24B	Custom ViT	Qwen-2-Instruct
Emu3	2024	Decoder-only	Aquila	7B	MoVQGAN	LLaMA-2
Claude 3	2024	Decoder-only	Undisclosed	Undisclosed	Undisclosed	Undisclosed
InternVL	2023	Encoder-decoder	LAION-en, LAION-multi	7B/20B	EVA-CLIP ViT-g	QLLaMA
InstructBLIP	2023	Encoder-decoder	COCO, VQAv2	13B	ViT	Flan-T5, Vicuna
CogVLM	2023	Encoder-decoder	LAION-2B, COYO-700M	18B	CLIP ViT-L/14	Vicuna
PaLM-E	2023	Decoder-only	All robots, WebLI	562B	ViT	PaLM
LLaVA-1.5	2023	Decoder-only	COCO	13B	CLIP ViT-L/14	Vicuna
Gemini	2023	Decoder-only	Undisclosed	Undisclosed	Undisclosed	Undisclosed
GPT-4V	2023	Decoder-only	Undisclosed	Undisclosed	Undisclosed	Undisclosed
BLIP-2	2023	Encoder-decoder	COCO, Visual Genome	7B-13B	ViT-g	Open Pretrained Transformer (OPT)
Flamingo	2022	Decoder-only	M3W, ALIGN	80B	Custom	Chinchilla
BLIP	2022	Encoder-decoder	COCO, Visual Genome	223M-400M	ViT-B/L/g	Pretrained from scratch
CLIP	2021	Encoder-decoder	400M image-text pairs	63M-355M	ViT/ResNet	Pretrained from scratch

2. <a name='Dataset'></a>🗂️ Benchmarks and Evaluation

2.1. <a name='TrainingDatasetforVLM'></a> Datasets for Training VLMs

Dataset	Task	Size
MMFineReason (01/30/2026)	Reasoning	1.8M
FineVision (09/04/2025)	Mixed Domain	24.3M / 4.48 TB

2.2. <a name='DatasetforVLM'></a> Datasets and Evaluation for VLM

🧮 Visual Math (+ Visual Math Reasoning)

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
MathVision	Visual Math	MC / Answer Match	Human	3.04	Repo
MathVista	Visual Math	MC / Answer Match	Human	6	Repo
MathVerse	Visual Math	MC	Human	4.6	Repo
VisNumBench	Visual Number Reasoning	MC	Python Program generated/Web Collection/Real life photos	1.91	Repo

💬 Benchmark for Unified Models

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
RealUnify	Math, World knowledge, Image Gen	Direct & StepWise Eval (Sec 3.3)	Script & Human verification	1.0	Repo
Uni-MMMU	Science, Code, Image Gen	DreamSim (Image Gen Eval) & String Matching (Understanding Eval)	-	1.0	Repo

🎞️ Video Understanding

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
VideoHallu	Video Understanding	LLM Eval	Human	3.2	Repo
Video SimpleQA	Video Understanding	LLM Eval	Human	2.03	Repo
MovieChat	Video Understanding	LLM Eval	Human	1	Repo
Perception‑Test	Video Understanding	MC	Crowd	11.6	Repo
VideoMME	Video Understanding	MC	Experts	2.7	Site
EgoSchema	Video Understanding	MC	Synth / Human	5	Site
Inst‑IT‑Bench	Fine‑grained Image & Video	MC & LLM	Human / Synth	2	Repo

💬 Multimodal Conversation

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
VisionArena	Multimodal Conversation	Pairwise Pref	Human	23	Repo

🧠 Multimodal General Intelligence

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
MMLU	General MM	MC	Human	15.9	Repo
MMStar	General MM	MC	Human	1.5	Site
NaturalBench	General MM	Yes/No, MC	Human	10	HF
PHYSBENCH	Visual Math Reasoning	MC	Grad STEM	0.10	Repo

🔎 Visual Reasoning / VQA (+ Multilingual & OCR)

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
EMMA	Visual Reasoning	MC	Human + Synth	2.8	Repo
MMTBENCH	Visual Reasoning & QA	MC	AI Experts	30.1	Repo
MM‑Vet	OCR / Visual Reasoning	LLM Eval	Human	0.2	Repo
MM‑En/CN	Multilingual MM Understanding	MC	Human	3.2	Repo
GQA	Visual Reasoning & QA	Answer Match	Seed + Synth	22	Site
VCR	Visual Reasoning & QA	MC	MTurks	290	Site
VQAv2	Visual Reasoning & QA	Yes/No, Ans Match	MTurks	1100	Repo
MMMU	Visual Reasoning & QA	Ans Match, MC	College	11.5	Site
MMMU-Pro	Visual Reasoning & QA	Ans Match, MC	College	5.19	Site
R1‑Onevision	Visual Reasoning & QA	MC	Human	155	Repo
VLM²‑Bench	Visual Reasoning & QA	Ans Match, MC	Human	3	Site
VisualWebInstruct	Visual Reasoning & QA	LLM Eval	Web	0.9	Site

📝 Visual Text / Document Understanding (+ Charts)

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
TextVQA	Visual Text Understanding	Ans Match	Expert	28.6	Repo
DocVQA	Document VQA	Ans Match	Crowd	50	Site
ChartQA	Chart Graphic Understanding	Ans Match	Crowd / Synth	32.7	Repo

🌄 Text‑to‑Image Generation

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
MSCOCO‑30K	Text‑to‑Image	BLEU, ROUGE, Sim	MTurks	30	Site
GenAI‑Bench	Text‑to‑Image	Human Rating	Human	80	HF

🚨 Hallucination Detection / Control

Dataset	Task	Eval Protocol	Annotators	Size (K)	Code / Site
HallusionBench	Hallucination	Yes/No	Human	1.13	Repo
POPE	Hallucination	Yes/No	Human	9	Repo
CHAIR	Hallucination	Yes/No	Human	124	Repo
MHalDetect	Hallucination	Ans Match	Human	4	Repo
Hallu‑Pi	Hallucination	Ans Match	Human	1.26	Repo
HallE‑Control	Hallucination	Yes/No	Human	108	Repo
AutoHallusion	Hallucination	Ans Match	Synth	3.129	Repo
BEAF	Hallucination	Yes/No	Human	26	Site
GAIVE	Hallucination	Ans Match	Synth	320	Repo
HalEval	Hallucination	Yes/No	Crowd / Synth	2	Repo
AMBER	Hallucination	Ans Match	Human	15.22	Repo

2.3. <a name='DatasetforEmbodiedVLM'></a> Benchmark Datasets, Simulators, and Generative Models for Embodied VLM

Benchmark	Domain	Type	Project
Drive-Bench	Embodied AI	Autonomous Driving	Website
Habitat, Habitat 2.0, Habitat 3.0	Robotics (Navigation)	Simulator + Dataset	Website
Gibson	Robotics (Navigation)	Simulator + Dataset	Website, Github Repo
iGibson1.0, iGibson2.0	Robotics (Navigation)	Simulator + Dataset	Website, Document
Isaac Gym	Robotics (Navigation)	Simulator	Website, Github Repo
Isaac Lab	Robotics (Navigation)	Simulator	Website, Github Repo
AI2THOR	Robotics (Navigation)	Simulator	Website, Github Repo
ProcTHOR	Robotics (Navigation)	Simulator + Dataset	Website, Github Repo
VirtualHome	Robotics (Navigation)	Simulator	Website, Github Repo
ThreeDWorld	Robotics (Navigation)	Simulator	Website, Github Repo
VIMA-Bench	Robotics (Manipulation)	Simulator	Website, Github Repo
VLMbench	Robotics (Manipulation)	Simulator	Github Repo
CALVIN	Robotics (Manipulation)	Simulator	Website, Github Repo
GemBench	Robotics (Manipulation)	Simulator	Website, Github Repo
WebArena	Web Agent	Simulator	Website, Github Repo
UniSim	Robotics (Manipulation)	Generative Model, World Model	Website
GAIA-1	Robotics (Autonomous Driving)	Generative Model, World Model	Website
LWM	Embodied AI	Generative Model, World Model	Website, Github Repo
Genesis	Embodied AI	Generative Model, World Model	Github Repo
EMMOE	Embodied AI	Generative Model, World Model	Paper
RoboGen	Embodied AI	Generative Model, World Model	Website
UnrealZoo	Embodied AI (Tracking, Navigation, Multi Agent)	Simulator	Website

3. <a name='posttraining'></a>⚒️ Post-Training

3.1. <a name='alignment'></a>RL Alignment for VLM

Title	Year	Paper	RL	Code
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning	10/12/2025	Paper	GRPO	-
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play	09/29/2025	Paper	GRPO	-
Vision-SR1: Self-rewarding vision-language model via reasoning decomposition	08/26/2025	Paper	GRPO	-
Group Sequence Policy Optimization	06/24/2025	Paper	GSPO	-
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning	05/20/2025	Paper	GRPO	-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning	04/10/2025	Paper	GRPO	Code
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement	03/21/2025	Paper	GRPO	Code
Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning	03/10/2025	Paper	GRPO	Code
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference	2025	Paper	DPO	Code
Multimodal Open R1/R1-Multimodal-Journey	2025	-	GRPO	Code
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization	2025	Paper	GRPO	Code
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning	2025	-	PPO/REINFORCE++/GRPO	Code
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning	2025	Paper	REINFORCE Leave-One-Out (RLOO)	Code
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment	2025	Paper	DPO	Code
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL	2025	Paper	PPO	Code
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models	2025	Paper	GRPO	Code
Unified Reward Model for Multimodal Understanding and Generation	2025	Paper	DPO	Code
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step	2025	Paper	DPO	Code
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning	2025	Paper	Online RL	-
Video-R1: Reinforcing Video Reasoning in MLLMs	2025	Paper	GRPO	Code

3.2. <a name='sft'></a>Finetuning for VLM

Title	Year	Paper	Website	Code
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models	04/21/2025	Paper	Website	Code
OMNICAPTIONER: One Captioner to Rule Them All	04/09/2025	Paper	Website	Code
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning	2024	Paper	Website	Code
LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression	2024	Paper	Website	Code
ViTamin: Designing Scalable Vision Models in the Vision-Language Era	2024	Paper	Website	Code
Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model	2024	Paper	-	-
Should VLMs be Pre-trained with Image Data?	2025	Paper	-	-
VisionArena: 230K Real World User-VLM Conversations with Preference Labels	2024	Paper	-	Code

3.3. <a name='vlm_github'></a>VLM Alignment GitHub

Project	Repository Link
Verl	🔗 GitHub
EasyR1	🔗 GitHub
OpenR1	🔗 GitHub
LLaMAFactory	🔗 GitHub
MM-Eureka-Zero	🔗 GitHub
MM-RLHF	🔗 GitHub
LMM-R1	🔗 GitHub

3.4. <a name='vlm_prompt_engineering'></a>Prompt Optimization

Title	Year	Paper	Website	Code
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer	04/30/2025	Paper	Website	Code

4. <a name='Toolenhancement'></a> ⚒️ Applications

4.1 Embodied VLM Agents

Title	Year	Paper Link
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI	2024	Paper
ScreenAI: A Vision-Language Model for UI and Infographics Understanding	2024	Paper
ChartLlama: A Multimodal LLM for Chart Understanding and Generation	2023	Paper
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement	2024	📄 Paper
Training a Vision Language Model as Smartphone Assistant	2024	Paper
ScreenAgent: A Vision-Language Model-Driven Computer Control Agent	2024	Paper
Embodied Vision-Language Programmer from Environmental Feedback	2024	Paper
VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method	2025	📄 Paper
MP-GUI: Modality Perception with MLLMs for GUI Understanding	2025	📄 Paper

4.2. <a name='GenerativeVisualMediaApplications'></a>Generative Visual Media Applications

Title	Year	Paper	Website	Code
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning	2023	📄 Paper	🌍 Website	💾 Code
Spurious Correlation in Multimodal LLMs	2025	📄 Paper	-	-
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat	2025	📄 Paper	-	💾 Code
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning	2025	📄 Paper	🌍 Website	💾 Code

4.3. <a name='RoboticsandEmbodiedAI'></a>Robotics and Embodied AI

Title	Year	Paper	Website	Code
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation	2024	📄 Paper	🌍 Website	-
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities	2024	📄 Paper	🌍 Website	-
Vision-language model-driven scene understanding and robotic object manipulation	2024	📄 Paper	-	-
Guiding Long-Horizon Task and Motion Planning with Vision Language Models	2024	📄 Paper	🌍 Website	-
AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers	2023	📄 Paper	🌍 Website	-
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model	2024	📄 Paper	-	-
Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?	2023	📄 Paper	🌍 Website	-
DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models	2024	📄 Paper	🌍 Website	-
MotionGPT: Human Motion as a Foreign Language	2023	📄 Paper	-	💾 Code
Learning Reward for Robot Skills Using Large Language Models via Self-Alignment	2024	📄 Paper	-	-
Language to Rewards for Robotic Skill Synthesis	2023	📄 Paper	🌍 Website	-
Eureka: Human-Level Reward Design via Coding Large Language Models	2023	📄 Paper	🌍 Website	-
Integrated Task and Motion Planning	2020	📄 Paper	-	-
Jailbreaking LLM-Controlled Robots	2024	📄 Paper	🌍 Website	-
Robots Enact Malignant Stereotypes	2022	📄 Paper	🌍 Website	-
LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions	2024	📄 Paper	-	-
Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics	2024	📄 Paper	🌍 Website	-
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents	2025	📄 Paper	🌍 Website	💾 Code & Dataset
Gemini Robotics: Bringing AI into the Physical World	2025	📄 Technical Report	🌍 Website	-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation	2024	📄 Paper	🌍 Website	-
Magma: A Foundation Model for Multimodal AI Agents	2025	📄 Paper	🌍 Website	💾 Code
DayDreamer: World Models for Physical Robot Learning	2022	📄 Paper	🌍 Website	💾 Code
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models	2025	📄 Paper	-	-
RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback	2024	📄 Paper	🌍 Website	💾 Code
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data	2024	📄 Paper	🌍 Website	💾 Code
Unified Video Action Model	2025	📄 Paper	🌍 Website	💾 Code
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model	2025	📄 Paper	🌍 Website	💾 Code

4.3.1. <a name='Manipulation'></a>Manipulation

Title	Year	Paper	Website	Code
VIMA: General Robot Manipulation with Multimodal Prompts	2022	📄 Paper	🌍 Website
Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model	2023	📄 Paper	-	-
Creative Robot Tool Use with Large Language Models	2023	📄 Paper	🌍 Website	-
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics	2024	📄 Paper	-	-
RT-1: Robotics Transformer for Real-World Control at Scale	2022	📄 Paper	🌍 Website	-
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	2023	📄 Paper	🌍 Website	-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models	2023	📄 Paper	🌍 Website	-
ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models	2024	📄 Paper	🌍 Website	-
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors	2025	📄 Paper	🌍 Website	💾 Code
Masked World Models for Visual Control	2022	📄 Paper	🌍 Website	💾 Code
Multi-View Masked World Models for Visual Robotic Manipulation	2023	📄 Paper	🌍 Website	💾 Code

4.3.2. <a name='Navigation'></a>Navigation

Title	Year	Paper	Website	Code
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings	2022	📄 Paper	-	-
LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation	2024	📄 Paper	-	-
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action	2022	📄 Paper	🌍 Website	-
NaVILA: Legged Robot Vision-Language-Action Model for Navigation	2024	📄 Paper	🌍 Website	-
VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation	2024	📄 Paper	-	-
Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning	2023	📄 Paper	🌍 Website	-
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments	2025	📄 Paper	-	-
Navigation World Models	2024	📄 Paper	🌍 Website	-

4.3.3. <a name='HumanRobotInteraction'></a>Human-robot Interaction

Title	Year	Paper	Website	Code
MUTEX: Learning Unified Policies from Multimodal Task Specifications	2023	📄 Paper	🌍 Website	-
LaMI: Large Language Models for Multi-Modal Human-Robot Interaction	2024	📄 Paper	🌍 Website	-
VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models	2024	📄 Paper	-	-

4.3.4. <a name='AutonomousDriving'></a>Autonomous Driving

Title	Year	Paper	Website	Code
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives	01/07/2025	📄 Paper	🌍 Website	-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models	2024	📄 Paper	🌍 Website	-
GPT-Driver: Learning to Drive with GPT	2023	📄 Paper	-	-
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving	2023	📄 Paper	🌍 Website	-
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving	2023	📄 Paper	-	-
Referring Multi-Object Tracking	2023	📄 Paper	-	💾 Code
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision	2023	📄 Paper	-	💾 Code
MotionLM: Multi-Agent Motion Forecasting as Language Modeling	2023	📄 Paper	-	-
DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models	2023	📄 Paper	🌍 Website	-
VLP: Vision Language Planning for Autonomous Driving	2024	📄 Paper	-	-
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model	2023	📄 Paper	-	-

4.4. <a name='Human-CenteredAI'></a>Human-Centered AI

Title	Year	Paper	Website	Code
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis	2024	📄 Paper	-	💾 Code
LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application	2024	📄 Paper	-	-
Pretrained Language Models as Visual Planners for Human Assistance	2023	📄 Paper	-	-
Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research	2024	📄 Paper	-	-
Image and Data Mining in Reticular Chemistry Using GPT-4V	2023	📄 Paper	-	-

4.4.1. <a name='WebAgent'></a>Web Agent

Title	Year	Paper	Website	Code
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis	2023	📄 Paper	-	-
CogAgent: A Visual Language Model for GUI Agents	2023	📄 Paper	-	💾 Code
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models	2024	📄 Paper	-	💾 Code
ShowUI: One Vision-Language-Action Model for GUI Visual Agent	2024	📄 Paper	-	💾 Code
ScreenAgent: A Vision Language Model-driven Computer Control Agent	2024	📄 Paper	-	💾 Code
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation	2024	📄 Paper	-	💾 Code

4.4.2. <a name='Accessibility'></a>Accessibility

Title	Year	Paper	Website	Code
X-World: Accessibility, Vision, and Autonomy Meet	2021	📄 Paper	-	-
Context-Aware Image Descriptions for Web Accessibility	2024	📄 Paper	-	-
Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models	2024	📄 Paper	-	-

4.4.3. <a name='Medical and Healthcare'></a>Healthcare

Title	Year	Paper	Website	Code
Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning	12/2025	📄 Paper	-	💾 Code
Frontiers in Intelligent Colonoscopy	02/2025	📄 Paper	-	💾 Code
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge	2024	📄 Paper	-	💾 Code
Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology	2024	📄 Paper	-	-
M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization	2023	📄 Paper	-	-
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text	2022	📄 Paper	-	💾 Code
Med-Flamingo: A Multimodal Medical Few-Shot Learner	2023	📄 Paper	-	💾 Code

4.4.4. <a name='SocialGoodness'></a>Social Goodness

Title	Year	Paper	Website	Code
Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy	2024	📄 Paper	-	-
Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence	2024	📄 Paper	-	-
Harnessing Large Vision and Language Models in Agriculture: A Review	2024	📄 Paper	-	-
A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping	2024	📄 Paper	-	-
Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models	2024	📄 Paper	-	💾 Code
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images	2024	📄 Paper	-	-
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models	2024	📄 Paper	-	💾 Code
Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps	2024	📄 Paper	-	💾 Code
He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation	2021	📄 Paper	-	-
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling	2024	📄 Paper	-	-

5. <a name='Challenges'></a>Challenges

5.1 <a name='Hallucination'></a>Hallucination

Title	Year	Paper	Website	Code
Object Hallucination in Image Captioning	2018	📄 Paper	-	-
Evaluating Object Hallucination in Large Vision-Language Models	2023	📄 Paper	-	💾 Code
Detecting and Preventing Hallucinations in Large Vision Language Models	2023	📄 Paper	-	-
HallE-Control: Controlling Object Hallucination in Large Multimodal Models	2023	📄 Paper	-	💾 Code
Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs	2024	📄 Paper	-	💾 Code
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models	2024	📄 Paper	🌍 Website	-
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models	2023	📄 Paper	-	💾 Code
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models	2024	📄 Paper	🌍 Website	-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning	2023	📄 Paper	-	💾 Code
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models	2024	📄 Paper	-	💾 Code
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation	2023	📄 Paper	-	💾 Code

5.2 <a name='Safety'></a>Safety

Title	Year	Paper	Website	Code
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models	2024	📄 Paper	🌍 Website	-
Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments	2023	📄 Paper	-	-
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models	2024	📄 Paper	-	-
JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks	2024	📄 Paper	-	-
SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models	2024	📄 Paper	-	💾 Code
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models	2024	📄 Paper	-	-
Jailbreaking Attack against Multimodal Large Language Model	2024	📄 Paper	-	-
Embodied Red Teaming for Auditing Robotic Foundation Models	2025	📄 Paper	🌍 Website	💾 Code
Safety Guardrails for LLM-Enabled Robots	2025	📄 Paper	-	-

5.3 <a name='Fairness'></a>Fairness

Title	Year	Paper	Website	Code
Hallucination of Multimodal Large Language Models: A Survey
