Vision Language Models Overview
A collection and survey of frontier vision-language model papers, models, and GitHub repositories, updated continuously.
Benchmark and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models
Below we compile papers, models, and GitHub repositories that cover:
- State-of-the-Art VLMs: a collection of VLMs ordered from newest to oldest (we'll keep adding new models and benchmarks).
- Evaluate: VLM benchmarks and links to the corresponding works.
- Post-training/Alignment: the newest related work on VLM alignment, including RL and SFT.
- Applications: applications of VLMs in embodied AI, robotics, etc.
- Contribute: surveys, perspectives, and datasets on the above topics.
Contributions and discussion are welcome!
🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
Table of Contents
- 3. Post-Training
  - 3.1. RL Alignment for VLM
  - 3.2. Regular Finetuning (SFT)
  - 3.3. VLM Alignment GitHub
  - 3.4. Prompt Engineering
- 4. Applications
  - 4.1. Embodied VLM Agents
  - 4.2. Generative Visual Media Applications
  - 4.3. Robotics and Embodied AI
    - 4.3.1. Manipulation
    - 4.3.2. Navigation
    - 4.3.3. Human-robot Interaction
    - 4.3.4. Autonomous Driving
  - 4.4. Human-Centered AI
    - 4.4.1. Web Agent
    - 4.4.2. Accessibility
    - 4.4.3. Medical and Healthcare
    - 4.4.4. Social Goodness
- 5. Challenges
  - 5.1. Hallucination
  - 5.2. Safety
  - 5.3. Fairness
  - 5.4. Alignment
    - 5.4.1. Multi-modality Alignment
  - 5.5. Efficient Training and Fine-Tuning
  - 5.6. Scarcity of High-Quality Datasets
0. <a name='Citations'></a>Citation
@InProceedings{Li_2025_CVPR,
author = {Li, Zongxia and Wu, Xiyang and Du, Hongyang and Liu, Fuxiao and Nghiem, Huy and Shi, Guangyao},
title = {A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2025},
pages = {1587-1606}
}
1. <a name='vlms'></a>SoTA VLMs
| Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
|---|---|---|---|---|---|---|
| ERNIE 5.0 (Baidu) | 02/05/2026 | Unified Model (Visual, Text, Audio) | Unified Modality Dataset | - | CNN-ViT (Understanding)/Next-Frame-and-Scale Prediction (Generation) | Unified Autoregressive Transformer |
| Gemini 3 | 11/18/2025 | Unified Model | Undisclosed | - | - | - |
| Emu3.5 | 10/30/2025 | Decoder-only | Unified Modality Dataset | - | SigLIP | Qwen3 |
| DeepSeek-OCR | 10/20/2025 | Encoder-Decoder | 70% OCR, 20% general vision, 10% text-only | 3B | DeepEncoder | DeepSeek-3B |
| Qwen3-VL | 10/11/2025 | Decoder-Only | - | 8B/4B | ViT | Qwen3 |
| Qwen3-VL-MoE | 09/25/2025 | Decoder-Only | - | 235B-A22B | ViT | Qwen3 |
| Qwen3-Omni (Visual/Audio/Text) | 09/21/2025 | - | Video/Audio/Image | 30B | ViT | Qwen3-Omni-MoE-Thinker |
| LLaVA-OneVision-1.5 | 09/15/2025 | - | Mid-Training-85M & SFT | 8B | Qwen2VLImageProcessor | Qwen3 |
| InternVL3.5 | 08/25/2025 | Decoder-Only | multimodal & text-only | 30B/38B/241B | InternViT-300M/6B | Qwen3 / GPT-OSS |
| SkyWork-Unipic-1.5B | 07/29/2025 | - | image/video.. | - | - | - |
| Grok 4 | 07/09/2025 | - | image/video.. | 1-2 Trillion | - | - |
| Kwai Keye-VL (Kuaishou) | 07/02/2025 | Decoder-only | image/video.. | 8B | ViT | Qwen3-8B |
| OmniGen2 | 06/23/2025 | Decoder-only & VAE | LLaVA-OneVision/ SAM-LLaVA.. | - | ViT | Qwen2.5-VL |
| Gemini-2.5-Pro | 06/17/2025 | - | - | - | - | - |
| GPT-o3/o4-mini | 06/10/2025 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Mimo-VL (Xiaomi) | 06/04/2025 | Decoder-only | 24 Trillion MLLM tokens | 7B | Qwen2.5-ViT | Mimo-7B-base |
| BAGEL (Bytedance) | 05/20/2025 | Unified Model | Video/Image/Text | 7B | [SigLIP2-so400m/14](https://arxiv.org/abs/2502.14786) | Qwen2.5 |
| BLIP3-o | 05/14/2025 | Decoder-only | (BLIP3-o 60K) GPT-4o Generated Image Generation Data | 4/8B | ViT | Qwen2.5-VL |
| InternVL-3 | 04/14/2025 | Decoder-only | 200 Billion Tokens | 1/2/8/9/14/38/78B | ViT-300M/6B | InternLM2.5/Qwen2.5 |
| LLaMA4-Scout/Maverick | 04/04/2025 | Decoder-only | 40/20 Trillion Tokens | 17B | MetaClip | LLaMA4 |
| Qwen2.5-Omni | 03/26/2025 | Decoder-only | Video/Audio/Image/Text | 7B | Qwen2-Audio/Qwen2.5-VL ViT | End-to-End Mini-Omni |
| Qwen2.5-VL | 01/28/2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
| Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
| Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
| SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
| DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-Janus-Pro |
| Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
| DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
| xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
| TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
| Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
| LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
| Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
| Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
| NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
| Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
| Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA-CLIP ViT-g | QLLaMA |
| InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
| CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
| PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
| LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
| Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
| Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
| BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
| CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
2. <a name='Dataset'></a>Benchmarks and Evaluation
2.1. <a name='TrainingDatasetforVLM'></a> Datasets for Training VLMs
| Dataset | Task | Size |
|---|---|---|
| MMFineReason (01/30/2026) | Reasoning | 1.8M |
| FineVision (09/04/2025) | Mixed Domain | 24.3M / 4.48TB |
2.2. <a name='DatasetforVLM'></a> Datasets and Evaluation for VLM
Visual Math (+ Visual Math Reasoning)
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MathVision | Visual Math | MC / Answer Match | Human | 3.04 | Repo |
| MathVista | Visual Math | MC / Answer Match | Human | 6 | Repo |
| MathVerse | Visual Math | MC | Human | 4.6 | Repo |
| VisNumBench | Visual Number Reasoning | MC | Python program generated / Web collection / Real-life photos | 1.91 | Repo |
Benchmark for Unified Models
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| RealUnify | Math, World knowledge, Image Gen | Direct & StepWise Eval (Sec 3.3) | Script & Human verification | 1.0 | Repo |
| Uni-MMMU | Science, Code, Image Gen | DreamSim (Image Gen Eval) & String Matching (Understanding Eval) | - | 1.0 | Repo |
Video Understanding
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| VideoHallu | Video Understanding | LLM Eval | Human | 3.2 | Repo |
| Video SimpleQA | Video Understanding | LLM Eval | Human | 2.03 | Repo |
| MovieChat | Video Understanding | LLM Eval | Human | 1 | Repo |
| Perception-Test | Video Understanding | MC | Crowd | 11.6 | Repo |
| VideoMME | Video Understanding | MC | Experts | 2.7 | Site |
| EgoSchema | Video Understanding | MC | Synth / Human | 5 | Site |
| Inst-IT-Bench | Fine-grained Image & Video | MC & LLM | Human / Synth | 2 | Repo |
Multimodal Conversation
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| VisionArena | Multimodal Conversation | Pairwise Pref | Human | 23 | Repo |
Multimodal General Intelligence
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MMLU | General MM | MC | Human | 15.9 | Repo |
| MMStar | General MM | MC | Human | 1.5 | Site |
| NaturalBench | General MM | Yes/No, MC | Human | 10 | HF |
| PHYSBENCH | Visual Math Reasoning | MC | Grad STEM | 0.10 | Repo |
Visual Reasoning / VQA (+ Multilingual & OCR)
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| EMMA | Visual Reasoning | MC | Human + Synth | 2.8 | Repo |
| MMTBENCH | Visual Reasoning & QA | MC | AI Experts | 30.1 | Repo |
| MM-Vet | OCR / Visual Reasoning | LLM Eval | Human | 0.2 | Repo |
| MM-En/CN | Multilingual MM Understanding | MC | Human | 3.2 | Repo |
| GQA | Visual Reasoning & QA | Answer Match | Seed + Synth | 22 | Site |
| VCR | Visual Reasoning & QA | MC | MTurks | 290 | Site |
| VQAv2 | Visual Reasoning & QA | Yes/No, Ans Match | MTurks | 1100 | Repo |
| MMMU | Visual Reasoning & QA | Ans Match, MC | College | 11.5 | Site |
| MMMU-Pro | Visual Reasoning & QA | Ans Match, MC | College | 5.19 | Site |
| R1-Onevision | Visual Reasoning & QA | MC | Human | 155 | Repo |
| VLM²-Bench | Visual Reasoning & QA | Ans Match, MC | Human | 3 | Site |
| VisualWebInstruct | Visual Reasoning & QA | LLM Eval | Web | 0.9 | Site |
Visual Text / Document Understanding (+ Charts)
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| TextVQA | Visual Text Understanding | Ans Match | Expert | 28.6 | Repo |
| DocVQA | Document VQA | Ans Match | Crowd | 50 | Site |
| ChartQA | Chart Graphic Understanding | Ans Match | Crowd / Synth | 32.7 | Repo |
Text-to-Image Generation
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MSCOCO-30K | Text-to-Image | BLEU, ROUGE, Sim | MTurks | 30 | Site |
| GenAI-Bench | Text-to-Image | Human Rating | Human | 80 | HF |
Hallucination Detection / Control
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| HallusionBench | Hallucination | Yes/No | Human | 1.13 | Repo |
| POPE | Hallucination | Yes/No | Human | 9 | Repo |
| CHAIR | Hallucination | Yes/No | Human | 124 | Repo |
| MHalDetect | Hallucination | Ans Match | Human | 4 | Repo |
| Hallu-PI | Hallucination | Ans Match | Human | 1.26 | Repo |
| HallE-Control | Hallucination | Yes/No | Human | 108 | Repo |
| AutoHallusion | Hallucination | Ans Match | Synth | 3.129 | Repo |
| BEAF | Hallucination | Yes/No | Human | 26 | Site |
| GAIVE | Hallucination | Ans Match | Synth | 320 | Repo |
| Hal-Eval | Hallucination | Yes/No | Crowd / Synth | 2 | Repo |
| AMBER | Hallucination | Ans Match | Human | 15.22 | Repo |
2.3. <a name='DatasetforEmbodiedVLM'></a> Benchmark Datasets, Simulators, and Generative Models for Embodied VLM
| Benchmark | Domain | Type | Project |
|---|---|---|---|
| Drive-Bench | Embodied AI | Autonomous Driving | Website |
| Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
| Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
| Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
| Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
| AI2THOR | Robotics (Navigation) | Simulator | Website, Github Repo |
| ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| VirtualHome | Robotics (Navigation) | Simulator | Website, Github Repo |
| ThreeDWorld | Robotics (Navigation) | Simulator | Website, Github Repo |
| VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
| CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
| GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| WebArena | Web Agent | Simulator | Website, Github Repo |
| UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
| GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
| LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
| Genesis | Embodied AI | Generative Model, World Model | Github Repo |
| EMMOE | Embodied AI | Generative Model, World Model | Paper |
| RoboGen | Embodied AI | Generative Model, World Model | Website |
| UnrealZoo | Embodied AI (Tracking, Navigation, Multi Agent) | Simulator | Website |
3. <a name='posttraining'></a>Post-Training
3.1. <a name='alignment'></a>RL Alignment for VLM
| Title | Year | Paper | RL | Code |
|---|---|---|---|---|
| Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning | 10/12/2025 | Paper | GRPO | - |
| Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | 09/29/2025 | Paper | GRPO | - |
| Vision-SR1: Self-rewarding vision-language model via reasoning decomposition | 08/26/2025 | Paper | GRPO | - |
| Group Sequence Policy Optimization | 06/24/2025 | Paper | GSPO | - |
| Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | 05/20/2025 | Paper | GRPO | - |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | 04/10/2025 | Paper | GRPO | Code |
| OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement | 03/21/2025 | Paper | GRPO | Code |
| Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | 03/10/2025 | Paper | GRPO | Code |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | 2025 | Paper | DPO | Code |
| Multimodal Open R1/R1-Multimodal-Journey | 2025 | - | GRPO | Code |
| R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | 2025 | Paper | GRPO | Code |
| Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning | 2025 | - | PPO/REINFORCE++/GRPO | Code |
| MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
| LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
| Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
| Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |
| All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning | 2025 | Paper | Online RL | - |
| Video-R1: Reinforcing Video Reasoning in MLLMs | 2025 | Paper | GRPO | Code |
3.2. <a name='sft'></a>Finetuning for VLM
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | 04/21/2025 | Paper | Website | Code |
| OMNICAPTIONER: One Captioner to Rule Them All | 04/09/2025 | Paper | Website | Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
| LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
| Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |
| VisionArena: 230K Real World User-VLM Conversations with Preference Labels | 2024 | Paper | - | Code |
3.3. <a name='vlm_github'></a>VLM Alignment GitHub
| Project | Repository Link |
|---|---|
| Verl | GitHub |
| EasyR1 | GitHub |
| OpenR1 | GitHub |
| LLaMAFactory | GitHub |
| MM-Eureka-Zero | GitHub |
| MM-RLHF | GitHub |
| LMM-R1 | GitHub |
3.4. <a name='vlm_prompt_engineering'></a>Prompt Optimization
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | 04/30/2025 | Paper | Website | Code |
4. <a name='Toolenhancement'></a>Applications
4.1 Embodied VLM Agents
| Title | Year | Paper Link |
|---|---|---|
| Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | Paper |
| Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
| ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
| Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |
| VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method | 2025 | Paper |
| MP-GUI: Modality Perception with MLLMs for GUI Understanding | 2025 | Paper |
4.2. <a name='GenerativeVisualMediaApplications'></a>Generative Visual Media Applications
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | Paper | Website | Code |
| Spurious Correlation in Multimodal LLMs | 2025 | Paper | - | - |
| WeGen: A Unified Model for Interactive Multimodal Generation as We Chat | 2025 | Paper | - | Code |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | 2025 | Paper | Website | Code |
4.3. <a name='RoboticsandEmbodiedAI'></a>Robotics and Embodied AI
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | Paper | Website | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | Paper | Website | - |
| Vision-language model-driven scene understanding and robotic object manipulation | 2024 | Paper | - | - |
| Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | Paper | Website | - |
| AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | Paper | Website | - |
| VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | Paper | - | - |
| Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | Paper | Website | - |
| DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | Paper | Website | - |
| MotionGPT: Human Motion as a Foreign Language | 2023 | Paper | - | Code |
| Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | Paper | - | - |
| Language to Rewards for Robotic Skill Synthesis | 2023 | Paper | Website | - |
| Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | Paper | Website | - |
| Integrated Task and Motion Planning | 2020 | Paper | - | - |
| Jailbreaking LLM-Controlled Robots | 2024 | Paper | Website | - |
| Robots Enact Malignant Stereotypes | 2022 | Paper | Website | - |
| LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | Paper | - | - |
| Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | Paper | Website | - |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | Paper | Website | Code & Dataset |
| Gemini Robotics: Bringing AI into the Physical World | 2025 | Technical Report | Website | - |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | Paper | Website | - |
| Magma: A Foundation Model for Multimodal AI Agents | 2025 | Paper | Website | Code |
| DayDreamer: World Models for Physical Robot Learning | 2022 | Paper | Website | Code |
| Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | Paper | - | - |
| RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | Paper | Website | Code |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | Paper | Website | Code |
| Unified Video Action Model | 2025 | Paper | Website | Code |
| HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | 2025 | Paper | Website | Code |
4.3.1. <a name='Manipulation'></a>Manipulation
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | Paper | Website | |
| Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | Paper | - | - |
| Creative Robot Tool Use with Large Language Models | 2023 | Paper | Website | - |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | Paper | - | - |
| RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | Paper | Website | - |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | Paper | Website | - |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | Paper | Website | - |
| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | Paper | Website | - |
| AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | Paper | Website | Code |
| Masked World Models for Visual Control | 2022 | Paper | Website | Code |
| Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | Paper | Website | Code |
4.3.2. <a name='Navigation'></a>Navigation
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | Paper | - | - |
| LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | Paper | - | - |
| LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | Paper | Website | - |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2024 | Paper | Website | - |
| VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | Paper | - | - |
| Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | Paper | Website | - |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | Paper | - | - |
| Navigation World Models | 2024 | Paper | Website | - |
4.3.3. <a name='HumanRobotInteraction'></a>Human-robot Interaction
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | Paper | Website | - |
| LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | Paper | Website | - |
| VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | Paper | - | - |
4.3.4. <a name='AutonomousDriving'></a>Autonomous Driving
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | 01/07/2025 | Paper | Website | - |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | Paper | Website | - |
| GPT-Driver: Learning to Drive with GPT | 2023 | Paper | - | - |
| LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | Paper | Website | - |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | Paper | - | - |
| Referring Multi-Object Tracking | 2023 | Paper | - | Code |
| VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | Paper | - | Code |
| MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | Paper | - | - |
| DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | Paper | Website | - |
| VLP: Vision Language Planning for Autonomous Driving | 2024 | Paper | - | - |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | Paper | - | - |
4.4. <a name='Human-CenteredAI'></a>Human-Centered AI
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | Paper | - | Code |
| LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration - A Robot Sous-Chef Application | 2024 | Paper | - | - |
| Pretrained Language Models as Visual Planners for Human Assistance | 2023 | Paper | - | - |
| Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | Paper | - | - |
| Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | Paper | - | - |
4.4.1. <a name='WebAgent'></a>Web Agent
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | Paper | - | - |
| CogAgent: A Visual Language Model for GUI Agents | 2023 | Paper | - | Code |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | Paper | - | Code |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | Paper | - | Code |
| ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | Paper | - | Code |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | Paper | - | Code |
4.4.2. <a name='Accessibility'></a>Accessibility
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| X-World: Accessibility, Vision, and Autonomy Meet | 2021 | Paper | - | - |
| Context-Aware Image Descriptions for Web Accessibility | 2024 | Paper | - | - |
| Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models | 2024 | Paper | - | - |
4.4.3. <a name='Medical and Healthcare'></a>Healthcare
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning | 12/2025 | Paper | - | Code |
| Frontiers in Intelligent Colonoscopy | 02/2025 | Paper | - | Code |
| VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | Paper | - | Code |
| Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | Paper | - | - |
| M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | Paper | - | - |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | Paper | - | Code |
| Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | Paper | - | Code |
4.4.4. <a name='SocialGoodness'></a>Social Goodness
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | Paper | - | - |
| Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | Paper | - | - |
| Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | Paper | - | - |
| A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | Paper | - | - |
| Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | Paper | - | Code |
| DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images | 2024 | Paper | - | - |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | Paper | - | Code |
| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | Paper | - | Code |
| He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | Paper | - | - |
| UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | Paper | - | - |
5. <a name='Challenges'></a>Challenges
5.1 <a name='Hallucination'></a>Hallucination
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Object Hallucination in Image Captioning | 2018 | Paper | - | - |
| Evaluating Object Hallucination in Large Vision-Language Models | 2023 | Paper | - | Code |
| Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | Paper | - | - |
| HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | Paper | - | Code |
| Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | Paper | - | Code |
| BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | Paper | Website | - |
| HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | Paper | - | Code |
| AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | Paper | Website | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | Paper | - | Code |
| Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | Paper | - | Code |
| AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | Paper | - | Code |
5.2 <a name='Safety'></a>Safety
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | Paper | Website | - |
| Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | Paper | - | - |
| SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | Paper | - | - |
| JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | Paper | - | - |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | Paper | - | Code |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | 2024 | Paper | - | - |
| Jailbreaking Attack against Multimodal Large Language Model | 2024 | Paper | - | - |
| Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | Paper | Website | Code |
| Safety Guardrails for LLM-Enabled Robots | 2025 | Paper | - | - |
5.3 <a name='Fairness'></a>Fairness
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Hallucination of Multimodal Large Language Models: A Survey |
