video-SALMONN 2

video-SALMONN 2 is a caption-enhanced audio-visual LLM that jointly processes video and audio to produce detailed captions and question-answering (QA) responses. It is released in 3B, 7B, and 72B sizes. It leads audio-visual QA benchmarks such as Video-MME, WorldSense, and AVUT, and performs strongly on visual-only benchmarks such as MLVU and LVBench. The repository provides training and evaluation code, checkpoints, and the upgraded video-SALMONN 2+ line.

Overview

video-SALMONN 2 is an audio-visual large language model from Tsinghua University and ByteDance. It uses both video frames and audio to generate rich captions and answers, reaching state-of-the-art results on many audio-visual QA and video-understanding benchmarks.

🎥Videos ❓Answers

About ByteDance

ByteDance is a multinational technology company known for its content platforms, including TikTok and Douyin.

Industry: Internet

Company Size: 10001+

Location: Beijing, CN

Website: bytedance.com

Last updated: February 25, 2026
