video-SALMONN 2

video-SALMONN 2 is a caption-enhanced audio-visual LLM that jointly processes video and audio to produce detailed captions and QA responses. Released in 3B, 7B and 72B sizes, it tops benchmarks like Video-MME, WorldSense and AVUT for audio-visual QA and performs strongly on visual-only tests such as MLVU and LVBench. The repo provides training and evaluation code plus checkpoints and an upgraded video-SALMONN 2+ line.
Released: June 1, 2025

Overview

video-SALMONN 2 is an audio-visual large language model from Tsinghua University and ByteDance. It uses both video frames and sound to generate rich captions and answers, and it reaches state-of-the-art results on many audio-visual QA and video-understanding benchmarks.

About ByteDance

ByteDance is a multinational technology company known for its content platforms, including TikTok and Douyin.

Industry: Internet
Company Size: 10001+
Location: Beijing, CN

Last updated: February 25, 2026