Qwen-RobotWorld

By Alibaba
Model family: Qwen
Qwen-RobotWorld uses natural language as a unified action interface to predict future visual trajectories across diverse embodiment types. Its architecture is a Double-Stream MMDiT with MLLM Action Encoding, which couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention. The model is trained on Embodied World Knowledge (EWK), an 8.6M video-text corpus with 200M+ frames spanning 20+ embodiments and 500+ action categories. A two-stage General+Expert Progressive Curriculum first learns general visual priors, then injects embodied specialization. Applications include synthetic data generation for policy training, scalable virtual environments for policy evaluation, and language-guided planning for robot control. It ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench.
Released: June 16, 2026

Overview

Qwen-RobotWorld is a language-conditioned video world model for embodied AI that predicts physically grounded future visual trajectories from natural language instructions and current observations. It supports robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, and is built on a 60-layer double-stream diffusion transformer coupled with Qwen2.5-VL semantics.
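
To make the double-stream coupling more concrete, here is a minimal sketch of one joint-attention step in plain Python. It assumes the common MMDiT pattern: each stream (text semantics and video latents) keeps its own query/key/value projections, the projected tokens are concatenated along the sequence axis, attention runs once over the combined sequence, and the result is split back per stream. The function names, toy dimensions, and random weights are hypothetical illustrations, not Qwen-RobotWorld's actual implementation, and the random token lists merely stand in for Qwen2.5-VL features and video-VAE latents.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix used as a stand-in for learned weights or tokens."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_attention(text_tokens, video_tokens, text_w, video_w):
    """One joint-attention step over two token streams (assumed MMDiT pattern)."""
    # Each stream is projected with its own weights (the "double-stream" part).
    tq, tk, tv = (matmul(text_tokens, text_w[name]) for name in ("q", "k", "v"))
    vq, vk, vv = (matmul(video_tokens, video_w[name]) for name in ("q", "k", "v"))

    # Concatenate along the sequence axis so every token can attend to both streams.
    q, k, v = tq + vq, tk + vk, tv + vv

    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s * scale for s in row]) for row in scores]
    mixed = matmul(weights, v)

    # Split the attended sequence back into its two streams.
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:], weights


# Toy sizes: 3 text tokens, 5 video-latent tokens, hidden width 4 (all hypothetical).
rng = random.Random(0)
dim = 4
text_tokens = random_matrix(3, dim, rng)    # placeholder for frozen MLLM features
video_tokens = random_matrix(5, dim, rng)   # placeholder for video-VAE latents
text_w = {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}
video_w = {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}

text_out, video_out, weights = joint_attention(text_tokens, video_tokens, text_w, video_w)
print(len(text_out), len(video_out), len(weights))
```

In a full model this step would presumably be repeated in every transformer layer with separate weights, residual connections, and multiple heads; the sketch omits those details and only shows how the two streams exchange information inside a single attention call.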

About Alibaba

Chinese e-commerce and cloud leader behind Taobao, Tmall, and Alipay.

Industry: Retail
Company Size: 128,197
Location: CN
Website: alibaba.com

Tools using Qwen-RobotWorld

No tools found for this model yet.

Last updated: June 18, 2026