Qwen-RobotWorld

Model family: Qwen

Qwen-RobotWorld uses natural language as a unified action interface to predict future visual trajectories across diverse embodiment types. Its architecture is a Double-Stream MMDiT with MLLM Action Encoding, which couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention. The model is trained on Embodied World Knowledge (EWK), an 8.6M video-text corpus with 200M+ frames spanning 20+ embodiments and 500+ action categories. A two-stage General+Expert Progressive Curriculum first learns general visual priors, then injects embodied specialization. Applications include synthetic data generation for policy training, scalable virtual environments for policy evaluation, and language-guided planning for robot control. The model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench.
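To make "joint attention" between a text stream and a video-latent stream concrete, here is a minimal, illustrative sketch in plain Python. It is not the model's implementation: the function names, token counts, dimensions, and random weights are all assumptions, and it uses a single head with no normalization, MLP, output projection, or timestep conditioning. It only shows the general double-stream idea, in which each stream keeps its own query/key/value projections while attention is computed over the concatenated token sequence.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or token features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_attention(text_tokens, video_tokens, text_proj, video_proj):
    """One single-head joint-attention step over two streams (hypothetical sketch).

    Each stream is projected with its own (Wq, Wk, Wv) weights; the projected
    tokens are concatenated so every token can attend to both streams.
    """
    dim = len(text_tokens[0])
    q, k, v = [], [], []
    for tokens, (wq, wk, wv) in ((text_tokens, text_proj), (video_tokens, video_proj)):
        q += matmul(tokens, wq)
        k += matmul(tokens, wk)
        v += matmul(tokens, wv)
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(weights, v)
    n_text = len(text_tokens)
    # Split the result back into the two streams.
    return mixed[:n_text], mixed[n_text:], weights


rng = random.Random(0)
dim = 4                                   # assumed toy feature width
text = random_matrix(3, dim, rng)         # 3 assumed instruction tokens
video = random_matrix(5, dim, rng)        # 5 assumed video-latent tokens
text_proj = tuple(random_matrix(dim, dim, rng) for _ in range(3))
video_proj = tuple(random_matrix(dim, dim, rng) for _ in range(3))

text_out, video_out, attn = joint_attention(text, video, text_proj, video_proj)
print(len(text_out), len(video_out), len(attn[0]))
```

Each output token is a weighted mix of values from both streams, which is how language semantics can influence the video latents (and vice versa) at every layer where such a block is applied.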

Overview

Qwen-RobotWorld is a language-conditioned video world model for embodied AI that predicts physically grounded future visual trajectories from natural language instructions and current observations. It supports robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. It is built on a 60-layer double-stream diffusion transformer coupled with Qwen2.5-VL semantics.

🎥Video generation 🤖Robotics 🚗Autonomous driving

About Alibaba

Chinese e-commerce and cloud leader behind Taobao, Tmall, and Alipay.

Industry: Retail

Company Size: 128197

Location: CN

Website: alibaba.com


Last updated: June 18, 2026
