Papers
- Prompt Repetition Improves Non-Reasoning LLMs
- Towards a Science of Scaling Agent Systems
- Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
- TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
- GLM-TTS Technical Report
- Native and Compact Structured Latents for 3D Generation
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
- T5Gemma 2: Seeing, Reading, and Understanding Longer
- Evaluating AI’s ability to perform scientific research tasks
- AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path
- Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
- Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10
- GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
- KlingAvatar 2.0 Technical Report
- Wait, Wait, Wait... Why Do Reasoning Models Loop?
- World Models Can Leverage Human Videos for Dexterous Manipulation
- Towards Scalable Pre-training of Visual Tokenizers for Generation
- Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
- Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
- Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
- Diffusion Language Model Inference with Monte Carlo Tree Search
- SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
- Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases
- CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
- BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
- Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
- Glance: Accelerating Diffusion Models with 1 Sample
- Sharp Monocular View Synthesis in Less Than a Second
- On Learning-Curve Monotonicity for Maximum Likelihood Estimators
- Matrix-game 2.0: An open-source real-time and streaming interactive world model
- UniUGP: Unifying Understanding, Generation, and Planning for End-to-end Autonomous Driving
- Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
- PAVAS: Physics-Aware Video-to-Audio Synthesis
- Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation
- HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
- MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
- Process Reward Models That Think
- Distribution Matching Variational AutoEncoder
- Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
- Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
- Unsupervised decoding of encoded reasoning using language model interpretability
- EditThinker: Unlocking Iterative Reasoning for Any Image Editor
- Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
- EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
- Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability
- Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
- SIMA 2: A Generalist Embodied Agent for Virtual Worlds
- SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control
- Training LLMs for Honesty via Confessions
