Papers
-
OneRanker: Unified Generation and Ranking with One Model in Industrial Advertising Recommendation
-
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
-
NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
-
Kling-MotionControl Technical Report
-
Architecting Trust in Artificial Epistemic Agents
-
Utonia: Toward One Encoder for All Point Clouds
-
LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
-
Beyond Pixel Histories: World Models with Persistent 3D State
-
Heterogeneous Agent Collaborative Reinforcement Learning
-
Beyond Language Modeling: An Exploration of Multimodal Pretraining
-
Modular Memory is the Key to Continual Learning Agents
-
Expanding LLM Agent Boundaries with Strategy-Guided Exploration
-
ROBOMETER: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
-
LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
-
Agentic Code Reasoning
-
CuTe Layout Representation and Algebra
-
RubricBench: Aligning Model-Generated Rubrics with Human Standards
-
WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories
-
How Well Does Agent Development Reflect Real-World Work?
-
Learn Hard Problems During RL with Reference Guided Fine-tuning
-
SSKG Hub: An Expert-Guided Platform for LLM-Empowered Sustainability Standards Knowledge Graphs
-
Process-of-Thought Reasoning for Videos
-
Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models (Helmholtz Munich, Technical University of Munich, University of Tübingen)
-
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models
-
Mode Seeking meets Mean Seeking for Fast Long Video Generation
-
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
-
MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
-
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
-
AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
-
Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
-
SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation (Tsinghua University)
-
SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model
-
Generative Recommendation for Large-Scale Advertising
-
VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale
-
DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
-
TrajTok: Learning Trajectory Tokens enables better Video Understanding
-
The Art of Efficient Reasoning: Data, Reward, and Optimization
-
World Guidance: World Modeling in Condition Space for Action Generation
-
The Design Space of Tri-Modal Masked Diffusion Models
-
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
-
UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling
-
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
-
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
-
Test-Time Training with KV Binding Is Secretly Linear Attention
-
S-PRESSO: Ultra Low Bitrate Sound Effect Compression With Diffusion Autoencoders And Offline Quantization
-
gQIR: Generative Quanta Image Reconstruction
-
Compositional Planning with Jumpy World Models
-
SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning
-
Toward the Thermodynamic Limit: Neural Operators for Non-equilibrium Dynamics of Mott Insulators
