Papers
-
Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
-
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
-
ELLA: Efficient Lifelong Learning for Adapters
-
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
-
mHC: Manifold-Constrained Hyper-Connections
-
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
-
AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting
-
Animated 3DGS Avatars in Diverse Scenes with Consistent Lighting and Shadows
-
NarrativeTrack: Evaluating Video Language Models Beyond the Frame
-
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
-
Delay-Tolerant Networking for Tsunami Evacuation on the Small Island of Hachijojima: A Study of Epidemic and Prophet Routing
-
YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
-
HY-MT1.5 Technical Report
-
GARDO: Reinforcing Diffusion Models without Reward Hacking
-
Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation
-
ThinkGen: Generalized Thinking for Visual Generation
-
D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
-
Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
-
DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO
-
SemanticGen: Video Generation in Semantic Space
-
Streaming Video Instruction Tuning
-
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
-
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
-
GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
-
COBRA: Catastrophic Bit-flip Reliability Analysis of State-Space Models
-
From Word to World: Can Large Language Models be Implicit Text-based World Models?
-
Secret mixtures of experts inside your LLM
-
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
-
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
-
Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
-
Sigma-MoE-Tiny Technical Report
-
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
-
Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.
-
DVGT: Driving Visual Geometry Transformer
-
RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
-
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
-
Kling-Omni Technical Report
-
EasyV2V: A High-quality Instruction-based Video Editing Framework
-
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
-
GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
-
Addendum to GPT-5.2 System Card: GPT-5.2-Codex
-
Monitoring Monitorability
-
Spatia: Video Generation with Updatable Spatial Memory
-
Prompt Repetition Improves Non-Reasoning LLMs
-
Towards a Science of Scaling Agent Systems
-
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
-
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
-
GLM-TTS Technical Report
-
Native and Compact Structured Latents for 3D Generation
