Papers
-
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
-
Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
-
Sigma-MoE-Tiny Technical Report
-
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
-
Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
-
DVGT: Driving Visual Geometry Transformer
-
RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
-
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
-
Kling-Omni Technical Report
-
EasyV2V: A High-quality Instruction-based Video Editing Framework
-
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
-
GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
-
Addendum to GPT-5.2 System Card: GPT-5.2-Codex
-
Monitoring Monitorability
-
Spatia: Video Generation with Updatable Spatial Memory
-
Prompt Repetition Improves Non-Reasoning LLMs
-
Towards a Science of Scaling Agent Systems
-
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
-
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
-
GLM-TTS Technical Report
-
Native and Compact Structured Latents for 3D Generation
-
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
-
T5Gemma 2: Seeing, Reading, and Understanding Longer
-
Evaluating AI’s ability to perform scientific research tasks
-
AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path
-
Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
-
Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10
-
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
-
KlingAvatar 2.0 Technical Report
-
Wait, Wait, Wait... Why Do Reasoning Models Loop?
-
World Models Can Leverage Human Videos for Dexterous Manipulation
-
Towards Scalable Pre-training of Visual Tokenizers for Generation
-
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
-
Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
-
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
-
Diffusion Language Model Inference with Monte Carlo Tree Search
-
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
-
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
-
Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases
-
CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
-
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
-
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
-
Glance: Accelerating Diffusion Models with 1 Sample
-
Sharp Monocular View Synthesis in Less Than a Second
-
On Learning-Curve Monotonicity for Maximum Likelihood Estimators
-
Matrix-game 2.0: An open-source real-time and streaming interactive world model
-
UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving
-
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
-
PAVAS: Physics-Aware Video-to-Audio Synthesis
