Papers
-
D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
-
Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
-
DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO
-
SemanticGen: Video Generation in Semantic Space
-
Streaming Video Instruction Tuning
-
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
-
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
-
GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
-
COBRA: Catastrophic Bit-flip Reliability Analysis of State-Space Models
-
From Word to World: Can Large Language Models be Implicit Text-based World Models?
-
Secret mixtures of experts inside your LLM (University of Pennsylvania, Wharton School of Statistics and Data Science)
-
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
-
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
-
Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
-
Sigma-MoE-Tiny Technical Report
-
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
-
Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
-
DVGT: Driving Visual Geometry Transformer
-
RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
-
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
-
Kling-Omni Technical Report
-
EasyV2V: A High-quality Instruction-based Video Editing Framework
-
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
-
GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
-
Addendum to GPT-5.2 System Card: GPT-5.2-Codex
-
Monitoring Monitorability
-
Spatia: Video Generation with Updatable Spatial Memory
-
Prompt Repetition Improves Non-Reasoning LLMs
-
Towards a Science of Scaling Agent Systems
-
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
-
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
-
GLM-TTS Technical Report
-
Native and Compact Structured Latents for 3D Generation
-
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
-
T5Gemma 2: Seeing, Reading, and Understanding Longer
-
Evaluating AI’s ability to perform scientific research tasks
-
AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path
-
Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
-
Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10×
-
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
-
KlingAvatar 2.0 Technical Report
-
Wait, Wait, Wait... Why Do Reasoning Models Loop?
-
World Models Can Leverage Human Videos for Dexterous Manipulation
-
Towards Scalable Pre-training of Visual Tokenizers for Generation
-
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
-
Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
-
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
-
Diffusion Language Model Inference with Monte Carlo Tree Search
-
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
