Papers
-
AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture
-
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
-
Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL
-
BabyVision: Visual Reasoning Beyond Language
-
RigMo: Unifying Rig and Motion Learning for Generative Animation
-
AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines
-
FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments
-
UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos
-
Rotate Your Character: Revisiting Video Diffusion Models for High-Quality 3D Character Generation
-
One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection
-
GenCtrl -- A Formal Controllability Toolkit for Generative Models
-
Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
-
Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
-
GR-Dexter Technical Report
-
Challenges and Research Directions for Large Language Model Inference Hardware
-
Pixel-Perfect Visual Geometry Estimation
-
DocDancer: Towards Agentic Document-Grounded Information Seeking
-
Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
-
InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
-
Internal Representations as Indicators of Hallucinations in Agent Tool Selection
-
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
-
FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning
-
ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation
-
Extracting books from production language models
-
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
-
A Versatile Multimodal Agent for Multimedia Content Generation
-
Efficient Context Scaling with LongCat ZigZag Attention
-
Pearmut: Human Evaluation of Translation Made Trivial
-
Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
-
Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents
-
CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models
-
Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
-
RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning
-
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
-
Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
-
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
-
ELLA: Efficient Lifelong Learning for Adapters
-
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
-
mHC: Manifold-Constrained Hyper-Connections
-
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
-
AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting
-
Animated 3DGS Avatars in Diverse Scenes with Consistent Lighting and Shadows
-
NarrativeTrack: Evaluating Video Language Models Beyond the Frame
-
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
-
Delay-Tolerant Networking for Tsunami Evacuation on the Small Island of Hachijojima: A Study of Epidemic and Prophet Routing
-
YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
-
HY-MT1.5 Technical Report
-
GARDO: Reinforcing Diffusion Models without Reward Hacking
-
Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation
-
ThinkGen: Generalized Thinking for Visual Generation
