Papers
-
Video-Based Reward Modeling for Computer-Use Agents
Amazon / Mohamed bin Zayed University of Artificial Intelligence, University of Southern California, University of Washington
-
Interactive World Simulator for Robot Policy Training and Evaluation
-
SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans
Amazon / University of Illinois Urbana-Champaign, University of Massachusetts Amherst, University of Montreal, University of San Diego
-
RayD3D: Distilling Depth Knowledge Along the Ray for Robust Multi-View 3D Object Detection
Amazon, Horizon Robotics / Hong Kong University of Science and Technology, Xi'an Jiaotong University
-
ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
-
The World Won't Stay Still: Programmable Evolution for Agent Benchmarks
-
DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
-
SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
-
When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
-
CodeScout: Contextual Problem Statement Enhancement for Software Agents
-
STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks
-
SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning
-
Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
-
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Amazon / Boston University, Carnegie Mellon University, Columbia University, Dartmouth College, Duke University, Michigan State University, Princeton University, Stanford University, The Ohio State University, University of California, University of Oxford, University of Southern California, University of Texas
-
Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots
-
Iterative Reranking as a Compute-Scaling Method for LLM-based Rankers
-
KG-CRAFT: Knowledge graph-based contrastive reasoning with LLMs for enhancing automated fact-checking
-
Pattern Discovery with Wide-Lens Analysis and Sharp-Focus Validation
-
Autoregressive Image Generation with Masked Bit Modeling
-
AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent
-
Interpretable Tabular Foundation Models via In-Context Kernel Regression
-
RFS: Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation
-
Differentiable Semantic ID for Generative Recommendation
-
AnyView: Synthesizing Any Novel View in Dynamic Scenes
-
MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
-
TerraFormer: Automated Infrastructure-as-Code with LLMs Fine-Tuned via Policy-Guided Verifier Feedback
-
Internal Representations as Indicators of Hallucinations in Agent Tool Selection
-
ELLA: Efficient Lifelong Learning for Adapters
-
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
-
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
-
Diffusion Language Model Inference with Monte Carlo Tree Search
-
s3: You Don't Need That Much Data to Train a Search Agent via RL
-
A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications
-
Chronos-2: From Univariate to Universal Forecasting
-
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
-
TabArena: A Living Benchmark for Machine Learning on Tabular Data
-
Amazon Ads Multi-Touch Attribution
-
Establishing Best Practices for Building Rigorous Agentic Benchmarks
-
Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
-
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning
-
Evaluating the Critical Risks of Amazon’s Nova Premier under the Frontier Model Safety Framework
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
-
HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases
-
M+: Extending MemoryLLM with Scalable Long-Term Memory
-
How Does Critical Batch Size Scale in Pre-training?
-
A Systematic Survey of Automatic Prompt Optimization Techniques
-
The Amazon Nova Family of Models: Technical Report and Model Card
-
Evaluating Nova 2.0 Lite model under Amazon’s Frontier Model Safety Framework
-
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
-
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
