Papers
- WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
- Experiential Reinforcement Learning
- WizardLM: Empowering large pre-trained language models to follow complex instructions
- Florence: A New Foundation Model for Computer Vision
- LLM-in-Sandbox Elicits General Agentic Intelligence
- On-Policy Context Distillation for Language Models
- See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
- LIVE: Long-horizon Interactive Video World Modeling
- Closing the Loop: Universal Repository Representation with RPG-Encoder
- CUA-Skill: Develop Skills for Computer Using Agent
- AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
- X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
- LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
- Lost in Transmission: When and Why LLMs Fail to Reason Globally
- Efficient Autoregressive Video Diffusion with Dummy Head
- Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
- Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
- Controlled LLM Training on Spectral Sphere
- InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
- Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
- From Word to World: Can Large Language Models be Implicit Text-based World Models?
- Sigma-MoE-Tiny Technical Report
- FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
- Spatia: Video Generation with Updatable Spatial Memory
- Native and Compact Structured Latents for 3D Generation
- Wait, Wait, Wait... Why Do Reasoning Models Loop?
- Glance: Accelerating Diffusion Models with 1 Sample
- Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
- VIGS-SLAM: Visual Inertial Gaussian Splatting SLAM
- The Art of Scaling Test-Time Compute for Large Language Models
- ThetaEvolve: Test-time Learning on Open Problems
- LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models
- SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
- Autoregressive Speech Synthesis without Vector Quantization
- LLMs Get Lost In Multi-Turn Conversation
- AI-Instruments: Embodying Prompts as Instruments to Abstract & Reflect Graphical Interface Commands as General-Purpose Tools
- AI at Work Is Here. Now Comes the Hard Part
- Towards Making the Most of ChatGPT for Machine Translation
- BitNet: Scaling 1-bit Transformers for Large Language Models
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Phi-2: A Small Language Model with Reasoning Capability
- Textbooks Are All You Need
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4
- ReWOO: Decoupling Reasoning from Observation for Efficient LLM Reasoning
- Sparks of Artificial General Intelligence: Early experiments with GPT-4
- Language Is Not All You Need: Aligning Perception with Language Models
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- LoRA: Low-Rank Adaptation of Large Language Models
