Papers
- Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset
- The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
- Unsupervised decoding of encoded reasoning using language model interpretability
- Anthropic Economic Index report: Uneven geographic and enterprise AI adoption
- Signs of introspection in large language models
- Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
- SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
- On the Biology of a Large Language Model
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
- Alignment faking in large language models
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
- Claude 3.5 Sonnet Model Card Addendum
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- The Claude 3 Model Family: Opus, Sonnet, and Haiku
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Constitutional AI: Harmlessness from AI Feedback
- Toy Models of Superposition
- Softmax Linear Units
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- A Mathematical Framework for Transformer Circuits
