Papers
- Abstractive Red-Teaming of Language Model Character
- Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset
- The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
- Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
- Unsupervised decoding of encoded reasoning using language model interpretability
- Anthropic Economic Index report: Uneven geographic and enterprise AI adoption
- Signs of introspection in large language models
- Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
- Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (Anthropic / Alan Turing Institute, Oxford Applied and Theoretical Machine Learning, Swiss Federal Institute of Technology in Zurich, UK AI Security Institute, University of Oxford)
- SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
- On the Biology of a Large Language Model
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
- Alignment faking in large language models
- Best-of-N Jailbreaking
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
- Claude 3.5 Sonnet Model Card Addendum
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- The Claude 3 Model Family: Opus, Sonnet, and Haiku
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Constitutional AI: Harmlessness from AI Feedback
- Toy Models of Superposition
- Softmax Linear Units
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- A Mathematical Framework for Transformer Circuits
