Papers
- Abstractive Red-Teaming of Language Model Character
- Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset
- The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
- Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
- Unsupervised decoding of encoded reasoning using language model interpretability
- Anthropic Economic Index report: Uneven geographic and enterprise AI adoption
- Signs of introspection in large language models
- Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
- Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (Anthropic / Alan Turing Institute, Oxford Applied and Theoretical Machine Learning, Swiss Federal Institute of Technology in Zurich, UK AI Security Institute, University of Oxford)
- SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
- On the Biology of a Large Language Model
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
- Alignment faking in large language models
- Best-of-N Jailbreaking
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
- Claude 3.5 Sonnet Model Card Addendum
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- The Claude 3 Model Family: Opus, Sonnet, and Haiku
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Constitutional AI: Harmlessness from AI Feedback
- Toy Models of Superposition
- Softmax Linear Units
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- A Mathematical Framework for Transformer Circuits
