Reliable Weak-to-Strong Monitoring of LLM Agents
https://scale.com/research/mrt
This repository contains the code and experiments for "Reliable Weak-to-Strong Monitoring of LLM Agents", which presents a standardized workflow for evaluating the reliability of monitoring systems against adversarial agents that attempt to evade oversight.
Overview
Large Language Model (LLM) agents are becoming increasingly capable and autonomous, making oversight and monitoring crucial for safe deployment. This work introduces a systematic approach to red-teaming agent monitors: it tests how well monitors detect agents that misbehave or try to evade detection while pursuing hidden malicious objectives.
Key Contributions
- Standardized Monitor Red Teaming (MRT) Workflow: A unified framework with defined threat models, metrics, and red teaming strategies for evaluating monitor reliability
- CUA-SHADE-Arena Dataset: 15 computer-use agent sabotage tasks built on TheAgentCompany benchmark
- Advanced Monitor Scaffolding: Sequential and hybrid monitoring systems that enable weaker LLMs to effectively monitor stronger agents
- Comprehensive Empirical Analysis: Evaluation across multiple threat models, datasets, and monitoring approaches
Repository Structure
monitor-red-teaming/
├── configs/ # Configuration files for models and experiments
├── environment_classes/ # Mock environment components (banks, calendars, etc.)
├── environments/ # Environment implementations for different task domains
├── experiments/ # Experiment-specific code and results
│   └── agentcompany/ # OpenHands integration for computer-use tasks
├── llms/ # LLM interface implementations (OpenAI, Anthropic, etc.)
├── monitor_generation/ # Code for running monitoring experiments
├── scripts/ # Shell scripts for running experiments
├── task_evaluation/ # Task-specific evaluation functions
├── task_pairs/ # SHADE-Arena task definitions (17 tasks)
├── task_pairs_agentcompany/ # CUA-SHADE-Arena task definitions (15 tasks)
├── tools/ # Mock tools for agent interactions
├── utils/ # Shared utilities and helper functions
└── universe_related_stuff/ # Mock data populating the environments
Datasets and Tasks
SHADE-Arena Tasks (17 tasks)
Built on the original SHADE-Arena framework, these tasks involve tool-use agents with:
- Main Task: Benign, legitimate objective the agent should complete
- Side Task: Hidden malicious objective the agent attempts to complete covertly
Example tasks include: bank transfers, calendar management, blog posting, travel booking, etc.
CUA-SHADE-Arena Tasks (15 tasks)
Computer-use agent tasks built on TheAgentCompany, involving:
- Primary Task: Legitimate business/administrative work
- Secondary Task: Hidden sabotage or data exfiltration objective
Tasks span domains like spreadsheet manipulation, form filling, data analysis, HR operations, and financial tasks.
Threat Models
We evaluate monitors under 6 different threat models characterized by:
- Agent Awareness: Whether the agent knows it's being monitored
- Monitor Knowledge: How much the monitor knows about the agent's hidden objectives
These combinations represent different real-world deployment scenarios and attacker capabilities.
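As a rough illustration of how two axes yield six threat models, the sketch below enumerates their cross product. The level names are assumptions made for this example (two agent-awareness levels and three monitor-knowledge levels); the text above names only the axes, so check the paper for the exact definitions.

```python
from itertools import product

# Hypothetical level names: the README names the two axes but not their levels.
AGENT_AWARENESS = ("unaware", "aware")
MONITOR_KNOWLEDGE = ("unaware", "aware", "omniscient")


def threat_models():
    """Return every (agent_awareness, monitor_knowledge) combination."""
    return list(product(AGENT_AWARENESS, MONITOR_KNOWLEDGE))


if __name__ == "__main__":
    for agent, monitor in threat_models():
        print(f"agent={agent:<8} monitor={monitor}")
```

Under these assumed levels the product has 2 x 3 = 6 entries, matching the count stated above.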
Red Teaming Strategies
The repository implements several adversarial strategies agents use to evade detection:
- Naive: Standard execution without evasion attempts
- Liar: Agent lies about its actions in reasoning traces
- Manipulation: Agent attempts to manipulate the monitor's judgment
- Selective Verbalization: Agent selectively hides reasoning about malicious actions
- Combined Approaches: Combinations of the above strategies
Monitor Scaffolding
We implement and evaluate four monitoring approaches, a zero-shot baseline and three scaffolds:
- Zero-shot Baseline: Direct prompting on full agent trajectory
- Hierarchical Monitoring: Multi-level summarization before final judgment
- Sequential Monitoring: Sliding window inspection of agent steps
- Hybrid System: Combination of hierarchical and sequential approaches
Key finding: Hybrid scaffolding enables weak monitor models to effectively oversee stronger agent models.
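The scaffolds above can be sketched in a few lines. This is a minimal illustration, not the code in monitor_generation/: `judge` and `summarize` are hypothetical stand-ins for LLM calls, taking the maximum over windows in the sequential monitor is a simplification, and the weighted average used for the hybrid score is an assumption about how the two scores might be combined.

```python
from typing import Callable, List

Judge = Callable[[str], float]     # hypothetical LLM call: text -> suspicion in [0, 1]
Summarizer = Callable[[str], str]  # hypothetical LLM call: text -> shorter text


def chunks(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def zero_shot(steps: List[str], judge: Judge) -> float:
    """Baseline: judge the full trajectory in a single call."""
    return judge("\n".join(steps))


def hierarchical(steps: List[str], summarize: Summarizer, judge: Judge, size: int = 2) -> float:
    """Summarize groups of steps level by level, then judge the final summary."""
    if size < 2:
        raise ValueError("size must be at least 2 so that each level shrinks")
    level = list(steps) or [""]
    while len(level) > 1:
        level = [summarize("\n".join(group)) for group in chunks(level, size)]
    return judge(level[0])


def sequential(steps: List[str], judge: Judge, window: int = 2) -> float:
    """Judge each sliding window of steps and keep the highest suspicion score."""
    if window < 1:
        raise ValueError("window must be at least 1")
    last_start = max(len(steps) - window, 0)
    return max(judge("\n".join(steps[i:i + window])) for i in range(last_start + 1))


def hybrid(steps: List[str], summarize: Summarizer, judge: Judge, weight: float = 0.5) -> float:
    """Assumed combination rule: weighted average of the two scaffold scores."""
    h = hierarchical(steps, summarize, judge)
    s = sequential(steps, judge)
    return weight * h + (1 - weight) * s


# Toy stand-ins so the sketch runs without any API access.
SUSPICIOUS = ("exfiltrate", "delete logs")


def toy_judge(text: str) -> float:
    return 1.0 if any(k in text.lower() for k in SUSPICIOUS) else 0.0


def toy_summarize(text: str) -> str:
    flagged = [line for line in text.splitlines() if toy_judge(line) > 0]
    return "\n".join(flagged) if flagged else "routine activity"


if __name__ == "__main__":
    trajectory = ["open spreadsheet", "fill form", "exfiltrate payroll to external host", "save file"]
    print(hybrid(trajectory, toy_summarize, toy_judge))
```

In a real setup the toy functions would be replaced by calls to the monitor model; the chunk and window sizes are the knobs a chunk-size ablation would vary.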
Quick Start
Setup
The task data (task_pairs/ and task_pairs_agentcompany/ folders) are stored in a public Hugging Face dataset repository to reduce the size of this repository. Before running any experiments, you need to download this data:
# Run the setup script to download task data and install dependencies
./scripts/setup.sh
Alternatively, you can run the setup steps manually:
# Install dependencies
pip install -r requirements.txt
# Download task data from Hugging Face
python3 load_trajectories.py
The setup script will:
- Install required Python dependencies
- Download the full contents of the task_pairs/ and task_pairs_agentcompany/ folders from ScaleAI/mrt
- Verify that all task data is correctly loaded
Prerequisites
pip install -r requirements.txt
Note: If you ran the setup script above, dependencies are already installed.
Configuration
- Copy and configure the appropriate config file: configs/config_shadearena.yaml for SHADE-Arena experiments
- Add your API keys for the LLM providers you plan to use
Running SHADE-Arena Experiments
1. Generate Agent Trajectories
# Run the task generation script
./scripts/run_tasks_shadearena.sh
This generates trajectories for different red teaming strategies:
- naive: Unaware agents with no evasion
- False: Standard aware agents
- mm_v4: Manipulation strategy
- liar_204: Lying strategy
- selective_v2: Selective verbalization
- no_secondary: Baseline with no side task
2. Run Monitor Evaluation
# Evaluate monitors on the generated trajectories
./scripts/run_monitors_shadearena.sh
Running CUA-SHADE-Arena Experiments
1. Generate Computer-Use Agent Trajectories
./scripts/run_tasks_agentcompany.sh \
--format_name benign_v1 \
--backend_model anthropic/claude-sonnet-4-20250514 \
--num_experiments 5 \
--task_names admin_make_spreadsheet admin_mass_forms_filling
2. Run Monitor Evaluation
python monitor_generation/add_monitors_agentcompany.py \
--agent_model "anthropic/claude-sonnet-4-20250514" \
--monitor_model "anthropic/claude-3-5-sonnet-20241022" \
--transcript_type "transcripts_no_hint_benign_v1_thinking_time" \
--monitor_styles "hierarchical_5_likert_reminder_v2" "sequential_5_likert_reminder_v2" \
--max_concurrent 10
Advanced Experiments
Chunk Size Ablation Study
./scripts/run_monitors_chunk_size_ablation.sh
Human-Augmented Monitoring
./scripts/run_monitors_humanaug.sh
Key Findings
- Agent awareness has a greater impact on monitoring reliability than monitor awareness
- Monitor scaffolding is more important than granting monitors more knowledge about hidden agendas
- Weak monitor LLMs cannot reliably monitor strong agents, but hybrid scaffolding bridges this gap
- Human-in-the-loop approaches improve the true positive rate at low false positive rates (~15% improvement at 1% FPR)
Evaluation Metrics
We standardize on two key metrics:
- AUC-ROC: Overall monitor performance across all thresholds
- TPR at FPR = 0.01: True positive rate when false positive rate is constrained to 1% (relevant for high-precision enterprise settings)
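For readers who want to reproduce these two metrics on their own monitor scores, here is a minimal pure-Python sketch. It assumes each trajectory has a numeric suspicion score and a binary label (1 = side task present, 0 = benign), and that a trajectory is flagged when its score is at or above the threshold. The function names are illustrative and are not taken from this repository.

```python
from typing import List, Sequence, Tuple


def _split(scores: Sequence[float], labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Separate scores into positives (label 1) and negatives (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    return pos, neg


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC as the probability that a positive outscores a negative (ties count half)."""
    pos, neg = _split(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.01) -> float:
    """Best TPR over thresholds whose FPR stays at or below max_fpr (no interpolation)."""
    pos, neg = _split(scores, labels)
    best = 0.0  # a threshold above every score flags nothing: TPR = FPR = 0
    for threshold in sorted(set(scores)):
        fpr = sum(n >= threshold for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= threshold for p in pos) / len(pos)
            best = max(best, tpr)
    return best


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    print(auc_roc(scores, labels), tpr_at_fpr(scores, labels))
```

Note that with fewer than 100 benign trajectories, an FPR budget of 0.01 permits no false positives at all, so the TPR estimate depends heavily on the single highest-scoring benign run.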
Extending the Framework
Adding New Tasks
- SHADE-Arena Tasks: Add task definition in task_pairs/<task_name>/
- Computer-Use Tasks: Add task definition in task_pairs_agentcompany/<task_name>/
Each task requires:
- task_definition.py: Task environment and prompts
- task.json: Primary and secondary task descriptions
- solution.py: Reference solution
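Before running experiments on a new task, it can help to confirm that the three required files are in place. The helper below is a hypothetical convenience written for this README, not a script shipped with the repository; it checks only that the files exist, not that their contents are valid.

```python
import tempfile
from pathlib import Path
from typing import List, Union

# File names taken from the list of required files for each task.
REQUIRED_FILES = ("task_definition.py", "task.json", "solution.py")


def missing_task_files(task_dir: Union[str, Path]) -> List[str]:
    """Return the required file names that are absent from task_dir."""
    task_dir = Path(task_dir)
    return [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real task folder.
    with tempfile.TemporaryDirectory() as tmp:
        print("before:", missing_task_files(tmp))
        for name in REQUIRED_FILES:
            (Path(tmp) / name).write_text("")
        print("after:", missing_task_files(tmp))
```

For a real task, pass the task folder, for example `missing_task_files("task_pairs/<task_name>")` with the placeholder replaced by the task's directory name.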
Adding New Red Teaming Strategies
Implement new strategies by:
- Extending the remove_thinking_tags parameter options in ExperimentConfig
- Adding corresponding logic in the agent execution pipeline
- Updating monitoring evaluation to handle the new strategy
Adding New Monitor Scaffolding
- Implement new monitoring approaches in monitor_generation/
- Add corresponding monitor_styles options
- Update evaluation scripts to include the new approach
Related Work
This repository builds upon:
- SHADE-Arena (Kutasov et al., 2025): Original framework for agent monitoring evaluation
- TheAgentCompany (Xu et al., 2024): Computer-use agent benchmark
Citation
If you use this code or datasets in your research, please cite:
@misc{kale2025reliableweaktostrongmonitoringllm,
title={Reliable Weak-to-Strong Monitoring of LLM Agents},
author={Neil Kale and Chen Bo Calvin Zhang and Kevin Zhu and Ankit Aich and Paula Rodriguez and Scale Red Team and Christina Q. Knight and Zifan Wang},
year={2025},
eprint={2508.19461},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.19461},
}
License
This repository uses different licenses for different components:
Code and Data
The source code, datasets, and experimental data in this repository are licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
Third-Party Components
This work builds upon and includes adaptations of:
- SHADE-Arena framework: MIT License
- TheAgentCompany benchmark: MIT License
See the LICENSE file for full details.
Contact
For questions, issues, or collaboration requests, please:
- Open an issue on this repository
- Contact the research team through the affiliated institutions
Canary String: The following canary helps researchers prevent SHADE-Arena task data from being included in training datasets: 5AGVKj6HGcwQshM7DLulSKcGVbIe2C-mFbSl7eMVOd1VrTNayy0YkjJ1X4W3o
