Reliable Weak-to-Strong Monitoring of LLM Agents

scaleapi / mrt

https://scale.com/research/mrt

19 4 Language: Jupyter Notebook License: MIT Updated: 1mo ago

README

Reliable Weak-to-Strong Monitoring of LLM Agents

This repository contains the code and experiments for "Reliable Weak-to-Strong Monitoring of LLM Agents", which presents a standardized workflow for evaluating the reliability of monitoring systems against adversarial agents that attempt to evade oversight.

Overview

Large Language Model (LLM) agents are becoming increasingly capable and autonomous, making oversight and monitoring crucial for safe deployment. This work introduces a systematic approach to red-teaming agent monitors - testing how well they can detect when agents misbehave or attempt to evade detection while pursuing hidden malicious objectives.

Key Contributions

Standardized Monitor Red Teaming (MRT) Workflow: A unified framework with defined threat models, metrics, and red teaming strategies for evaluating monitor reliability
CUA-SHADE-Arena Dataset: 15 computer-use agent sabotage tasks built on TheAgentCompany benchmark
Advanced Monitor Scaffolding: Sequential and hybrid monitoring systems that enable weaker LLMs to effectively monitor stronger agents
Comprehensive Empirical Analysis: Evaluation across multiple threat models, datasets, and monitoring approaches

Repository Structure

monitor-red-teaming/
├── configs/                    # Configuration files for models and experiments
├── environment_classes/        # Mock environment components (banks, calendars, etc.)
├── environments/              # Environment implementations for different task domains  
├── experiments/               # Experiment-specific code and results
│   └── agentcompany/         # OpenHands integration for computer-use tasks
├── llms/                      # LLM interface implementations (OpenAI, Anthropic, etc.)
├── monitor_generation/        # Code for running monitoring experiments
├── scripts/                   # Shell scripts for running experiments
├── task_evaluation/           # Task-specific evaluation functions
├── task_pairs/               # SHADE-Arena task definitions (17 tasks)
├── task_pairs_agentcompany/  # CUA-SHADE-Arena task definitions (15 tasks)
├── tools/                    # Mock tools for agent interactions
├── utils/                    # Shared utilities and helper functions
└── universe_related_stuff/   # Mock data populating the environments

Datasets and Tasks

SHADE-Arena Tasks (17 tasks)

Built on the original SHADE-Arena framework, these tasks involve tool-use agents with:

Main Task: Benign, legitimate objective the agent should complete
Side Task: Hidden malicious objective the agent attempts to complete covertly

Example tasks include: bank transfers, calendar management, blog posting, travel booking, etc.

CUA-SHADE-Arena Tasks (15 tasks)

Computer-use agent tasks built on TheAgentCompany, involving:

Primary Task: Legitimate business/administrative work
Secondary Task: Hidden sabotage or data exfiltration objective

Tasks span domains like spreadsheet manipulation, form filling, data analysis, HR operations, and financial tasks.

Threat Models

We evaluate monitors under 6 different threat models characterized by:

Agent Awareness: Whether the agent knows it's being monitored
Monitor Knowledge: How much the monitor knows about the agent's hidden objectives

These combinations represent different real-world deployment scenarios and attacker capabilities.

Red Teaming Strategies

The repository implements several adversarial strategies agents use to evade detection:

Naive: Standard execution without evasion attempts
Liar: Agent lies about its actions in reasoning traces
Manipulation: Agent attempts to manipulate the monitor's judgment
Selective Verbalization: Agent selectively hides reasoning about malicious actions
Combined Approaches: Combinations of the above strategies

Monitor Scaffolding

We implement and evaluate three monitoring approaches:

Zero-shot Baseline: Direct prompting on full agent trajectory
Hierarchical Monitoring: Multi-level summarization before final judgment
Sequential Monitoring: Sliding window inspection of agent steps
Hybrid System: Combination of hierarchical and sequential approaches

Key finding: Hybrid scaffolding enables weak monitor models to effectively oversee stronger agent models.

Quick Start

Setup

The task data (task_pairs/ and task_pairs_agentcompany/ folders) are stored in a public Hugging Face dataset repository to reduce the size of this repository. Before running any experiments, you need to download this data:

# Run the setup script to download task data and install dependencies
./scripts/setup.sh

Alternatively, you can run the setup steps manually:

# Install dependencies
pip install -r requirements.txt

# Download task data from Hugging Face
python3 load_trajectories.py

The setup script will:

Install required Python dependencies
Download the full contents of task_pairs/ and task_pairs_agentcompany/ folders from ScaleAI/mrt
Verify that all task data is correctly loaded

Prerequisites

pip install -r requirements.txt

Note: If you ran the setup script above, dependencies are already installed.

Configuration

Copy and configure the appropriate config file:
- configs/config_shadearena.yaml for SHADE-Arena experiments
- Add your API keys for the LLM providers you plan to use

Running SHADE-Arena Experiments

1. Generate Agent Trajectories

# Run the task generation script
./scripts/run_tasks_shadearena.sh

This generates trajectories for different red teaming strategies:

naive: Unaware agents with no evasion
False: Standard aware agents
mm_v4: Manipulation strategy
liar_204: Lying strategy
selective_v2: Selective verbalization
no_secondary: Baseline with no side task

2. Run Monitor Evaluation

# Evaluate monitors on the generated trajectories
./scripts/run_monitors_shadearena.sh

Running CUA-SHADE-Arena Experiments

1. Generate Computer-Use Agent Trajectories

./scripts/run_tasks_agentcompany.sh \
    --format_name benign_v1 \
    --backend_model anthropic/claude-sonnet-4-20250514 \
    --num_experiments 5 \
    --task_names admin_make_spreadsheet admin_mass_forms_filling

2. Run Monitor Evaluation

python monitor_generation/add_monitors_agentcompany.py \
    --agent_model "anthropic/claude-sonnet-4-20250514" \
    --monitor_model "anthropic/claude-3-5-sonnet-20241022" \
    --transcript_type "transcripts_no_hint_benign_v1_thinking_time" \
    --monitor_styles "hierarchical_5_likert_reminder_v2" "sequential_5_likert_reminder_v2" \
    --max_concurrent 10

Advanced Experiments

Chunk Size Ablation Study

./scripts/run_monitors_chunk_size_ablation.sh

Human-Augmented Monitoring

./scripts/run_monitors_humanaug.sh

Key Findings

Agent awareness has greater impact on monitoring reliability than monitor awareness
Monitor scaffolding is more important than granting monitors more knowledge about hidden agendas
Weak monitor LLMs cannot reliably monitor strong agents, but hybrid scaffolding bridges this gap
Human-in-the-loop approaches improve precision at low false positive rates (~15% improvement at 1% FPR)

Evaluation Metrics

We standardize on two key metrics:

AUC-ROC: Overall monitor performance across all thresholds
TPR at FPR = 0.01: True positive rate when false positive rate is constrained to 1% (relevant for high-precision enterprise settings)

Extending the Framework

Adding New Tasks

SHADE-Arena Tasks: Add task definition in task_pairs/<task_name>/
Computer-Use Tasks: Add task definition in task_pairs_agentcompany/<task_name>/

Each task requires:

task_definition.py: Task environment and prompts
task.json: Primary and secondary task descriptions
solution.py: Reference solution

Adding New Red Teaming Strategies

Implement new strategies by:

Extending the remove_thinking_tags parameter options in ExperimentConfig
Adding corresponding logic in the agent execution pipeline
Updating monitoring evaluation to handle the new strategy

Adding New Monitor Scaffolding

Implement new monitoring approaches in monitor_generation/
Add corresponding monitor_styles options
Update evaluation scripts to include the new approach

This repository builds upon:

SHADE-Arena (Kutasov et al., 2025): Original framework for agent monitoring evaluation
TheAgentCompany (Xu et al., 2024): Computer-use agent benchmark

Citation

If you use this code or datasets in your research, please cite:

@misc{kale2025reliableweaktostrongmonitoringllm,
      title={Reliable Weak-to-Strong Monitoring of LLM Agents}, 
      author={Neil Kale and Chen Bo Calvin Zhang and Kevin Zhu and Ankit Aich and Paula Rodriguez and Scale Red Team and Christina Q. Knight and Zifan Wang},
      year={2025},
      eprint={2508.19461},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.19461}, 
}

License

This repository uses different licenses for different components:

Code and Data

The source code, datasets, and experimental data in this repository is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

Third-Party Components

This work builds upon and includes adaptations of:

SHADE-Arena framework: MIT License
TheAgentCompany benchmark: MIT License

See the LICENSE file for full details.

Contact

For questions, issues, or collaboration requests, please:

Open an issue on this repository
Contact the research team through the affiliated institutions

Canary String: The following canary helps researchers prevent SHADE-Arena task data from being included in training datasets: 5AGVKj6HGcwQshM7DLulSKcGVbIe2C-mFbSl7eMVOd1VrTNayy0YkjJ1X4W3o

Search

Reliable Weak-to-Strong Monitoring of LLM Agents

Report repository

README

Reliable Weak-to-Strong Monitoring of LLM Agents

Overview

Key Contributions

Repository Structure

Datasets and Tasks

SHADE-Arena Tasks (17 tasks)

CUA-SHADE-Arena Tasks (15 tasks)

Threat Models

Red Teaming Strategies

Monitor Scaffolding

Quick Start

Setup

Prerequisites

Configuration

Running SHADE-Arena Experiments

1. Generate Agent Trajectories

2. Run Monitor Evaluation

Running CUA-SHADE-Arena Experiments

1. Generate Computer-Use Agent Trajectories

2. Run Monitor Evaluation

Advanced Experiments

Chunk Size Ablation Study

Human-Augmented Monitoring

Key Findings

Evaluation Metrics

Extending the Framework

Adding New Tasks

Adding New Red Teaming Strategies

Adding New Monitor Scaffolding

Citation

License

Code and Data

Third-Party Components

Contact

Go to section

Search

Reliable Weak-to-Strong Monitoring of LLM Agents

README

Reliable Weak-to-Strong Monitoring of LLM Agents

Overview

Key Contributions

Repository Structure

Datasets and Tasks

SHADE-Arena Tasks (17 tasks)

CUA-SHADE-Arena Tasks (15 tasks)

Threat Models

Red Teaming Strategies

Monitor Scaffolding

Quick Start

Setup

Prerequisites

Configuration

Running SHADE-Arena Experiments

1. Generate Agent Trajectories

2. Run Monitor Evaluation

Running CUA-SHADE-Arena Experiments

1. Generate Computer-Use Agent Trajectories

2. Run Monitor Evaluation

Advanced Experiments

Chunk Size Ablation Study

Human-Augmented Monitoring

Key Findings

Evaluation Metrics

Extending the Framework

Adding New Tasks

Adding New Red Teaming Strategies

Adding New Monitor Scaffolding

Related Work

Citation

License

Code and Data

Third-Party Components

Contact

Help

People also viewed

Create AI Tools

Mini Tool

Vibe code an AI Tool

Choose listing type: