MemoryBench
Unified benchmark for evaluating conversational memory and RAG across multiple datasets
A pluggable benchmarking framework for evaluating memory and context systems.
Features
- Interoperable: mix and match any provider with any benchmark
- Bring your own benchmarks: plug in custom datasets and tasks
- Checkpointed runs: resume from any pipeline stage (ingest → index → search → answer → evaluate)
- Multi-provider comparison: run the same benchmark across providers side-by-side
- Judge-agnostic: swap GPT-4o, Claude, Gemini, etc. without code changes
- Structured reports: export run status, failures, and metrics for analysis
- Web UI: inspect runs, questions, and failures interactively in real time
    ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
    │ Benchmarks  │   │  Providers  │   │   Judges    │
    │  (LoCoMo,   │   │ (Supermem,  │   │  (GPT-4o,   │
    │  LongMem..) │   │ Mem0, Zep)  │   │  Claude..)  │
    └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
           └─────────────────┼─────────────────┘
                             ▼
                  ┌─────────────────────┐
                  │     MemoryBench     │
                  └──────────┬──────────┘
                             ▼
     ┌────────┬─────────┬────────┬──────────┬────────┐
     │ Ingest │ Indexing│ Search │  Answer  │Evaluate│
     └────────┴─────────┴────────┴──────────┴────────┘
Quick Start
bun install
cp .env.example .env.local # Add your API keys
bun run src/index.ts run -p supermemory -b locomo
Configuration
# Providers (at least one)
SUPERMEMORY_API_KEY=
MEM0_API_KEY=
ZEP_API_KEY=
# Judges (at least one)
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
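The "at least one" rule above can be checked before a run starts. The following is a minimal sketch, not MemoryBench's own code: `checkEnv` and `configuredKeys` are hypothetical names, and only the six variable names come from the configuration listed above.

```typescript
// Hypothetical pre-flight check; only the variable names come from the README.
type Env = Record<string, string | undefined>;

const PROVIDER_KEYS = ["SUPERMEMORY_API_KEY", "MEM0_API_KEY", "ZEP_API_KEY"] as const;
const JUDGE_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"] as const;

// Returns the names from `keys` that have a non-empty value in `env`.
function configuredKeys(env: Env, keys: readonly string[]): string[] {
  return keys.filter((key) => (env[key] ?? "").trim() !== "");
}

// Returns a list of human-readable problems; an empty list means
// at least one provider key and at least one judge key are set.
function checkEnv(env: Env): string[] {
  const problems: string[] = [];
  if (configuredKeys(env, PROVIDER_KEYS).length === 0) {
    problems.push(`Set at least one provider key: ${PROVIDER_KEYS.join(", ")}`);
  }
  if (configuredKeys(env, JUDGE_KEYS).length === 0) {
    problems.push(`Set at least one judge key: ${JUDGE_KEYS.join(", ")}`);
  }
  return problems;
}
```

Under Bun or Node this could be called as `checkEnv(process.env)`; a key that is present but blank counts as unset.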
Commands
| Command | Description |
|---|---|
| `run` | Full pipeline: ingest → index → search → answer → evaluate → report |
| `compare` | Run benchmark across multiple providers simultaneously |
| `ingest` | Ingest benchmark data into provider |
| `search` | Run search phase only |
| `test` | Test single question |
| `status` | Check run progress |
| `list-questions` | Browse benchmark questions |
| `show-failures` | Debug failed questions |
| `serve` | Start web UI |
| `help` | Show help (`help providers`, `help models`, `help benchmarks`) |
Options
-p, --provider          Memory provider (supermemory, mem0, zep)
-b, --benchmark         Benchmark (locomo, longmemeval, convomem)
-j, --judge             Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.)
-r, --run-id            Run identifier (auto-generated if omitted)
-m, --answering-model   Model for answer generation (default: gpt-4o)
-l, --limit             Limit number of questions
-q, --question-id       Specific question (for test command)
--force                 Clear checkpoint and restart
Examples
# Full run
bun run src/index.ts run -p mem0 -b locomo
# With custom run ID
bun run src/index.ts run -p mem0 -b locomo -r my-test
# Resume existing run
bun run src/index.ts run -r my-test
# Limited questions
bun run src/index.ts run -p supermemory -b locomo -l 10
# Different models
bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash
# Compare multiple providers
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5
# Test single question
bun run src/index.ts test -r my-test -q question_42
# Debug
bun run src/index.ts status -r my-test
bun run src/index.ts show-failures -r my-test
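The `compare` example passes providers as a comma-separated list. A minimal sketch of how such a value could be split and validated is shown here; `parseProviders` and `KNOWN_PROVIDERS` are hypothetical names, the provider names are the three listed under Options, and the real CLI may parse this argument differently.

```typescript
// Hypothetical argument parser; provider names are taken from the Options list.
const KNOWN_PROVIDERS = ["supermemory", "mem0", "zep"] as const;
type ProviderName = (typeof KNOWN_PROVIDERS)[number];

// Type guard: true when `name` is one of the known provider names.
function isKnownProvider(name: string): name is ProviderName {
  return (KNOWN_PROVIDERS as readonly string[]).includes(name);
}

// Splits a value such as "supermemory,mem0,zep" into unique provider names,
// preserving order, and throws if any name is not recognised.
function parseProviders(arg: string): ProviderName[] {
  const names = arg
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part !== "");
  const unknown = names.filter((name) => !isKnownProvider(name));
  if (unknown.length > 0) {
    throw new Error(`Unknown provider(s): ${unknown.join(", ")}`);
  }
  return [...new Set(names.filter(isKnownProvider))];
}
```

Rejecting unknown names up front means a typo fails before any provider is called.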
Pipeline
1. INGEST     Load benchmark sessions → Push to provider
2. INDEX      Wait for provider indexing
3. SEARCH     Query provider → Retrieve context
4. ANSWER     Build prompt → Generate answer via LLM
5. EVALUATE   Compare to ground truth → Score via judge
6. REPORT     Aggregate scores → Output accuracy + latency
Each phase checkpoints independently, so a failed run resumes from the last phase that completed successfully.
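The resume behaviour can be illustrated with a small sketch. It is not the project's implementation: the phase names follow the pipeline list above, while `runPipeline`, `PhaseHandlers`, and the shape of the saved state are assumptions made for illustration. Handlers are synchronous here to keep the example short; real phases would be asynchronous.

```typescript
// Phase names follow the pipeline list; everything else is hypothetical.
const PHASES = ["ingest", "index", "search", "answer", "evaluate", "report"] as const;
type Phase = (typeof PHASES)[number];

type PhaseHandlers = Record<Phase, () => void>;

// Runs every phase not yet recorded as completed, in pipeline order.
// `save` is called after each successful phase, so if a handler throws,
// the saved state still names the last phase that finished.
function runPipeline(
  done: readonly Phase[],
  handlers: PhaseHandlers,
  save: (completed: Phase[]) => void,
): Phase[] {
  const completed = [...done];
  for (const phase of PHASES) {
    if (completed.includes(phase)) continue; // finished in an earlier attempt
    handlers[phase](); // a throw here aborts the run before `save`
    completed.push(phase);
    save([...completed]);
  }
  return completed;
}
```

Calling `runPipeline` again with the last saved list skips the finished phases and retries from the one that failed.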
Checkpointing
Runs persist to data/runs/{runId}/:
- checkpoint.json - Run state and progress
- results/ - Search results per question
- report.json - Final report
Re-running with the same run ID resumes the existing run. Use --force to clear the checkpoint and restart.
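As a rough illustration of the layout and the resume rule, the sketch below builds the paths for a run and decides how it should start. The file names come from the list above; `runPaths`, `startMode`, and the `StartMode` labels are hypothetical and only mirror the behaviour described here.

```typescript
// Paths follow the documented layout; function and type names are hypothetical.
interface RunPaths {
  root: string;
  checkpoint: string;
  results: string;
  report: string;
}

// Builds the on-disk locations for a run ID under data/runs/{runId}/.
function runPaths(runId: string, baseDir = "data/runs"): RunPaths {
  const root = `${baseDir}/${runId}`;
  return {
    root,
    checkpoint: `${root}/checkpoint.json`,
    results: `${root}/results`,
    report: `${root}/report.json`,
  };
}

type StartMode = "fresh" | "resume" | "restart";

// Decides how a run starts, given whether a checkpoint already exists
// for this run ID and whether --force was passed.
function startMode(checkpointExists: boolean, force: boolean): StartMode {
  if (!checkpointExists) return "fresh";
  return force ? "restart" : "resume";
}
```

For example, `runPaths("my-test").checkpoint` evaluates to `data/runs/my-test/checkpoint.json`.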
Extending
| Component | Guide |
|---|---|
| Add Provider | src/providers/README.md |
| Add Benchmark | src/benchmarks/README.md |
| Add Judge | src/judges/README.md |
| Project Structure | src/README.md |
License
MIT
