MemoryBench

supermemoryai / memorybench

Unified benchmark for evaluating conversational memory and RAG across multiple datasets

209 42 Language: TypeScript License: MIT Updated: 3mo ago

A pluggable benchmarking framework for evaluating memory and context systems.

Features

  • 🔌 Interoperable: mix and match any provider with any benchmark
  • 🧩 Bring your own benchmarks: plug in custom datasets and tasks
  • ♻️ Checkpointed runs: resume from any pipeline stage (ingest → index → search → answer → evaluate)
  • 🆚 Multi-provider comparison: run the same benchmark across providers side-by-side
  • 🧪 Judge-agnostic: swap GPT-4o, Claude, Gemini, etc. without code changes
  • 📊 Structured reports: export run status, failures, and metrics for analysis
  • 🖥️ Web UI: inspect runs, questions, and failures interactively, in real-time!

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Benchmarks │    │  Providers  │    │   Judges    │
│  (LoCoMo,   │    │ (Supermem,  │    │  (GPT-4o,   │
│  LongMem..) │    │  Mem0, Zep) │    │  Claude..)  │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       └──────────────────┼──────────────────┘
                          ▼
              ┌───────────────────────┐
              │      MemoryBench      │
              └───────────┬───────────┘
                          ▼
   ┌────────┬─────────┬────────┬──────────┬────────┐
   │ Ingest │ Indexing│ Search │  Answer  │Evaluate│
   └────────┴─────────┴────────┴──────────┴────────┘
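
The diagram above shows three pluggable roles feeding one pipeline. As a rough illustration of how such roles could be typed, here is a minimal sketch. The interface names (Provider, Judge, Session, Question), their method signatures, and the two toy implementations are assumptions made for this example, not MemoryBench's actual API; the real contracts are described in the guides listed under Extending.

```typescript
// Hypothetical shapes for illustration only; not the repository's real interfaces.
interface Session { id: string; messages: string[] }
interface Question { id: string; text: string; groundTruth: string }

interface Provider {
  name: string;
  ingest(sessions: Session[]): Promise<void>;
  search(query: string): Promise<string[]>;
}

interface Judge {
  name: string;
  score(question: Question, answer: string): Promise<boolean>;
}

// Toy provider: keeps messages in memory and returns those sharing a word with the query.
class InMemoryProvider implements Provider {
  name = "in-memory";
  private messages: string[] = [];

  async ingest(sessions: Session[]): Promise<void> {
    for (const s of sessions) this.messages.push(...s.messages);
  }

  async search(query: string): Promise<string[]> {
    const words = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
    return this.messages.filter((m) =>
      m.toLowerCase().split(/\W+/).some((w) => words.has(w)),
    );
  }
}

// Toy judge: case-insensitive exact match against the ground truth.
class ExactMatchJudge implements Judge {
  name = "exact-match";

  async score(question: Question, answer: string): Promise<boolean> {
    return answer.trim().toLowerCase() === question.groundTruth.trim().toLowerCase();
  }
}
```

Because the pipeline would only depend on the interfaces, any provider could in principle be paired with any benchmark and any judge, which is the interoperability the feature list describes.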

Quick Start

bun install
cp .env.example .env.local  # Add your API keys
bun run src/index.ts run -p supermemory -b locomo

Configuration

# Providers (at least one)
SUPERMEMORY_API_KEY=
MEM0_API_KEY=
ZEP_API_KEY=

# Judges (at least one)
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
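
The "at least one" rule for each group can be checked before a run starts. The sketch below is one way to do that. The key names come from the configuration above, while the validateEnv helper and its messages are hypothetical and not part of MemoryBench.

```typescript
// Key names are taken from the configuration above; the helper itself is hypothetical.
const PROVIDER_KEYS = ["SUPERMEMORY_API_KEY", "MEM0_API_KEY", "ZEP_API_KEY"];
const JUDGE_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"];

type Env = Record<string, string | undefined>;

// Returns a list of problems; an empty list means both groups have at least one key set.
function validateEnv(env: Env): string[] {
  const isSet = (key: string): boolean => (env[key] ?? "").trim() !== "";
  const problems: string[] = [];
  if (!PROVIDER_KEYS.some(isSet)) {
    problems.push(`Set at least one provider key: ${PROVIDER_KEYS.join(", ")}`);
  }
  if (!JUDGE_KEYS.some(isSet)) {
    problems.push(`Set at least one judge key: ${JUDGE_KEYS.join(", ")}`);
  }
  return problems;
}
```

Calling validateEnv(process.env) at startup and printing any returned problems would surface a missing key before any provider or judge request is made.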

Commands

Command         Description
run             Full pipeline: ingest → index → search → answer → evaluate → report
compare         Run benchmark across multiple providers simultaneously
ingest          Ingest benchmark data into provider
search          Run search phase only
test            Test single question
status          Check run progress
list-questions  Browse benchmark questions
show-failures   Debug failed questions
serve           Start web UI
help            Show help (help providers, help models, help benchmarks)
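
To illustrate what running providers simultaneously can look like, here is a minimal sketch built on Promise.allSettled, so that one provider failing does not abort the others. The compareProviders function, the Runner type, and the result shapes are assumptions for this example; how compare is actually implemented is not shown in this README.

```typescript
// Hypothetical result and runner types; a real runner would execute the full pipeline.
type RunResult = { provider: string; accuracy: number };
type Runner = (provider: string) => Promise<RunResult>;

type Comparison = {
  succeeded: RunResult[];
  failed: { provider: string; reason: string }[];
};

// Starts every provider run at once and collects successes and failures separately.
async function compareProviders(providers: string[], run: Runner): Promise<Comparison> {
  const settled = await Promise.allSettled(providers.map((p) => run(p)));
  const out: Comparison = { succeeded: [], failed: [] };
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      out.succeeded.push(result.value);
    } else {
      out.failed.push({ provider: providers[i], reason: String(result.reason) });
    }
  });
  return out;
}
```

Promise.allSettled is used here instead of Promise.all because a comparison is still useful when only some providers finish.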

Options

-p, --provider         Memory provider (supermemory, mem0, zep)
-b, --benchmark        Benchmark (locomo, longmemeval, convomem)
-j, --judge            Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.)
-r, --run-id           Run identifier (auto-generated if omitted)
-m, --answering-model  Model for answer generation (default: gpt-4o)
-l, --limit            Limit number of questions
-q, --question-id      Specific question (for test command)
--force                Clear checkpoint and restart
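
For readers writing scripts around the CLI, the option table above maps naturally onto parseArgs from node:util. The sketch below is an illustration of that mapping only. The parseCli wrapper is hypothetical, and MemoryBench's own argument parsing may be implemented differently.

```typescript
import { parseArgs } from "node:util";

// Hypothetical wrapper mirroring the documented options; not the project's own parser.
function parseCli(argv: string[]) {
  const { values, positionals } = parseArgs({
    args: argv,
    allowPositionals: true, // the first positional is the command, e.g. "run"
    options: {
      provider: { type: "string", short: "p" },
      benchmark: { type: "string", short: "b" },
      judge: { type: "string", short: "j" },
      "run-id": { type: "string", short: "r" },
      "answering-model": { type: "string", short: "m", default: "gpt-4o" },
      limit: { type: "string", short: "l" },
      "question-id": { type: "string", short: "q" },
      force: { type: "boolean", default: false },
    },
  });
  // parseArgs only yields strings and booleans, so the limit is converted here.
  const limit = values.limit === undefined ? undefined : Number.parseInt(values.limit, 10);
  return { command: positionals[0], ...values, limit };
}
```

parseArgs runs in strict mode by default, so this sketch throws on any flag that is not in the table above.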

Examples

# Full run
bun run src/index.ts run -p mem0 -b locomo

# With custom run ID
bun run src/index.ts run -p mem0 -b locomo -r my-test

# Resume existing run
bun run src/index.ts run -r my-test

# Limited questions
bun run src/index.ts run -p supermemory -b locomo -l 10

# Different models
bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash

# Compare multiple providers
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5

# Test single question
bun run src/index.ts test -r my-test -q question_42

# Debug
bun run src/index.ts status -r my-test
bun run src/index.ts show-failures -r my-test

Pipeline

1. INGEST    Load benchmark sessions → Push to provider
2. INDEX     Wait for provider indexing
3. SEARCH    Query provider → Retrieve context
4. ANSWER    Build prompt → Generate answer via LLM
5. EVALUATE  Compare to ground truth → Score via judge
6. REPORT    Aggregate scores → Output accuracy + latency
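
Steps 5 and 6 reduce per-question judgments to accuracy and latency figures. A minimal sketch of that aggregation follows. The QuestionResult and Report shapes, and the choice of mean and nearest-rank p95 latency, are assumptions for illustration; the actual contents of report.json are not specified here.

```typescript
// Hypothetical per-question record; field names are assumptions.
type QuestionResult = { questionId: string; correct: boolean; searchLatencyMs: number };

type Report = { total: number; accuracy: number; meanLatencyMs: number; p95LatencyMs: number };

function buildReport(results: QuestionResult[]): Report {
  if (results.length === 0) {
    return { total: 0, accuracy: 0, meanLatencyMs: 0, p95LatencyMs: 0 };
  }
  const correct = results.filter((r) => r.correct).length;
  const latencies = results.map((r) => r.searchLatencyMs).sort((a, b) => a - b);
  const mean = latencies.reduce((sum, ms) => sum + ms, 0) / latencies.length;
  // Nearest-rank percentile: smallest sample with at least 95% of samples at or below it.
  const p95 = latencies[Math.ceil(0.95 * latencies.length) - 1];
  return {
    total: results.length,
    accuracy: correct / results.length,
    meanLatencyMs: mean,
    p95LatencyMs: p95,
  };
}
```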

Each phase checkpoints independently, so a failed run resumes from the last phase that completed successfully.
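
The resume behavior can be pictured as a loop that skips phases already recorded as complete. The sketch below tracks progress at phase level only. The Checkpoint shape, the PhaseHandlers type, and runPipeline are hypothetical, and the real runner may checkpoint at a finer granularity, for example per question.

```typescript
const PHASES = ["ingest", "index", "search", "answer", "evaluate", "report"] as const;
type Phase = (typeof PHASES)[number];

// Hypothetical checkpoint shape: the phases that have finished successfully.
type Checkpoint = { completed: Phase[] };
type PhaseHandlers = Record<Phase, () => Promise<void>>;

async function runPipeline(checkpoint: Checkpoint, handlers: PhaseHandlers): Promise<Checkpoint> {
  for (const phase of PHASES) {
    if (checkpoint.completed.includes(phase)) continue; // finished in an earlier attempt
    await handlers[phase](); // a throw here leaves the checkpoint at the last success
    checkpoint.completed.push(phase); // record progress only after the phase succeeds
  }
  return checkpoint;
}
```

If the search handler throws, the checkpoint still lists ingest and index, so calling runPipeline again with the same checkpoint starts at search instead of re-ingesting.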

Checkpointing

Runs persist to data/runs/{runId}/:

  • checkpoint.json - Run state and progress
  • results/ - Search results per question
  • report.json - Final report

Re-running with the same run ID resumes from the saved checkpoint. Use --force to clear the checkpoint and restart.
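
As an illustration of the resume-or-restart rule, the sketch below loads checkpoint.json when it exists and starts fresh otherwise. The directory layout follows the description above. The RunState shape and the helper names are assumptions, and in this sketch --force removes only checkpoint.json, which may be narrower than what the real command clears.

```typescript
import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";

// Hypothetical run state; the real checkpoint.json schema is not documented here.
type RunState = { runId: string; completedPhases: string[] };

// Mirrors the data/runs/{runId}/ layout; baseDir is a parameter so the sketch is testable.
function checkpointPath(baseDir: string, runId: string): string {
  return join(baseDir, "data", "runs", runId, "checkpoint.json");
}

function loadOrCreateRun(baseDir: string, runId: string, force = false): RunState {
  const file = checkpointPath(baseDir, runId);
  if (force) rmSync(file, { force: true }); // --force: discard saved progress
  if (existsSync(file)) return JSON.parse(readFileSync(file, "utf8")) as RunState;
  return { runId, completedPhases: [] };
}

function saveRun(baseDir: string, state: RunState): void {
  const file = checkpointPath(baseDir, state.runId);
  mkdirSync(dirname(file), { recursive: true });
  writeFileSync(file, JSON.stringify(state, null, 2));
}
```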

Extending

Component          Guide
Add Provider       src/providers/README.md
Add Benchmark      src/benchmarks/README.md
Add Judge          src/judges/README.md
Project Structure  src/README.md

License

MIT
