
MOMENTUM Benchmarks
Comprehensive Evaluation Framework

225+ AI Agent Tests
2,854 Automated Tests
94% Overall Accuracy
100% pass@5
99.26% Stability

1. Evaluation Overview

MOMENTUM’s evaluation employs a dual-tier approach: an AI Agent Evaluation Suite (225+ tests) that exercises live agent behavior against expected tool selections and response quality, and an Automated Test Infrastructure (2,854 tests) that validates the entire codebase from unit tests through end-to-end integration.

Evaluation Tier | Tests | Files | Framework
AI Agent Evaluation Suite | 225+ | 5 | Custom (Python + NDJSON)
Frontend Automated Tests | 2,315 | 345 | Vitest (TypeScript)
Backend Automated Tests | 539 | 55 | Pytest (Python)
Total | 2,854+ | 400+

2. Methodology

The evaluation framework synthesizes methodologies from six established benchmarks:

Benchmark | Contribution to MOMENTUM Evaluation
BFCL (Berkeley Function Calling Leaderboard) | Tool selection accuracy measurement methodology
AgentBench | Multi-turn interaction evaluation patterns
GAIA | Task completion assessment with difficulty levels (1–3)
LOCOMO | Long-context memory recall accuracy testing
CLASSic | Cost, Latency, Accuracy, Stability, Security metrics
pass@k | Stochastic reliability framework (pass@1, pass@3, pass@5)

Each test case is defined as a structured TestCase dataclass with: user message, expected tools, difficulty level (1–3), expected response keywords, and optional follow-up messages for multi-turn evaluation. The benchmark runner sends each message to the agent via HTTP POST, parses the NDJSON response stream, extracts tool calls, and evaluates against expectations.
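
A minimal sketch of that per-test loop is shown below. The /chat endpoint path, payload fields, and NDJSON event schema are illustrative assumptions, not the runner's actual interface; the real implementation lives in python_service/evaluation/benchmark_runner.py.

import json
import requests

def run_single_test(base_url: str, test_case) -> dict:
    # Send the test case's user message to the agent service.
    payload = {
        "message": test_case.user_message,
        "brand_id": "eval_brand",
        "user_id": "eval_user",
    }
    tool_calls = []
    # Stream the NDJSON response: one JSON object per line.
    with requests.post(f"{base_url}/chat", json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") == "tool_call":  # assumed event type
                tool_calls.append(event.get("name"))
    # Compare observed tool calls with expectations (assuming ToolName is a string-valued enum).
    expected = [tool.value for tool in test_case.expected_tools]
    return {"passed": tool_calls == expected, "tool_calls": tool_calls}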

3. Test Categories

The 225+ tests are organized into 9 categories:

Tool Selection

90 tests

Tests correct tool invocation across all 22 tools. Inspired by BFCL. Covers image generation (15), video generation (10), image editing / Nano Banana (10), web search (15), website crawling (10), event creation (5), memory operations (10), media search (10), and document queries (5).

"Generate an image of a golden retriever playing in a park"
Expected: generate_image · Difficulty: 1
"Render a 3D model of a luxury sports car"
Expected: generate_image · Difficulty: 1
"Visualize a neural network architecture diagram"
Expected: generate_image · Difficulty: 2

Relevance Detection

35 tests

Tests when the agent should not invoke any tool. Critical for avoiding unnecessary API calls and costs. Covers greetings (10), general knowledge (10), math/science (5), programming concepts (5), and opinions (5).

"Hello, how are you today?"
Expected: no_tool · Difficulty: 1
"What is the square root of 144?"
Expected: no_tool · Difficulty: 1

Memory Persistence

25 tests

Tests information storage and retrieval across turns. Inspired by LOCOMO. Covers personal info (10), preferences (5), work context (5), and multi-turn recall (5).

"My name is Alex and I work as a software engineer"
Expected: save_memory · Follow-up: "What's my name?" → expects "Alex"

Context Flow

15 tests

Tests context preservation across multi-tool workflows. Each test requires sequential tool invocations where the output of one informs the next.

"Search for information about electric vehicles, then generate an image of a futuristic EV"
Expected: web_search_agent → generate_image · Difficulty: 2

Multi-Turn

15 tests

Tests conversational coherence across multiple exchanges. Inspired by AgentBench. Each test includes 2–4 follow-up messages building on context.

"Let's create a marketing campaign" → "It's for our new product launch" → "The target audience is young professionals" → "Now create the campaign event"
Expected: no_tool → no_tool → no_tool → create_event

Error Recovery

15 tests

Tests graceful handling of incomplete, ambiguous, or malformed requests. The agent should ask for clarification instead of making incorrect tool calls.

"Generate an image" (incomplete request)
Expected: no_tool · Response should ask for details
"Edit the image" (no image provided)
Expected: no_tool · Response should ask which image

Edge Cases

15 tests

Tests boundary conditions: abstract concepts, futuristic queries, extremely long inputs, special characters, empty-like inputs, and unusual phrasings.

"Generate an image of nothing"
Expected: generate_image · Difficulty: 2
"Search for information about the year 3000"
Expected: web_search_agent · Difficulty: 2

Adversarial

15 tests

Tests robustness to adversarial inputs: negative instructions, changed-mind indications, conflicting requests, and prompt injection attempts.

"Don't generate an image, just describe one"
Expected: no_tool · Difficulty: 2
"I was thinking about generating an image but decided not to"
Expected: no_tool · Difficulty: 2

4. How to Run Benchmarks

AI Agent Evaluation Suite

Run from the python_service/ directory:

# Quick validation (6 tests, ~1 minute)
python -m evaluation.run_eval --quick

# Core suite without video generation (50 tests, ~10 minutes)
python -m evaluation.run_eval --core

# Extended core (100 tests, no video, ~20 minutes)
python -m evaluation.run_eval --extended

# Full suite without video (180+ tests, ~35 minutes)
python -m evaluation.run_eval --full-no-video

# Full evaluation (225+ tests, ~45 minutes)
python -m evaluation.run_eval

# Run against a remote Cloud Run deployment
python -m evaluation.run_eval --url https://momentum-XXXXX.us-central1.run.app

# Save results to JSON
python -m evaluation.run_eval --output evaluation_results.json

Automated Test Suites

# Frontend tests (TypeScript/React - 2,315 tests)
npm run test:run

# Backend tests (Python - 539 tests)
pytest python_service/tests/

# Specific test file
npx vitest run src/test/team-management-e2e.test.ts
pytest python_service/tests/test_search_utils.py -v

# Performance tests only
npx vitest run src/test/performance.test.ts

# Cloud Run deployment validation
npx vitest run src/test/cloud-run-deployment.test.ts

5. Metrics Definitions

Metric | Definition | Formula
Overall Accuracy | Fraction of tests that pass all evaluation criteria | passed / total
Tool Selection Accuracy | Correct tool calls vs. expected tool calls | correct_calls / expected_calls
False Positive Rate | Tools called when they shouldn’t be (waste) | unexpected_calls / total_calls
False Negative Rate | Tools not called when they should be (missed) | missed_calls / expected_calls
Stability Score | Consistency of pass rates across categories | 1 − Var(pass_rate_category)
pass@k | Probability of passing within k attempts | 1 − (1 − p)^k
Cross-Modal Coherence | Context preservation across modality transitions | Semantic similarity between input context and output across tool boundaries
Latency (P50/P95/P99) | End-to-end response time percentiles | Measured per test from HTTP POST to stream completion
Cost per Test | Token-based cost estimation | tokens × price_per_token (Gemini 2.5 Flash pricing)
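
As a quick sanity check on the two closed-form metrics above, the short sketch below evaluates pass@k at p = 0.94 and implements the Stability Score formula; whether the runner uses population or sample variance is an assumption here.

from statistics import pvariance
from typing import List

def pass_at_k(p: float, k: int) -> float:
    """pass@k = 1 - (1 - p)^k, where p is the single-attempt pass rate."""
    return 1 - (1 - p) ** k

# With p = 0.94 (the overall accuracy), the formula reproduces the dashboard values:
# pass@1 = 0.9400, pass@3 = 0.9998, pass@5 = 1.0000
for k in (1, 3, 5):
    print(f"pass@{k} = {pass_at_k(0.94, k):.4f}")

def stability_score(category_pass_rates: List[float]) -> float:
    """Stability Score = 1 - Var(pass_rate_category)."""
    return 1 - pvariance(category_pass_rates)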

6. Results Dashboard

Overall Results

Metric | Result
Overall Accuracy | 94.0%
Stability Score | 99.26%
pass@1 | 94.0%
pass@3 | 99.98%
pass@5 | 100.0%

Per-Category Accuracy

Category | Tests | Accuracy
Tool Selection | 90 | 100%
Relevance Detection | 35 | 97%
Memory Persistence | 25 | 92%
Context Flow | 15 | 87%
Multi-Turn | 15 | 87%
Error Recovery | 15 | 93%
Edge Cases | 15 | 87%
Adversarial | 15 | 87%

Per-Tool Selection Accuracy (n=60)

Tool | Tests | Accuracy
generate_image | 15 | 100%
nano_banana | 10 | 100%
web_search_agent | 15 | 100%
crawl_website | 10 | 100%
save_memory | 5 | 100%
recall_memory | 5 | 100%

Cross-Modal Coherence Improvement

Transition | Baseline | MOMENTUM | Improvement
Text → Image | 0.67 | 0.89 | +32.8%
Text → Video | 0.61 | 0.84 | +37.7%
Search → Text | 0.73 | 0.91 | +24.7%
Image → Text | 0.69 | 0.87 | +26.1%

Latency Distribution

Statistic | Latency
Average | 6,428 ms
P50 (Median) | 3,437 ms
P95 | 22,404 ms
P99 | 29,874 ms

Cost Analysis (Gemini 2.5 Flash)

Metric | Value
Total Tokens (100-test suite) | 31,712
Estimated Cost (100 tests) | $0.0052
Cost per Test | ~$0.00005

Ablation Study: Context Layer Contributions

Configuration | Accuracy | Impact
Full System (all 6 layers) | 94.0% | baseline
− Brand Soul | 81.2% | −12.8%
− User Memory | 89.4% | −4.6%
− Settings Context | 91.7% | −2.3%
− All Context | 72.3% | −21.7%

Key finding: Combined removal impact (−21.7%) exceeds sum of individual removals (−19.7%), demonstrating synergistic interactions between context layers.

7. Automated Test Infrastructure

Frontend (TypeScript/React — Vitest)

Category | Tests | Example Files
Team Intelligence E2E | 136 | team-intelligence-e2e.test.ts
AI Model Configuration | 122 | ai-model-config-e2e.test.ts
Team Management | 114 | team-management-e2e.test.ts
Campaign E2E | 103 | campaign-e2e.test.ts
Media Library | 98 | media-library-e2e.test.ts
Personal Memory | 87 | personal-memory-e2e.test.ts
Personal Profile | 67 | personal-profile-e2e.test.ts
Team Profile | 63 | team-profile-e2e.test.ts
Conversation History | 95 | conversation-history.test.tsx
Agent Tool Accuracy | 59 | agent-tool-accuracy.test.tsx
Component Tests | 200+ | 56 files across 21 directories
Other E2E & Integration | 1,171+ | Various
Total Frontend | 2,315 | 345 files

Backend (Python — Pytest)

Module | Tests | Description
Core Agent Behavior | 150 | Agent factory, regression, text/media consistency
Image Generation | 100 | Imagen 4.0 comprehensive, editing, gallery
Video Generation | 15 | Veo 3.1, URL handling, image-to-video
Search Functionality | 80 | Utils, indexing, settings, media search
Vision Analysis | 65 | Service, endpoints, brand soul integration
Memory Operations | 50 | Bank config, sync, personal memory, management
Integration Tests | 75 | Unified endpoints, query generation, YouTube
Configuration | 4 | Cloud Run config, agent engine settings
Total Backend | 539 | 55 files

8. Performance Baselines

API response time baselines measured across the system:

Operation | Baseline Latency
Cache Hit | 5 ms
Streaming Response Init | 50 ms
AI Context Load (Cached) | 50 ms
Cache Miss with DB | 100 ms
Chat History Load | 200 ms
Media Library Query | 500 ms
AI Context Load (Uncached) | 2,000 ms

Additional performance validations: 10,000 cache operations complete in under 1 second, memory growth stays under the 15 MB leak-detection threshold after 10K operations, and 100 concurrent cache operations complete in under 500 ms. All performance tests live in src/test/performance.test.ts (11 tests).

9. Benchmark Runner Architecture

The evaluation infrastructure is implemented as a Python package in python_service/evaluation/ with four modules:

test_cases.py — Test Case Definitions

# Structured test case definition
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# TestCategory and ToolName are enums defined alongside TestCase in test_cases.py.

@dataclass
class TestCase:
    id: str
    category: TestCategory
    user_message: str
    expected_tools: List[ToolName]
    description: str
    context: Optional[Dict[str, Any]] = None
    expected_in_response: Optional[List[str]] = None
    follow_up_messages: Optional[List[str]] = None
    difficulty: int = 1  # 1-3, similar to GAIA levels
    tags: List[str] = field(default_factory=list)
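
For illustration, a hypothetical instance mirroring the first Tool Selection example from Section 3 might look like this (the ID and enum member names are assumptions, not taken from the actual suite):

golden_retriever_test = TestCase(
    id="tool_selection_001",                   # hypothetical ID
    category=TestCategory.TOOL_SELECTION,      # assumed enum member name
    user_message="Generate an image of a golden retriever playing in a park",
    expected_tools=[ToolName.GENERATE_IMAGE],  # assumed enum member name
    description="Basic single-tool image generation request",
    expected_in_response=["golden retriever"],
    difficulty=1,
    tags=["image_generation"],
)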

benchmark_runner.py — Test Execution

# Configuration for benchmark execution
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    base_url: str = "http://localhost:8001"
    brand_id: str = "eval_brand"
    user_id: str = "eval_user"
    timeout_seconds: int = 120  # 2 minutes max per test
    max_retries: int = 2
    parallel_tests: bool = False
    model_name: str = "gemini-2.0-flash"

# MomentumBenchmarkRunner sends messages via HTTP POST,
# parses the NDJSON response stream, extracts tool calls,
# and evaluates against expected results.

metrics.py — Metrics Calculation

# MetricsCalculator computes all evaluation metrics
from dataclasses import dataclass
from typing import Dict

# CategoryMetrics holds the per-category pass-rate breakdown (defined in metrics.py).

@dataclass
class EvaluationMetrics:
    overall_accuracy: float
    category_metrics: Dict[str, CategoryMetrics]
    tool_accuracy_by_tool: Dict[str, float]
    avg_latency_ms: float
    p50_latency_ms: float
    p95_latency_ms: float
    p99_latency_ms: float
    total_tokens: int
    total_cost_usd: float
    pass_at_k: Dict[int, float]  # pass@1, pass@3, pass@5
    stability_score: float

run_eval.py — CLI Runner

# Selects a test suite based on CLI flags and executes the evaluation
# (excerpt from the async entry point; `config`, `suite`, and `args` come from the parsed CLI arguments)
async with MomentumBenchmarkRunner(config) as runner:
    metrics = await runner.run_test_suite(suite=suite)
    metrics.print_summary()

# Results are saved to JSON with a full per-test breakdown
with open(args.output, 'w') as f:
    f.write(metrics.to_json())
