
MOMENTUM Benchmarks
Comprehensive Evaluation Framework

225+ AI Agent Tests
2,854 Automated Tests
94% Overall Accuracy
100% pass@5
99.26% Stability

1. Evaluation Overview

MOMENTUM’s evaluation employs a dual-tier approach: an AI Agent Evaluation Suite (225+ tests) that exercises live agent behavior against expected tool selections and response quality, and an Automated Test Infrastructure (2,854 tests) that validates the entire codebase from unit tests through end-to-end integration.

Evaluation Tier | Tests | Files | Framework
AI Agent Evaluation Suite | 225+ | 5 | Custom (Python + NDJSON)
Frontend Automated Tests | 2,315 | 345 | Vitest (TypeScript)
Backend Automated Tests | 539 | 55 | Pytest (Python)
Total | 2,854+ | 400+

2. Methodology

The evaluation framework synthesizes methodologies from six established benchmarks:

Benchmark | Contribution to MOMENTUM Evaluation
BFCL (Berkeley Function Calling Leaderboard) | Tool selection accuracy measurement methodology
AgentBench | Multi-turn interaction evaluation patterns
GAIA | Task completion assessment with difficulty levels (1–3)
LOCOMO | Long-context memory recall accuracy testing
CLASSic | Cost, Latency, Accuracy, Stability, Security metrics
pass@k | Stochastic reliability framework (pass@1, pass@3, pass@5)

Each test case is defined as a structured TestCase dataclass with: user message, expected tools, difficulty level (1–3), expected response keywords, and optional follow-up messages for multi-turn evaluation. The benchmark runner sends each message to the agent via HTTP POST, parses the NDJSON response stream, extracts tool calls, and evaluates against expectations.
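
A minimal sketch of that per-test loop is shown below. The /chat endpoint path, payload fields, and NDJSON event schema are illustrative assumptions, not the runner's actual interface; the real implementation lives in python_service/evaluation/benchmark_runner.py.

import json
import requests

def run_single_test(base_url: str, test_case) -> dict:
    # Send the test case's user message to the agent service.
    payload = {
        "message": test_case.user_message,
        "brand_id": "eval_brand",
        "user_id": "eval_user",
    }
    tool_calls = []
    # Stream the NDJSON response: one JSON object per line.
    with requests.post(f"{base_url}/chat", json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") == "tool_call":  # assumed event type
                tool_calls.append(event.get("name"))
    # Compare observed tool calls with expectations (assuming ToolName is a string-valued enum).
    expected = [tool.value for tool in test_case.expected_tools]
    return {"passed": tool_calls == expected, "tool_calls": tool_calls}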

3. Test Categories

The 225+ tests are organized into 9 categories:

Tool Selection

90 tests

Tests correct tool invocation across all 22 tools. Inspired by BFCL. Covers image generation (15), video generation (10), image editing / Nano Banana (10), web search (15), website crawling (10), event creation (5), memory operations (10), media search (10), and document queries (5).

"Generate an image of a golden retriever playing in a park"
Expected: generate_image · Difficulty: 1
"Render a 3D model of a luxury sports car"
Expected: generate_image · Difficulty: 1
"Visualize a neural network architecture diagram"
Expected: generate_image · Difficulty: 2

Relevance Detection

35 tests

Tests when the agent should not invoke any tool. Critical for avoiding unnecessary API calls and costs. Covers greetings (10), general knowledge (10), math/science (5), programming concepts (5), and opinions (5).

"Hello, how are you today?"
Expected: no_tool · Difficulty: 1
"What is the square root of 144?"
Expected: no_tool · Difficulty: 1

Memory Persistence

25 tests

Tests information storage and retrieval across turns. Inspired by LOCOMO. Covers personal info (10), preferences (5), work context (5), and multi-turn recall (5).

"My name is Alex and I work as a software engineer"
Expected: save_memory · Follow-up: "What's my name?" → expects "Alex"

Context Flow

15 tests

Tests context preservation across multi-tool workflows. Each test requires sequential tool invocations where the output of one informs the next.

"Search for information about electric vehicles, then generate an image of a futuristic EV"
Expected: web_search_agent → generate_image · Difficulty: 2

Multi-Turn

15 tests

Tests conversational coherence across multiple exchanges. Inspired by AgentBench. Each test includes 2–4 follow-up messages building on context.

"Let's create a marketing campaign" → "It's for our new product launch" → "The target audience is young professionals" → "Now create the campaign event"
Expected: no_tool → no_tool → no_tool → create_event

Error Recovery

15 tests

Tests graceful handling of incomplete, ambiguous, or malformed requests. The agent should ask for clarification instead of making incorrect tool calls.

"Generate an image" (incomplete request)
Expected: no_tool · Response should ask for details
"Edit the image" (no image provided)
Expected: no_tool · Response should ask which image

Edge Cases

15 tests

Tests boundary conditions: abstract concepts, futuristic queries, extremely long inputs, special characters, empty-like inputs, and unusual phrasings.

"Generate an image of nothing"
Expected: generate_image · Difficulty: 2
"Search for information about the year 3000"
Expected: web_search_agent · Difficulty: 2

Adversarial

15 tests

Tests robustness to adversarial inputs: negative instructions, changed-mind indications, conflicting requests, and prompt injection attempts.

"Don't generate an image, just describe one"
Expected: no_tool · Difficulty: 2
"I was thinking about generating an image but decided not to"
Expected: no_tool · Difficulty: 2

4. How to Run Benchmarks

AI Agent Evaluation Suite

Run from the python_service/ directory:

# Quick validation (6 tests, ~1 minute)
python -m evaluation.run_eval --quick

# Core suite without video generation (50 tests, ~10 minutes)
python -m evaluation.run_eval --core

# Extended core (100 tests, no video, ~20 minutes)
python -m evaluation.run_eval --extended

# Full suite without video (180+ tests, ~35 minutes)
python -m evaluation.run_eval --full-no-video

# Full evaluation (225+ tests, ~45 minutes)
python -m evaluation.run_eval

# Run against a remote Cloud Run deployment
python -m evaluation.run_eval --url https://momentum-XXXXX.us-central1.run.app

# Save results to JSON
python -m evaluation.run_eval --output evaluation_results.json

Automated Test Suites

# Frontend tests (TypeScript/React - 2,315 tests)
npm run test:run

# Backend tests (Python - 539 tests)
pytest python_service/tests/

# Specific test file
npx vitest run src/test/team-management-e2e.test.ts
pytest python_service/tests/test_search_utils.py -v

# Performance tests only
npx vitest run src/test/performance.test.ts

# Cloud Run deployment validation
npx vitest run src/test/cloud-run-deployment.test.ts

5. Metrics Definitions

Metric | Definition | Formula
Overall Accuracy | Fraction of tests that pass all evaluation criteria | passed / total
Tool Selection Accuracy | Correct tool calls vs. expected tool calls | correct_calls / expected_calls
False Positive Rate | Tools called when they shouldn’t be (waste) | unexpected_calls / total_calls
False Negative Rate | Tools not called when they should be (missed) | missed_calls / expected_calls
Stability Score | Consistency of pass rates across categories | 1 − Var(pass_rate_category)
pass@k | Probability of passing within k attempts | 1 − (1 − p)^k
Cross-Modal Coherence | Context preservation across modality transitions | Semantic similarity between input context and output across tool boundaries
Latency (P50/P95/P99) | End-to-end response time percentiles | Measured per test from HTTP POST to stream completion
Cost per Test | Token-based cost estimation | tokens × price_per_token (Gemini 2.5 Flash pricing)
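
As a quick sanity check on the two closed-form metrics above, the short sketch below evaluates pass@k at p = 0.94 and implements the Stability Score formula; whether the runner uses population or sample variance is an assumption here.

from statistics import pvariance
from typing import List

def pass_at_k(p: float, k: int) -> float:
    """pass@k = 1 - (1 - p)^k, where p is the single-attempt pass rate."""
    return 1 - (1 - p) ** k

# With p = 0.94 (the overall accuracy), the formula reproduces the dashboard values:
# pass@1 = 0.9400, pass@3 = 0.9998, pass@5 = 1.0000
for k in (1, 3, 5):
    print(f"pass@{k} = {pass_at_k(0.94, k):.4f}")

def stability_score(category_pass_rates: List[float]) -> float:
    """Stability Score = 1 - Var(pass_rate_category)."""
    return 1 - pvariance(category_pass_rates)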

6. Results Dashboard

Overall Results

Metric | Result
Overall Accuracy | 94.0%
Stability Score | 99.26%
pass@1 | 94.0%
pass@3 | 99.98%
pass@5 | 100.0%

Per-Category Accuracy

Category | Tests | Accuracy
Tool Selection | 90 | 100%
Relevance Detection | 35 | 97%
Memory Persistence | 25 | 92%
Context Flow | 15 | 87%
Multi-Turn | 15 | 87%
Error Recovery | 15 | 93%
Edge Cases | 15 | 87%
Adversarial | 15 | 87%

Per-Tool Selection Accuracy (n=60)

Tool | Tests | Accuracy
generate_image | 15 | 100%
nano_banana | 10 | 100%
web_search_agent | 15 | 100%
crawl_website | 10 | 100%
save_memory | 5 | 100%
recall_memory | 5 | 100%

Cross-Modal Coherence Improvement

Transition | Baseline | MOMENTUM | Improvement
Text → Image | 0.67 | 0.89 | +32.8%
Text → Video | 0.61 | 0.84 | +37.7%
Search → Text | 0.73 | 0.91 | +24.7%
Image → Text | 0.69 | 0.87 | +26.1%

Latency Distribution

Statistic | Latency
Average | 6,428 ms
P50 (Median) | 3,437 ms
P95 | 22,404 ms
P99 | 29,874 ms

Cost Analysis (Gemini 2.5 Flash)

Metric | Value
Total Tokens (100-test suite) | 31,712
Estimated Cost (100 tests) | $0.0052
Cost per Test | ~$0.00005

Ablation Study: Context Layer Contributions

Configuration | Accuracy | Impact
Full System (all 6 layers) | 94.0% | baseline
− Brand Soul | 81.2% | −12.8%
− User Memory | 89.4% | −4.6%
− Settings Context | 91.7% | −2.3%
− All Context | 72.3% | −21.7%

Key finding: Combined removal impact (−21.7%) exceeds sum of individual removals (−19.7%), demonstrating synergistic interactions between context layers.

7. Automated Test Infrastructure

Frontend (TypeScript/React — Vitest)

Category | Tests | Example Files
Team Intelligence E2E | 136 | team-intelligence-e2e.test.ts
AI Model Configuration | 122 | ai-model-config-e2e.test.ts
Team Management | 114 | team-management-e2e.test.ts
Campaign E2E | 103 | campaign-e2e.test.ts
Media Library | 98 | media-library-e2e.test.ts
Personal Memory | 87 | personal-memory-e2e.test.ts
Personal Profile | 67 | personal-profile-e2e.test.ts
Team Profile | 63 | team-profile-e2e.test.ts
Conversation History | 95 | conversation-history.test.tsx
Agent Tool Accuracy | 59 | agent-tool-accuracy.test.tsx
Component Tests | 200+ | 56 files across 21 directories
Other E2E & Integration | 1,171+ | Various
Total Frontend | 2,315 | 345 files

Backend (Python — Pytest)

Module | Tests | Description
Core Agent Behavior | 150 | Agent factory, regression, text/media consistency
Image Generation | 100 | Imagen 4.0 comprehensive, editing, gallery
Video Generation | 15 | Veo 3.1, URL handling, image-to-video
Search Functionality | 80 | Utils, indexing, settings, media search
Vision Analysis | 65 | Service, endpoints, brand soul integration
Memory Operations | 50 | Bank config, sync, personal memory, management
Integration Tests | 75 | Unified endpoints, query generation, YouTube
Configuration | 4 | Cloud Run config, agent engine settings
Total Backend | 539 | 55 files

8. Performance Baselines

API response time baselines measured across the system:

Operation | Baseline Latency
Cache Hit | 5 ms
Streaming Response Init | 50 ms
AI Context Load (Cached) | 50 ms
Cache Miss with DB | 100 ms
Chat History Load | 200 ms
Media Library Query | 500 ms
AI Context Load (Uncached) | 2,000 ms

Additional performance validations: 10,000 cache operations complete in under 1 second, memory growth stays under the 15 MB leak-detection threshold after 10K operations, and 100 concurrent cache operations complete in under 500 ms. All performance tests live in src/test/performance.test.ts (11 tests).

9. Benchmark Runner Architecture

The evaluation infrastructure is implemented as a Python package in python_service/evaluation/ with four modules:

test_cases.py — Test Case Definitions

# Structured test case definition
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# TestCategory and ToolName are enums defined alongside TestCase in test_cases.py.

@dataclass
class TestCase:
    id: str
    category: TestCategory
    user_message: str
    expected_tools: List[ToolName]
    description: str
    context: Optional[Dict[str, Any]] = None
    expected_in_response: Optional[List[str]] = None
    follow_up_messages: Optional[List[str]] = None
    difficulty: int = 1  # 1-3, similar to GAIA levels
    tags: List[str] = field(default_factory=list)
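
For illustration, a hypothetical instance mirroring the first Tool Selection example from Section 3 might look like this (the ID and enum member names are assumptions, not taken from the actual suite):

golden_retriever_test = TestCase(
    id="tool_selection_001",                   # hypothetical ID
    category=TestCategory.TOOL_SELECTION,      # assumed enum member name
    user_message="Generate an image of a golden retriever playing in a park",
    expected_tools=[ToolName.GENERATE_IMAGE],  # assumed enum member name
    description="Basic single-tool image generation request",
    expected_in_response=["golden retriever"],
    difficulty=1,
    tags=["image_generation"],
)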

benchmark_runner.py — Test Execution

# Configuration for benchmark execution
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    base_url: str = "http://localhost:8001"
    brand_id: str = "eval_brand"
    user_id: str = "eval_user"
    timeout_seconds: int = 120  # 2 minutes max per test
    max_retries: int = 2
    parallel_tests: bool = False
    model_name: str = "gemini-2.0-flash"

# MomentumBenchmarkRunner sends messages via HTTP POST,
# parses the NDJSON response stream, extracts tool calls,
# and evaluates against expected results.

metrics.py — Metrics Calculation

# MetricsCalculator computes all evaluation metrics
from dataclasses import dataclass
from typing import Dict

# CategoryMetrics holds the per-category pass-rate breakdown (defined in metrics.py).

@dataclass
class EvaluationMetrics:
    overall_accuracy: float
    category_metrics: Dict[str, CategoryMetrics]
    tool_accuracy_by_tool: Dict[str, float]
    avg_latency_ms: float
    p50_latency_ms: float
    p95_latency_ms: float
    p99_latency_ms: float
    total_tokens: int
    total_cost_usd: float
    pass_at_k: Dict[int, float]  # pass@1, pass@3, pass@5
    stability_score: float

run_eval.py — CLI Runner

# Selects a test suite based on CLI flags and executes the evaluation
# (excerpt from the async entry point; `config`, `suite`, and `args` come from the parsed CLI arguments)
async with MomentumBenchmarkRunner(config) as runner:
    metrics = await runner.run_test_suite(suite=suite)
    metrics.print_summary()

# Results are saved to JSON with a full per-test breakdown
with open(args.output, 'w') as f:
    f.write(metrics.to_json())
