MOMENTUM Benchmarks
Comprehensive Evaluation Framework
1. Evaluation Overview
MOMENTUM’s evaluation employs a dual-tier approach: an AI Agent Evaluation Suite (225+ tests) that exercises live agent behavior, checking tool selection and response quality against expectations, and an Automated Test Infrastructure (2,854 tests) that validates the entire codebase from unit tests through end-to-end integration.
| Evaluation Tier | Tests | Files | Framework |
|---|---|---|---|
| AI Agent Evaluation Suite | 225+ | 5 | Custom (Python + NDJSON) |
| Frontend Automated Tests | 2,315 | 345 | Vitest (TypeScript) |
| Backend Automated Tests | 539 | 55 | Pytest (Python) |
| Total | 3,079+ | 405+ | — |
2. Methodology
The evaluation framework synthesizes methodologies from six established benchmarks:
| Benchmark | Contribution to MOMENTUM Evaluation |
|---|---|
| BFCL (Berkeley Function Calling Leaderboard) | Tool selection accuracy measurement methodology |
| AgentBench | Multi-turn interaction evaluation patterns |
| GAIA | Task completion assessment with difficulty levels (1–3) |
| LOCOMO | Long-context memory recall accuracy testing |
| CLASSic | Cost, Latency, Accuracy, Stability, Security metrics |
| pass@k | Stochastic reliability framework (pass@1, pass@3, pass@5) |
Each test case is defined as a structured TestCase dataclass with: user message, expected tools, difficulty level (1–3), expected response keywords, and optional follow-up messages for multi-turn evaluation. The benchmark runner sends each message to the agent via HTTP POST, parses the NDJSON response stream, extracts tool calls, and evaluates against expectations.
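For reference, a minimal sketch of this structure and one example case is shown below; the field names are illustrative assumptions inferred from the description above, not the evaluation package's actual API.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: field names are assumptions, not the real TestCase API.
@dataclass
class TestCase:
    message: str                # user message sent to the agent via HTTP POST
    expected_tools: list[str]   # tools the agent should invoke ([] means no tool)
    difficulty: int = 1         # GAIA-style difficulty level, 1 (easy) to 3 (hard)
    expected_keywords: list[str] = field(default_factory=list)  # keywords expected in the response
    follow_ups: list[str] = field(default_factory=list)         # optional multi-turn follow-up messages

# Example case in the style of the Memory Persistence category.
remember_name = TestCase(
    message="My name is Alex, please remember that.",
    expected_tools=["save_memory"],
    expected_keywords=["Alex"],
    follow_ups=["What's my name?"],
)
```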
3. Test Categories
The 225+ tests are organized into 8 categories:
Tool Selection
90 tests. Tests correct tool invocation across all 22 tools; inspired by BFCL. Covers image generation (15), video generation (10), image editing / Nano Banana (10), web search (15), website crawling (10), event creation (5), memory operations (10), media search (10), and document queries (5).
Example cases expect generate_image at difficulty 1–2.
Relevance Detection
35 tests. Tests prompts for which the agent should not invoke any tool; critical for avoiding unnecessary API calls and costs. Covers greetings (10), general knowledge (10), math/science (5), programming concepts (5), and opinions (5).
Example cases expect no_tool at difficulty 1.
Memory Persistence
25 tests. Tests information storage and retrieval across turns; inspired by LOCOMO. Covers personal info (10), preferences (5), work context (5), and multi-turn recall (5).
Example case expects save_memory, with the follow-up "What's my name?" recalling "Alex".
Context Flow
15 tests. Tests context preservation across multi-tool workflows; each test requires sequential tool invocations where the output of one informs the next.
Example case expects web_search_agent → generate_image at difficulty 2.
Multi-Turn
15 tests. Tests conversational coherence across multiple exchanges; inspired by AgentBench. Each test includes 2–4 follow-up messages building on context.
Example case expects no_tool → no_tool → no_tool → create_event.
Error Recovery
15 tests. Tests graceful handling of incomplete, ambiguous, or malformed requests; the agent should ask for clarification instead of making incorrect tool calls.
Example cases expect no_tool, with a response that asks for details or asks which image is meant.
Edge Cases
15 tests. Tests boundary conditions: abstract concepts, futuristic queries, extremely long inputs, special characters, empty-like inputs, and unusual phrasings.
Example cases expect generate_image or web_search_agent at difficulty 2.
Adversarial
15 tests. Tests robustness to adversarial inputs: negative instructions, changed-mind indications, conflicting requests, and prompt injection attempts.
Example cases expect no_tool at difficulty 2.
4. How to Run Benchmarks
AI Agent Evaluation Suite
Run from the python_service/ directory:
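A hedged sketch of driving the suite programmatically, assuming the evaluation package exposes a runner class and a test-case registry (the module names, class names, and URL below are illustrative assumptions, not the actual entry point):

```python
# Illustrative sketch only: module, class, and attribute names are assumptions.
from evaluation.benchmark_runner import BenchmarkRunner   # hypothetical module/class
from evaluation.test_cases import ALL_TEST_CASES          # hypothetical registry

# The runner POSTs each user message to the agent, parses the NDJSON stream,
# extracts tool calls, and scores them against the expected tools.
runner = BenchmarkRunner(agent_url="http://localhost:8000/agent/chat")  # hypothetical URL
report = runner.run(ALL_TEST_CASES)
print(f"accuracy={report.overall_accuracy:.1%}  stability={report.stability_score:.2%}")
```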
Automated Test Suites
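The backend suite runs under Pytest and the frontend suite under Vitest. As a minimal sketch, the backend tests can be driven through pytest's public API (the tests directory name is an assumption about the repository layout):

```python
# Minimal sketch: run the backend Pytest suite programmatically from python_service/.
# The "tests" path is an assumption about the repository layout.
import sys
import pytest

sys.exit(pytest.main(["-q", "tests"]))  # quiet output; exit code reflects pass/fail
```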
5. Metrics Definitions
| Metric | Definition | Formula |
|---|---|---|
| Overall Accuracy | Fraction of tests that pass all evaluation criteria | passed / total |
| Tool Selection Accuracy | Correct tool calls vs. expected tool calls | correct_calls / expected_calls |
| False Positive Rate | Tools called when they shouldn’t be (waste) | unexpected_calls / total_calls |
| False Negative Rate | Tools not called when they should be (missed) | missed_calls / expected_calls |
| Stability Score | Consistency of pass rates across categories | 1 − Var(pass_rate_category) |
| pass@k | Probability of passing within k attempts | 1 − (1 − p)^k |
| Cross-Modal Coherence | Context preservation across modality transitions | Semantic similarity between input context and output across tool boundaries |
| Latency (P50/P95/P99) | End-to-end response time percentiles | Measured per-test from HTTP POST to stream completion |
| Cost per Test | Token-based cost estimation | tokens × price_per_token (Gemini 2.5 Flash pricing) |
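As a worked illustration of the reliability metrics above, the helper below computes pass@k from a single-attempt pass rate and the stability score from per-category pass rates; it is illustrative rather than part of the evaluation package.

```python
from statistics import pvariance

def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one pass within k independent attempts: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k

def stability_score(category_pass_rates: list[float]) -> float:
    """1 minus the variance of per-category pass rates (closer to 1 is more consistent)."""
    return 1.0 - pvariance(category_pass_rates)

p = 0.94                            # published pass@1 / overall accuracy
print(round(pass_at_k(p, 3), 4))    # 0.9998 -> matches the reported pass@3 of 99.98%
print(round(pass_at_k(p, 5), 7))    # 0.9999992 -> reported as 100.0% after rounding
```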
6. Results Dashboard
Overall Results
| Metric | Result |
|---|---|
| Overall Accuracy | 94.0% |
| Stability Score | 99.26% |
| pass@1 | 94.0% |
| pass@3 | 99.98% |
| pass@5 | 100.0% |
Per-Category Accuracy
| Category | Tests | Accuracy |
|---|---|---|
| Tool Selection | 90 | |
| Relevance Detection | 35 | |
| Memory Persistence | 25 | |
| Context Flow | 15 | |
| Multi-Turn | 15 | |
| Error Recovery | 15 | |
| Edge Cases | 15 | |
| Adversarial | 15 | |
Per-Tool Selection Accuracy (n=60)
| Tool | Tests | Accuracy |
|---|---|---|
| generate_image | 15 | 100% |
| nano_banana | 10 | 100% |
| web_search_agent | 15 | 100% |
| crawl_website | 10 | 100% |
| save_memory | 5 | 100% |
| recall_memory | 5 | 100% |
Cross-Modal Coherence Improvement
| Transition | Baseline | MOMENTUM | Improvement |
|---|---|---|---|
| Text → Image | 0.67 | 0.89 | +32.8% |
| Text → Video | 0.61 | 0.84 | +37.7% |
| Search → Text | 0.73 | 0.91 | +24.7% |
| Image → Text | 0.69 | 0.87 | +26.1% |
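The coherence scores above are defined as semantic similarity across tool boundaries. A minimal way to compute such a score, assuming a sentence-embedding model is available (the sentence-transformers library and model name below are assumptions, not the project's actual implementation), is cosine similarity between embeddings of the input context and the generated output:

```python
# Illustrative sketch only: the benchmark's actual embedding model is not specified here.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed library choice

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def coherence(input_context: str, output_text: str) -> float:
    """Cosine similarity between embeddings of the input context and the output."""
    a, b = model.encode([input_context, output_text])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g., a Text → Image transition: compare the conversational context with the
# prompt that was actually passed to generate_image.
print(round(coherence("sunset over the Golden Gate Bridge, warm tones",
                      "warm-toned sunset photo of the Golden Gate Bridge"), 2))
```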
Latency Distribution
| Statistic | Latency |
|---|---|
| Average | 6,428 ms |
| P50 (Median) | 3,437 ms |
| P95 | 22,404 ms |
| P99 | 29,874 ms |
Cost Analysis (Gemini 2.5 Flash)
| Metric | Value |
|---|---|
| Total Tokens (100-test suite) | 31,712 |
| Estimated Cost (100 tests) | $0.0052 |
| Cost per Test | ~$0.00005 |
Ablation Study: Context Layer Contributions
| Configuration | Accuracy | Impact |
|---|---|---|
| Full System (all 6 layers) | 94.0% | — |
| − Brand Soul | 81.2% | −12.8% |
| − User Memory | 89.4% | −4.6% |
| − Settings Context | 91.7% | −2.3% |
| − All Context | 72.3% | −21.7% |
Key finding: Combined removal impact (−21.7%) exceeds sum of individual removals (−19.7%), demonstrating synergistic interactions between context layers.
7. Automated Test Infrastructure
Frontend (TypeScript/React — Vitest)
| Category | Tests | Example Files |
|---|---|---|
| Team Intelligence E2E | 136 | team-intelligence-e2e.test.ts |
| AI Model Configuration | 122 | ai-model-config-e2e.test.ts |
| Team Management | 114 | team-management-e2e.test.ts |
| Campaign E2E | 103 | campaign-e2e.test.ts |
| Media Library | 98 | media-library-e2e.test.ts |
| Personal Memory | 87 | personal-memory-e2e.test.ts |
| Personal Profile | 67 | personal-profile-e2e.test.ts |
| Team Profile | 63 | team-profile-e2e.test.ts |
| Conversation History | 95 | conversation-history.test.tsx |
| Agent Tool Accuracy | 59 | agent-tool-accuracy.test.tsx |
| Component Tests | 200+ | 56 files across 21 directories |
| Other E2E & Integration | 1,171+ | Various |
| Total Frontend | 2,315 | 345 files |
Backend (Python — Pytest)
| Module | Tests | Description |
|---|---|---|
| Core Agent Behavior | 150 | Agent factory, regression, text/media consistency |
| Image Generation | 100 | Imagen 4.0 comprehensive, editing, gallery |
| Video Generation | 15 | Veo 3.1, URL handling, image-to-video |
| Search Functionality | 80 | Utils, indexing, settings, media search |
| Vision Analysis | 65 | Service, endpoints, brand soul integration |
| Memory Operations | 50 | Bank config, sync, personal memory, management |
| Integration Tests | 75 | Unified endpoints, query generation, YouTube |
| Configuration | 4 | Cloud Run config, agent engine settings |
| Total Backend | 539 | 55 files |
8. Performance Baselines
API response time baselines measured across the system:
| Operation | Baseline Latency |
|---|---|
| Cache Hit | 5 ms |
| Streaming Response Init | 50 ms |
| AI Context Load (Cached) | 50 ms |
| Cache Miss with DB | 100 ms |
| Chat History Load | 200 ms |
| Media Library Query | 500 ms |
| AI Context Load (Uncached) | 2,000 ms |
Additional performance validations: 10,000 cache operations complete in under 1 second, memory growth stays under 15 MB after 10,000 operations (leak detection), and 100 concurrent cache operations complete in under 500 ms. All performance tests live in src/test/performance.test.ts (11 tests).
9. Benchmark Runner Architecture
The evaluation infrastructure is implemented as a Python package in python_service/evaluation/ with four modules: