Stage WXO-2 · Evaluate · 8 min
Bob Evaluates Agents
Validate agent behavior before production — quality, accuracy, and security
TALKING POINT · #1
You wouldn't ship code without tests, and the same applies to agents. Bob doesn't just build your Intelligent Loan Processing System; it validates it before a single loan application reaches production. The watsonx Orchestrate evaluation framework simulates real user interactions, checks that every agent calls the right tools in the right order, and even red-teams your system for prompt injection vulnerabilities. This is the quality gate that separates a demo from a production system.
TALKING POINT · #2
The lifecycle is iterative: Develop → Evaluate → Analyze → Improve. Bob accelerates every step. It generates your ground truth test datasets from user stories, runs evaluations automatically, and surfaces exactly where agents deviate — wrong tool call order, parameter mismatches, low-quality summaries. You fix what the data tells you to fix, then re-evaluate. Only when Journey Success is green do you move to deploy.
Demo Flow: Evaluating the Loan Processing System
1. Generate Test Dataset from User Stories
Bob auto-generates ground truth evaluation datasets by combining your agent's tool definitions with natural language user stories. No manual test authoring required.

    # Create a user_stories.csv with loan scenarios
    # Columns: story, agent
    # Example row: "Evaluate loan application for Acme Corp seeking $2M", "loan_orchestrator"
    # Bob generates ground truth test cases
    orchestrate evaluations generate \
      --stories-path ./eval/user_stories.csv \
      --tools-path ./loan_agent/tools/

💡 What Bob generates: A structured JSON dataset with the full expected tool call sequence, the dependency graph between goals, the expected final summary, and the starting user utterance. That is everything the evaluator needs to judge correctness.
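The stories file for step 1 can be written by hand or produced with a few lines of Python. The following is a minimal sketch, assuming only the two-column layout (story, agent) described above. The second story and the helper name `write_user_stories` are hypothetical and exist only for illustration.

```python
import csv
from pathlib import Path

# Each tuple is (story, agent). The first row is the example from the demo;
# the second is a made-up scenario added for illustration.
STORIES = [
    ("Evaluate loan application for Acme Corp seeking $2M", "loan_orchestrator"),
    ("Check the status of an existing small-business loan application", "loan_orchestrator"),
]

def write_user_stories(path, stories=STORIES):
    """Write a CSV with a 'story,agent' header row and one row per story."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["story", "agent"])
        writer.writerows(stories)
    return path
```

Calling `write_user_stories("./eval/user_stories.csv")` would create the file at the path that the generate command above reads from.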
2. Quick Eval: Fast Sanity Check
No ground truth needed. Quick eval catches the most common failure modes immediately: schema mismatches between tool inputs/outputs and hallucinated tool invocations.

    orchestrate evaluations quick-eval \
      -p ./eval/test_cases/ \
      -o ./eval/results/quick/ \
      -t ./loan_agent/tools/

Quick Eval Metrics
- Tool Calls: Total tool invocation attempts across all test scenarios
- Successful Tool Calls: Invocations that completed without errors
- Schema Mismatch Failures: Tool called with the wrong input/output format, caught before production
- Hallucination Failures: Agent attempted to call tools that don't exist
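To make the four quick-eval counters concrete, here is a rough sketch of how such a tally could work. It is not the framework's implementation. The tool names, the `TOOL_SCHEMAS` registry, and the rule that a call "succeeds" when its argument names exactly match the required set are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

# Hypothetical registry: tool name -> set of required parameter names.
TOOL_SCHEMAS = {
    "get_credit_score": {"applicant_id"},
    "assess_risk": {"applicant_id", "loan_amount"},
}

def quick_eval(calls, schemas=TOOL_SCHEMAS):
    """Tally quick-eval style counters for a list of attempted tool calls."""
    summary = {"tool_calls": 0, "successful": 0, "schema_mismatch": 0, "hallucination": 0}
    for call in calls:
        summary["tool_calls"] += 1
        required = schemas.get(call.name)
        if required is None:
            # The agent tried to call a tool that is not registered.
            summary["hallucination"] += 1
        elif set(call.args) != required:
            # Known tool, but the argument names do not match its schema.
            summary["schema_mismatch"] += 1
        else:
            summary["successful"] += 1
    return summary
```

Note that neither check needs an expected answer, which is why this style of evaluation can run without ground truth.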
3. Full Evaluation: Journey Success
Run the full evaluation suite against the generated ground truth dataset. An LLM-powered user agent simulates real loan applicants while the framework checks every agent decision against expected outcomes.

    orchestrate evaluations evaluate \
      --test-paths ./eval/loan_orchestrator_snapshot_llm.json \
      --output-dir ./eval/results/full/

Full Evaluation Metrics
- Journey Success: Did the agent call all tools in the correct order with the right parameters? The primary go/no-go signal.
- Tool Call Precision: Correct tool calls ÷ total calls made. Penalizes unnecessary tool invocations.
- Tool Call Recall: Whether required tools were called in proper sequence. Penalizes skipped steps.
- Text Match: Similarity between the agent's final response and the expected summary (0–100%).
- Agent Routing Accuracy: Whether the orchestrator routed tasks to the correct specialist agents.
- Answer Relevancy: For knowledge-base calls, did the retrieved context actually address the query?
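The tool-call metrics can be illustrated with a small, simplified calculation. This sketch is not the framework's scoring code. It assumes a call counts as correct when both the tool name and the arguments match an expected call, treats recall as correct calls ÷ expected calls, and treats Journey Success as "every expected call appears in order", so extra calls lower precision without failing the journey. The tool names in the usage note are hypothetical.

```python
from collections import Counter

def _key(call):
    """Make a (name, args) pair hashable; argument values must be hashable."""
    name, args = call
    return (name, tuple(sorted(args.items())))

def tool_call_metrics(expected, actual):
    """Compare an actual tool-call sequence against the expected one."""
    exp_keys = [_key(c) for c in expected]
    act_keys = [_key(c) for c in actual]

    # Count actual calls that match a not-yet-used expected call.
    remaining = Counter(exp_keys)
    correct = 0
    for k in act_keys:
        if remaining[k] > 0:
            remaining[k] -= 1
            correct += 1

    precision = correct / len(act_keys) if act_keys else 0.0
    recall = correct / len(exp_keys) if exp_keys else 0.0

    # Subsequence check: expected calls must appear in order within actual.
    it = iter(act_keys)
    journey_success = all(k in it for k in exp_keys)

    return {"precision": precision, "recall": recall, "journey_success": journey_success}
```

For example, if the expected sequence is `get_credit_score` then `assess_risk`, and the agent inserts one unrelated call between them, precision drops to 2/3 while recall stays at 1.0 and the journey still succeeds under this sketch's assumptions.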
4. Analyze Results: Find and Fix Issues
Drill into failures. The analyze command shows a step-by-step conversation replay, highlighting exactly where the agent deviated: wrong parameter, missed tool call, or poor response quality.

    orchestrate evaluations analyze \
      --results-dir ./eval/results/full/ \
      --tools-path ./loan_agent/tools/

💡 Pro Tip: Run analyze with --tools-path to get docstring quality recommendations. Bob uses the analysis output to suggest improved tool descriptions that reduce routing errors on the next evaluation pass.
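The idea of pinpointing where an agent deviated can be sketched as a walk over the expected and actual call sequences that stops at the first difference. This is an illustrative approximation, not how the analyze command works internally. The function name `first_deviation` and the tool names in the usage note are hypothetical.

```python
def first_deviation(expected, actual):
    """Return (step_index, reason) for the first mismatch, or None if none.

    Each call is a (tool_name, args_dict) pair.
    """
    for i, (exp, act) in enumerate(zip(expected, actual)):
        exp_name, exp_args = exp
        act_name, act_args = act
        if exp_name != act_name:
            return i, f"expected tool {exp_name!r}, got {act_name!r}"
        if exp_args != act_args:
            # List every parameter whose value differs or is missing.
            diff = sorted(
                k for k in exp_args.keys() | act_args.keys()
                if exp_args.get(k) != act_args.get(k)
            )
            return i, f"{exp_name}: parameter mismatch on {', '.join(diff)}"
    if len(actual) < len(expected):
        return len(actual), f"missed tool call {expected[len(actual)][0]!r}"
    if len(actual) > len(expected):
        return len(expected), f"unexpected extra call {actual[len(expected)][0]!r}"
    return None
```

Given an expected `assess_risk` call with `loan_amount=2000000` and an actual call with `loan_amount=200000`, the function reports a parameter mismatch on `loan_amount` at that step, which is the kind of finding you would then fix and re-evaluate.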
5. LLM Vulnerability Scan (Optional)
Red-team the loan processing agent against 15 attack types, from prompt injection to jailbreaking to social engineering. Aligned with the OWASP Top 10 for LLM Applications.

    # List all 15 supported attack types
    orchestrate evaluations red-teaming list
    # Generate attack scenarios for the loan orchestrator
    orchestrate evaluations red-teaming plan \
      --agent loan_orchestrator \
      --dataset ./eval/loan_orchestrator_snapshot_llm.json \
      --attacks instruction_override,prompt_leakage,role_playing
    # Execute and measure results
    orchestrate evaluations red-teaming run \
      --plan ./eval/red_team_plan.json

Attack Categories
- On-policy: Instruction Override, Emotional Appeals, Role-playing
- Off-policy: Prompt Leakage, Jailbreaking, Encoded Obfuscation
- Total Attack Variants: 15 types
- Alignment: OWASP Top 10 for LLM Applications
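When choosing values for the attacks list, it can help to check which category each selected attack falls into. The sketch below covers only the six attack names shown in this demo, not all 15. It assumes the command-line identifiers are the lowercase, underscore-separated form of the display names, which matches the three identifiers used above but is otherwise unverified.

```python
# Only the six attacks named in this demo; the full list has 15 types.
ATTACK_CATEGORIES = {
    "on-policy": ["Instruction Override", "Emotional Appeals", "Role-playing"],
    "off-policy": ["Prompt Leakage", "Jailbreaking", "Encoded Obfuscation"],
}

def to_identifier(display_name):
    """Assumed convention: 'Role-playing' -> 'role_playing'."""
    return display_name.lower().replace("-", "_").replace(" ", "_")

def categorize(attacks_flag):
    """Group a comma-separated attack list by category."""
    lookup = {
        to_identifier(name): category
        for category, names in ATTACK_CATEGORIES.items()
        for name in names
    }
    selected = {}
    for attack in attacks_flag.split(","):
        attack = attack.strip()
        selected.setdefault(lookup.get(attack, "unknown"), []).append(attack)
    return selected
```

Applied to the list used in the plan step, `categorize("instruction_override,prompt_leakage,role_playing")` groups two attacks under on-policy and one under off-policy, so the demo plan exercises both categories.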
Evaluation at a Glance
- Primary Go/No-Go Metric: Journey Success
- Test Datasets: Auto-generated
- Quick Eval Mode: No Ground Truth
- Red-Team Coverage: 15 Attack Types
- Environment Support: Local + SaaS
- Iterative Loop: Develop → Eval → Improve
Setup Note: Evaluation can run locally against watsonx Orchestrate Developer Edition or against SaaS/on-premises instances (draft environment only). Configure your .env with WATSONX_INSTANCE_URL and WATSONX_API_KEY before running.
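A quick pre-flight check can catch a missing setting before an evaluation run starts. This is a small sketch that assumes the two variables named in the setup note have already been loaded into the process environment. It does not parse the .env file itself, and the helper name `missing_settings` is hypothetical.

```python
import os

# Variable names taken from the setup note above.
REQUIRED_VARS = ("WATSONX_INSTANCE_URL", "WATSONX_API_KEY")

def missing_settings(env=os.environ):
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]

if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit(f"Missing settings: {', '.join(missing)}")
    print("Environment looks ready for evaluation.")
```

The function accepts any mapping, so it can also be pointed at a dictionary built from a parsed .env file.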
Why This Matters for Customers:
In financial services, a loan decision that skips a compliance check or misroutes to the wrong specialist agent isn't just a bug — it's a regulatory risk. Journey Success gives you a measurable, repeatable confidence score before production. Red-teaming ensures your loan agent can't be manipulated by a bad actor into approving fraudulent applications. Bob makes this entire quality pipeline something a developer can run in minutes, not a QA cycle that takes weeks.