Stage WXO-2 - Bob Evaluates Agents | Bob + watsonx Orchestrate Demo

Stage WXO-2 · Evaluate · 8 min

Bob Evaluates Agents

Validate agent behavior before production — quality, accuracy, and security

💬 TALKING POINT · #1

You wouldn't ship code without tests, and the same applies to agents. Bob doesn't just build your Intelligent Loan Processing System; it validates it before a single loan application reaches production. The watsonx Orchestrate evaluation framework simulates real user interactions, checks that every agent calls the right tools in the right order, and red-teams your system for prompt injection vulnerabilities. This is the quality gate that separates a demo from a production system.

💬 TALKING POINT · #2

The lifecycle is iterative: Develop → Evaluate → Analyze → Improve. Bob accelerates every step. It generates your ground truth test datasets from user stories, runs evaluations automatically, and surfaces exactly where agents deviate — wrong tool call order, parameter mismatches, low-quality summaries. You fix what the data tells you to fix, then re-evaluate. Only when Journey Success is green do you move to deploy.

Demo Flow: Evaluating the Loan Processing System

1. Generate Test Dataset from User Stories

Bob auto-generates ground truth evaluation datasets by combining your agent's tool definitions with natural language user stories. No manual test authoring required.

# Create a user_stories.csv with loan scenarios
# Columns: story, agent
# Example row: "Evaluate loan application for Acme Corp seeking $2M", "loan_orchestrator"

# Bob generates ground truth test cases
orchestrate evaluations generate \
  --stories-path ./eval/user_stories.csv \
  --tools-path ./loan_agent/tools/
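
If you prefer to script the stories file rather than type it by hand, the following is a minimal sketch. It assumes only the two columns named in the comments above (story, agent); the second story and the helper name `write_user_stories` are hypothetical and exist purely for illustration.

```python
import csv
from pathlib import Path

# Hypothetical loan scenarios. The column names follow the comments in the
# command listing above; the story text is illustrative only.
STORIES = [
    ("Evaluate loan application for Acme Corp seeking $2M", "loan_orchestrator"),
    ("Check the status of a pending loan application", "loan_orchestrator"),
]


def write_user_stories(path: Path, stories=STORIES) -> Path:
    """Write a two-column CSV (story, agent), quoting every field."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["story", "agent"])
        writer.writerows(stories)
    return path
```

Calling `write_user_stories(Path("eval/user_stories.csv"))` would produce the file that the generate command reads. Quoting every field keeps commas inside a story from splitting the row.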

💡
What Bob generates: A structured JSON dataset with the full expected tool call sequence, dependency graph between goals, expected final summary, and the starting user utterance — everything the evaluator needs to judge correctness.
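
To make the shape of that dataset concrete, here is a rough sketch of a structural check over one test case. The field names (`starting_sentence`, `goals`, `goal_details`) and the tool names are assumptions made for illustration, not a confirmed schema; inspect a generated file to see the real layout.

```python
# Hypothetical test case: field names and tool names are assumptions chosen
# to mirror the four parts described above, not a confirmed schema.
SAMPLE_CASE = {
    "agent": "loan_orchestrator",
    "starting_sentence": "I'd like to apply for a $2M loan for Acme Corp.",
    # Dependency graph: each goal maps to the goals that depend on it.
    "goals": {
        "check_credit": ["assess_risk"],
        "assess_risk": ["summarize"],
    },
    "goal_details": [
        {"name": "check_credit", "type": "tool_call", "args": {"company": "Acme Corp"}},
        {"name": "assess_risk", "type": "tool_call", "args": {"amount": 2000000}},
        {"name": "summarize", "type": "text", "response": "Application forwarded for review."},
    ],
}


def validate_case(case: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the case looks usable."""
    problems = [
        f"missing field: {key}"
        for key in ("agent", "starting_sentence", "goals", "goal_details")
        if key not in case
    ]
    if problems:
        return problems
    known = {detail["name"] for detail in case["goal_details"]}
    for goal, dependents in case["goals"].items():
        for name in [goal, *dependents]:
            if name not in known:
                problems.append(f"goal '{name}' has no entry in goal_details")
    if not any(d["type"] == "text" for d in case["goal_details"]):
        problems.append("no expected final summary (type 'text')")
    return problems
```

A check like this is cheap to run before a full evaluation: a dependency graph that points at an undefined goal would otherwise surface only as a confusing failure later.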

2. Quick Eval: Fast Sanity Check

No ground truth needed. Quick eval immediately catches two of the most common failure modes: tool calls whose inputs or outputs don't match the tool's schema, and calls to tools that don't exist (hallucinated tool invocations).

orchestrate evaluations quick-eval \
  -p ./eval/test_cases/ \
  -o ./eval/results/quick/ \
  -t ./loan_agent/tools/

Quick Eval Metrics

Tool Calls

Total tool invocation attempts across all test scenarios

Successful Tool Calls

Invocations that completed without errors

Schema Mismatch Failures

Tool called with wrong input/output format — caught before production

Hallucination Failures

Agent attempted to call tools that don't exist
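
The four counters above can be illustrated with a small sketch. This is not how the framework is implemented; it only shows the classification idea. The tool registry, the tool names, and the simplification that a call "succeeds" when its arguments match the schema are all assumptions.

```python
from collections import Counter

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "check_credit": {"company": str},
    "assess_risk": {"amount": int},
}


def classify_call(name: str, args: dict) -> str:
    """Classify one attempted tool call into one of three outcomes."""
    if name not in TOOLS:
        return "hallucination"        # the tool does not exist
    schema = TOOLS[name]
    if set(args) != set(schema):
        return "schema_mismatch"      # missing or unexpected arguments
    if any(not isinstance(args[key], typ) for key, typ in schema.items()):
        return "schema_mismatch"      # an argument has the wrong type
    return "success"


def summarize(calls: list[tuple[str, dict]]) -> dict:
    """Aggregate per-call outcomes into the four quick-eval style counters."""
    counts = Counter(classify_call(name, args) for name, args in calls)
    return {
        "tool_calls": len(calls),
        "successful_tool_calls": counts["success"],
        "schema_mismatch_failures": counts["schema_mismatch"],
        "hallucination_failures": counts["hallucination"],
    }
```

Because neither check needs an expected answer, this kind of validation works without ground truth, which is why it suits a fast first pass.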

3. Full Evaluation: Journey Success

Run the full evaluation suite against the generated ground truth dataset. An LLM-powered user agent simulates real loan applicants while the framework checks every agent decision against expected outcomes.

orchestrate evaluations evaluate \
  --test-paths ./eval/loan_orchestrator_snapshot_llm.json \
  --output-dir ./eval/results/full/

Full Evaluation Metrics

Journey Success

Did the agent call all tools in the correct order with the right parameters? The primary go/no-go signal.

Tool Call Precision

Correct tool calls ÷ total calls made. Penalizes unnecessary tool invocations.

Tool Call Recall

Share of the required tool calls that the agent actually made, in proper sequence. Penalizes skipped steps.

Text Match

Similarity between the agent's final response and the expected summary (0–100%).

Agent Routing Accuracy

Whether the orchestrator routed tasks to the correct specialist agents.

Answer Relevancy

For knowledge-base calls: did retrieved context actually address the query?
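
As a worked illustration of the three tool-call metrics, the sketch below scores one conversation. It assumes the conventional definitions (precision = correct ÷ calls made, recall = correct ÷ calls expected) and a strict reading of Journey Success; the framework's exact matching rules may differ, and the function name `score_journey` is hypothetical.

```python
def score_journey(expected: list[tuple[str, dict]], actual: list[tuple[str, dict]]) -> dict:
    """Compare expected and actual (tool_name, args) sequences.

    A call counts as correct when the same tool with the same arguments
    appears in the expected list; each expected call can be matched once.
    """
    remaining = list(expected)
    correct = 0
    for call in actual:
        if call in remaining:
            remaining.remove(call)
            correct += 1
    return {
        # Penalizes unnecessary calls: extra calls grow the denominator.
        "tool_call_precision": correct / len(actual) if actual else 0.0,
        # Penalizes skipped steps: missing calls shrink the numerator.
        "tool_call_recall": correct / len(expected) if expected else 0.0,
        # Strict reading: same calls, same order, same arguments.
        "journey_success": actual == expected,
    }
```

For example, an agent that makes both expected calls and then repeats one scores a recall of 1.0 but a precision of 2/3, and fails Journey Success. That is why a single redundant call can block the go/no-go signal even when nothing was skipped.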

4. Analyze Results: Find and Fix Issues

Drill into failures. The analyze command shows a step-by-step conversation replay, highlighting exactly where the agent deviated — wrong parameter, missed tool call, or poor response quality.

orchestrate evaluations analyze \
  --results-dir ./eval/results/full/ \
  --tools-path ./loan_agent/tools/

💡
Pro Tip: Run analyze with --tools-path to get docstring quality recommendations. Bob uses the analysis output to suggest improved tool descriptions that reduce routing errors on the next evaluation pass.
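
The idea behind locating a deviation can be sketched in a few lines. This is an illustrative stand-in, not the analyze command's actual logic; it assumes each run is reduced to a list of (tool_name, args) tuples, and the function name `first_deviation` is hypothetical.

```python
from itertools import zip_longest


def first_deviation(expected, actual):
    """Return (step, reason) for the first mismatch, or None if the runs agree.

    Both arguments are lists of (tool_name, args) tuples; steps are 1-based.
    """
    for step, (exp, act) in enumerate(zip_longest(expected, actual), start=1):
        if exp == act:
            continue
        if act is None:
            return step, f"missed tool call: {exp[0]}"
        if exp is None:
            return step, f"unexpected tool call: {act[0]}"
        if exp[0] != act[0]:
            return step, f"wrong tool: expected {exp[0]}, got {act[0]}"
        return step, f"wrong parameters for {exp[0]}: expected {exp[1]}, got {act[1]}"
    return None
```

Reporting only the first mismatch is a deliberate choice in this sketch: later steps often fail as a consequence of the first error, so fixing it and re-running tends to be more informative than listing every difference.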

5. LLM Vulnerability Scan (Optional)

Red-team the loan processing agent against 15 attack types, ranging from prompt injection and jailbreaking to social engineering. The attack set is aligned with the OWASP Top 10 for LLM Applications.

# List all 15 supported attack types
orchestrate evaluations red-teaming list

# Generate attack scenarios for the loan orchestrator
orchestrate evaluations red-teaming plan \
  --agent loan_orchestrator \
  --dataset ./eval/loan_orchestrator_snapshot_llm.json \
  --attacks instruction_override,prompt_leakage,role_playing

# Execute and measure results
orchestrate evaluations red-teaming run \
  --plan ./eval/red_team_plan.json

Attack Categories

On-policy: Instruction Override, Emotional Appeals, Role-playing

Off-policy: Prompt Leakage, Jailbreaking, Encoded Obfuscation

Total attack types: 15

Alignment: OWASP Top 10 for LLM Applications
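
One way to read red-team results is as an attack success rate per category. The sketch below tallies that from a list of outcomes. The result format and the snake_case identifiers other than the three used in the plan command above are assumptions, so adapt them to whatever the run step actually emits.

```python
from collections import defaultdict

# Category map built from the attack names listed above. Only
# instruction_override, prompt_leakage, and role_playing appear in the plan
# command; the other identifiers are assumed spellings.
CATEGORY = {
    "instruction_override": "on-policy",
    "emotional_appeals": "on-policy",
    "role_playing": "on-policy",
    "prompt_leakage": "off-policy",
    "jailbreaking": "off-policy",
    "encoded_obfuscation": "off-policy",
}


def attack_success_rate(results):
    """Map each category to the fraction of its attacks that succeeded.

    results: iterable of (attack_type, succeeded) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [successes, attempts]
    for attack, succeeded in results:
        bucket = totals[CATEGORY[attack]]
        bucket[0] += int(succeeded)
        bucket[1] += 1
    return {cat: hits / attempts for cat, (hits, attempts) in totals.items()}
```

A lower rate is better here: a rate of 0.0 for a category means none of the attempted attacks in that category got through in this run.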

Evaluation at a Glance

Primary Go/No-Go Metric: Journey Success

Test Datasets: Auto-generated

Quick Eval Mode: No ground truth needed

Red-Team Coverage: 15 attack types

Environment Support: Local + SaaS

Iterative Loop: Develop → Evaluate → Analyze → Improve

⚙

Setup Note: Evaluation can run locally against watsonx Orchestrate Developer Edition or against SaaS/on-premises instances (draft environment only). Configure your .env with WATSONX_INSTANCE_URL and WATSONX_API_KEY before running.
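
A small preflight check can catch a missing setting before any evaluation starts. The sketch below uses the two variable names from the setup note and assumes they have already been exported into the process environment (for example, by loading the .env file in your shell); the function name `missing_env` is hypothetical.

```python
import os

# Variable names taken from the setup note above.
REQUIRED_VARS = ("WATSONX_INSTANCE_URL", "WATSONX_API_KEY")


def missing_env(env=None) -> list[str]:
    """Return the required variables that are unset or empty.

    Pass a dict to check something other than the real process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running `missing_env()` at the top of a demo script and stopping when the list is non-empty avoids a failed connection halfway through a live walkthrough.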

💡

Why This Matters for Customers:

In financial services, a loan decision that skips a compliance check or misroutes to the wrong specialist agent isn't just a bug; it's a regulatory risk. Journey Success gives you a measurable, repeatable go/no-go signal before production. Red-teaming tests whether a bad actor can manipulate your loan agent into approving fraudulent applications. Bob makes this entire quality pipeline something a developer can run in minutes, not a QA cycle that takes weeks.
