openclaw-operator-evals

Case Study: OpenClaw Operator And Eval System

One-Line Version

I used OpenClaw as an AI operator cockpit around TeachClaw: routing work to agents, surfacing commands, running validation lanes, preserving memory, and turning failures into repeatable proving loops.

Problem

Building an AI product with agents creates a second system to manage: context, memory, routing, validation, live/runtime drift, and task handoff. Without an operator layer, the team repeats prompts, loses state, and promotes changes without enough evidence.

System

OpenClaw provides the runtime shell:

Gateway and chat/session layer
Plugins and skills
Agent contexts and command surfaces
Cron/heartbeat jobs
LCM long-context memory database
Local operator cockpit for TeachClaw development

TeachClaw then adds repo-local operating modes:

teachclawdevelopment: default monothread development flow
teachclawoperator: operator-over-workers escalation path
teachclaw-test: proving runtime
local test gateway and agentic eval harness

What I Owned

GPT-only route cleanup for TeachClaw operator work.
Command/skill discoverability so the operator system can invoke TeachClaw loops visibly.
Validation artifact contract with concise trust summaries.
Local/live boundary design: local autonomous work is allowed; live external actions need explicit approval.
Reliability-first priority order: reliability, speed, autonomy, cost.
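
The local/live boundary above can be sketched as a small authorization check. This is a minimal illustration, not the actual OpenClaw or TeachClaw code: the names Scope, Action, ApprovalRequired, and authorize are hypothetical, and the assumption is that approvals are tracked as a set of explicitly approved action names.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    LOCAL = "local"  # work confined to the local checkout or test runtime
    LIVE = "live"    # anything that touches external or live systems


@dataclass(frozen=True)
class Action:
    name: str
    scope: Scope


class ApprovalRequired(Exception):
    """Raised when a live action is attempted without explicit approval."""


def authorize(action: Action, approved: frozenset[str] = frozenset()) -> bool:
    """Allow local actions freely; allow live actions only when approved by name."""
    if action.scope is Scope.LOCAL:
        return True
    if action.name in approved:
        return True
    raise ApprovalRequired(f"live action {action.name!r} needs explicit approval")
```

Raising an exception rather than returning False means an unapproved live action cannot be ignored silently by a caller that forgets to check the result.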

Evaluation Design

The eval stack evolved from simple tests into product-shaped gates:

Focused unit/type/runtime checks
Local TeachClaw test gateway
Scenario packs for teacher-core, reliability, teacher experience, Abshir transcript, memory, and observation-only soak work
Promotion summaries with trust_level, next_action, failing gates, and drift status
Regression diagnosis before patching surprising failures
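
A promotion summary of this shape might look like the sketch below. Only the field names trust_level and next_action come from the list above; the class name, the specific trust values ("trusted", "needs_review", "untrusted"), and the wording of the next actions are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PromotionSummary:
    failing_gates: list[str] = field(default_factory=list)
    needs_judgement: list[str] = field(default_factory=list)
    drift_detected: bool = False

    @property
    def trust_level(self) -> str:
        # Hypothetical trust values: any failing gate or drift blocks trust.
        if self.failing_gates or self.drift_detected:
            return "untrusted"
        if self.needs_judgement:
            return "needs_review"
        return "trusted"

    @property
    def next_action(self) -> str:
        # Drift is checked first because it can invalidate every other result.
        if self.drift_detected:
            return "re-verify checkout state and rerun"
        if self.failing_gates:
            return "diagnose regression before patching"
        if self.needs_judgement:
            return "request human quality review"
        return "eligible for promotion discussion"
```

Deriving trust_level and next_action from the raw gate results, rather than storing them separately, keeps the summary from contradicting its own evidence.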

Example Scenarios

regression-no-debug-chatter: prevent internal dev text from leaking to teachers.
worksheet-core-generation: verify core worksheet artifact generation.
delivery-file-send-check: check generated files are delivered as files, not raw paths/text.
provisioning-cloud-init-contract: protect runtime provisioning assumptions.
journey-local-change-to-promotion-candidate: model the path from local change to promotion discussion.
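
As a rough illustration of what a scenario such as regression-no-debug-chatter could check, the sketch below scans a teacher-facing message for internal development text. The patterns and the function name find_debug_chatter are hypothetical; the real scenario pack may detect leaks quite differently.

```python
import re

# Hypothetical markers of internal text that should never reach a teacher.
DEBUG_PATTERNS = [
    re.compile(r"Traceback \(most recent call last\)"),  # Python stack traces
    re.compile(r"\b(DEBUG|TODO|FIXME)\b"),               # developer annotations
    re.compile(r"(^|\s)/(home|tmp|var)/\S+"),            # raw filesystem paths
]


def find_debug_chatter(message: str) -> list[str]:
    """Return the patterns that matched; an empty list means the message is clean."""
    return [p.pattern for p in DEBUG_PATTERNS if p.search(message)]
```

The raw-path pattern overlaps with the concern behind delivery-file-send-check: a message that prints a path instead of attaching the file is flagged here as well.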

Hard Problems

Context Drift

Generated current-state files can lag behind the real checkout. The operator model treats these docs as useful hints, but verifies branch state, recent commits, and validation artifacts before trusting them.
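
One way to make that verification concrete is to compare what a generated state file claims against what git reports. The sketch below assumes the state file records a branch name and a commit hash. CheckoutState, read_git_state, and drift_report are illustrative names, not part of OpenClaw.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutState:
    branch: str
    commit: str


def read_git_state(repo_dir: str = ".") -> CheckoutState:
    """Read the current branch and HEAD commit from a git checkout."""
    def git(*args: str) -> str:
        result = subprocess.run(
            ["git", "-C", repo_dir, *args],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()

    return CheckoutState(
        branch=git("rev-parse", "--abbrev-ref", "HEAD"),
        commit=git("rev-parse", "HEAD"),
    )


def drift_report(documented: CheckoutState, actual: CheckoutState) -> list[str]:
    """List every field where the documented state disagrees with the checkout."""
    problems = []
    if documented.branch != actual.branch:
        problems.append(f"branch: docs say {documented.branch}, checkout is {actual.branch}")
    if documented.commit != actual.commit:
        problems.append(f"commit: docs say {documented.commit}, checkout is {actual.commit}")
    return problems
```

Keeping the comparison separate from the git calls lets the drift logic be tested without a real repository.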

False Confidence

A mechanical pass is not always a product pass. .docx and .pptx outputs may still require human quality judgement even when the runtime behaves correctly.

Agentic Loop Reliability

Long-running agent tests need checkpointing, resumability, artifact copying, and clear failure classification. Otherwise the harness itself becomes the failure source.
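
A minimal sketch of checkpointing with failure classification follows. It assumes a scenario signals a product failure by raising AssertionError and that any other exception is a harness failure. That convention, the status strings, and the function name run_with_checkpoint are assumptions for illustration only.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoint(
    steps: list[tuple[str, Callable[[], None]]],
    checkpoint: Path,
) -> dict[str, str]:
    """Run named steps, persisting status after each so a rerun can resume."""
    done: dict[str, str] = (
        json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    )
    for name, step in steps:
        if done.get(name) == "pass":
            continue  # resumability: never repeat work that already passed
        try:
            step()
            done[name] = "pass"
        except AssertionError:
            done[name] = "product_failure"  # the scenario's own check failed
        except Exception:
            done[name] = "harness_failure"  # the test infrastructure broke
        checkpoint.write_text(json.dumps(done))  # checkpoint after every step
    return done
```

Separating the two failure classes matters because a harness failure says nothing about the product, so it should trigger a rerun rather than a patch.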

Proof

Local reliability pack reached a trustworthy local bar with 5/5 mechanical passes and no lane-instability failures in the remembered run.
Teacher-memory long-session eval reached 1 pass, 0 fail, 0 needs judgement in the remembered proof object.
Operator cleanup made TeachClaw commands visible from the main OpenClaw control system.
Validation summaries became compact enough for future agents to decide quickly whether a run was trustworthy.

Role Relevance

This maps to forward-deployed engineer (FDE) and AI deployment roles because enterprise AI deployments need more than model calls:

deployment gates
customer-shaped scenarios
eval design
production incident learning
crisp status reporting
safe autonomy boundaries
repeatable deployment patterns

That is the actual work behind “AI adoption” once prototypes meet real users.