Case Study: OpenClaw Operator And Eval System
One-Line Version
I used OpenClaw as an AI operator cockpit around TeachClaw: routing work to agents, surfacing commands, running validation lanes, preserving memory, and turning failures into repeatable proving loops.
Problem
Building an AI product with agents creates a second system to manage: context, memory, routing, validation, live/runtime drift, and task handoff. Without an operator layer, the team repeats prompts, loses state, and promotes changes without enough evidence.
System
OpenClaw provides the runtime shell:
- Gateway and chat/session layer
- Plugins and skills
- Agent contexts and command surfaces
- Cron/heartbeat jobs
- LCM long-context memory database
- Local operator cockpit for TeachClaw development
TeachClaw then adds repo-local operating modes:
- teachclawdevelopment: default monothread development flow
- teachclawoperator: operator-over-workers escalation path
- teachclaw-test: proving runtime
- local test gateway and agentic eval harness
What I Owned
- GPT-only route cleanup for TeachClaw operator work.
- Command/skill discoverability so the operator system can invoke TeachClaw loops visibly.
- Validation artifact contract with concise trust summaries.
- Local/live boundary design: local autonomous work is allowed; live external actions need explicit approval.
- Reliability-first priority order: reliability, speed, autonomy, cost.
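The local/live boundary in the list above can be read as a small policy check. The following is a minimal sketch, not the actual OpenClaw or TeachClaw implementation: the names `Scope`, `Action`, and `is_allowed`, and the example action names, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    LOCAL = "local"  # stays inside the local checkout or test gateway
    LIVE = "live"    # touches real users or external systems


@dataclass(frozen=True)
class Action:
    name: str
    scope: Scope


def is_allowed(action: Action, approved: frozenset[str] = frozenset()) -> bool:
    """Local actions run autonomously; live actions need explicit approval by name."""
    if action.scope is Scope.LOCAL:
        return True
    return action.name in approved


# Hypothetical action names, for illustration only.
print(is_allowed(Action("run-local-eval", Scope.LOCAL)))
print(is_allowed(Action("send-teacher-message", Scope.LIVE)))
```

Defaulting `approved` to an empty set means a live action is denied unless someone names it explicitly, which matches the reliability-first ordering.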
Evaluation Design
The eval stack evolved from simple tests into product-shaped gates:
- Focused unit/type/runtime checks
- Local TeachClaw test gateway
- Scenario packs for teacher-core, reliability, teacher experience, Abshir transcript, memory, and observation-only soak work
- Promotion summaries with trust_level, next_action, failing gates, and drift status
- Regression diagnosis before patching surprising failures
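To make the promotion summary concrete, here is one possible shape for it, sketched under assumptions. Only the field names trust_level and next_action and the ideas of failing gates and drift status come from this case study; the `PromotionSummary` class, the `summarize` function, the trust values "trusted" and "untrusted", and the next-action strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PromotionSummary:
    trust_level: str  # assumed values: "trusted" or "untrusted"
    next_action: str
    failing_gates: list[str] = field(default_factory=list)
    drift_detected: bool = False


def summarize(gate_results: dict[str, bool], drift_detected: bool) -> PromotionSummary:
    """Collapse per-gate pass/fail results into one compact summary."""
    failing = sorted(name for name, passed in gate_results.items() if not passed)
    if failing:
        # Failing gates take priority: diagnose first, patch second.
        return PromotionSummary(
            "untrusted", "diagnose failing gates before patching", failing, drift_detected
        )
    if drift_detected:
        # All gates passed, but possibly against a stale view of the checkout.
        return PromotionSummary("untrusted", "re-verify against the real checkout", [], True)
    return PromotionSummary("trusted", "open promotion discussion", [], False)
```

A future agent reading such a record needs only `trust_level` and `next_action` to decide what to do, and can look at `failing_gates` when it needs detail.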
Example Scenarios
- regression-no-debug-chatter: prevent internal dev text from leaking to teachers.
- worksheet-core-generation: verify core worksheet artifact generation.
- delivery-file-send-check: check generated files are delivered as files, not raw paths/text.
- provisioning-cloud-init-contract: protect runtime provisioning assumptions.
- journey-local-change-to-promotion-candidate: model the path from local change to promotion discussion.
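As an illustration of what a scenario like regression-no-debug-chatter might assert, the sketch below scans an outgoing message for markers of internal dev text. The patterns and the `find_debug_chatter` function are assumptions made for this example; the real scenario's rules are not described here.

```python
import re

# Hypothetical markers of internal dev text leaking into a teacher-facing message.
DEBUG_PATTERNS = [
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"^\s*(DEBUG|TODO)\b", re.MULTILINE),
    re.compile(r"(^|\s)/(tmp|home|var)/\S+"),  # raw filesystem paths
]


def find_debug_chatter(message: str) -> list[str]:
    """Return the pattern strings that match the outgoing message."""
    return [p.pattern for p in DEBUG_PATTERNS if p.search(message)]


clean = "Here is your worksheet on fractions."
leaky = "DEBUG: tool call failed\nsaved to /tmp/out.docx"
print(find_debug_chatter(clean))
print(find_debug_chatter(leaky))
```

The raw-path pattern also overlaps with the concern behind delivery-file-send-check: a path in the message body suggests the file was described in text and not actually attached.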
Hard Problems
Context Drift
Generated current-state files can lag the real checkout. The operator model treats these docs as useful hints, but verifies branch state, recent commits, and validation artifacts before trusting them.
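One way to verify a generated state file against the checkout is to compare the commit it records with the commit git reports. This is a minimal sketch that assumes the state file records a commit hash, which the case study does not specify. The helper names `current_head` and `has_drifted` are hypothetical; `git rev-parse HEAD` is the standard git command for reading the current commit.

```python
import subprocess


def current_head(repo_dir: str) -> str:
    """Ask git which commit the checkout is actually on."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def has_drifted(recorded_commit: str, actual_commit: str) -> bool:
    """True when the recorded commit does not identify the actual HEAD."""
    recorded = recorded_commit.strip().lower()
    actual = actual_commit.strip().lower()
    # Abbreviated hashes are accepted as prefixes; an empty record counts as drift.
    return not recorded or not actual.startswith(recorded)
```

In use, `has_drifted(recorded, current_head("."))` would run before any generated doc is trusted, with `recorded` parsed from that doc.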
False Confidence
A mechanical pass is not always a product pass. .docx and .pptx outputs may still require human quality judgement even when the runtime is correct.
Agentic Loop Reliability
Long-running agent tests need checkpointing, resumability, artifact copying, and clear failure classification. Otherwise the harness itself becomes the failure source.
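Checkpointing, resumability, and failure classification can be sketched together in a few lines. The example below is an assumption-laden illustration and not the TeachClaw harness: the JSON checkpoint layout, the function names, and the "harness" / "product" / "unknown" categories are all hypothetical, and artifact copying is left out.

```python
import json
from pathlib import Path


def classify_failure(exc: Exception) -> str:
    """Separate harness problems from product problems (categories are assumed)."""
    if isinstance(exc, OSError):  # includes TimeoutError and ConnectionError
        return "harness"
    if isinstance(exc, AssertionError):
        return "product"
    return "unknown"


def run_with_checkpoint(steps, checkpoint: Path) -> dict:
    """Run (name, callable) steps in order, skipping steps already recorded as done."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": [], "failure": None}
    state["failure"] = None  # a resumed run starts with a clean failure slot
    for name, fn in steps:
        if name in state["done"]:
            continue  # finished in an earlier run
        try:
            fn()
        except Exception as exc:
            state["failure"] = {"step": name, "kind": classify_failure(exc)}
            checkpoint.write_text(json.dumps(state))
            return state
        state["done"].append(name)
        checkpoint.write_text(json.dumps(state))  # persist progress after every step
    return state
```

Writing the checkpoint after every step means an interrupted run resumes from the last completed step, and the recorded `kind` indicates whether to look at the harness or the product first.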
Proof
- Local reliability pack reached a trustworthy local bar with 5/5 mechanical passes and no lane-instability failures in the remembered run.
- Teacher-memory long-session eval reached 1 pass, 0 fail, 0 needs judgement in the remembered proof object.
- Operator cleanup made TeachClaw commands visible from the main OpenClaw control system.
- Validation summaries became compact enough for future agents to decide quickly whether a run was trustworthy.
Role Relevance
This maps to forward-deployed engineer (FDE) and AI deployment roles because enterprise AI deployments need more than model calls:
- deployment gates
- customer-shaped scenarios
- eval design
- production incident learning
- crisp status reporting
- safe autonomy boundaries
- repeatable deployment patterns
That is the actual work behind “AI adoption” once prototypes meet real users.