TeachClaw deployment case study

Agentic work layer for teachers.

TeachClaw is a chat-native AI assistant for UK teachers. It turns plain teacher requests into lesson decks, worksheets, mark schemes, student feedback, marking support, and classroom artifacts through a live OpenClaw runtime.

Teacher workflows · Task routing · Deterministic artifacts · Agentic evals · VPS deployment
[Figure: TeachClaw and OpenClaw architecture diagram]
Public-safe architecture: teacher surfaces, TeachClaw platform layer, OpenClaw runtime, deterministic builders, validation, and guarded promotion.

Proof

These are public-safe numbers from the career evidence pack. They show usage, eval depth, and the kind of release discipline behind the project.

95 teacher messages reviewed across TeachClaw pilot usage.
100/100 teacher-style turns completed in a long-session routing comparison.
4-image literature essay marking path completed through the real local pipeline.
5 gates: routing, context carry, artifacts, tool discipline, and memory behavior.

Problem and system

Teachers do not need another dashboard.

The product thesis is that education software should remove work, not add another inbox. A teacher should be able to send a normal request and get finished school work back: a deck, worksheet, feedback draft, mark scheme, or marking report.

The system is deliberately workflow-first. TeachClaw uses browser and Telegram surfaces, OpenClaw as the runtime layer, task-family routing, deterministic builders for generated files, and validation lanes before promotion.

Lesson decks

Turns a teacher request into a real .pptx artifact using structured intermediate JSON plus a Python builder.
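A minimal sketch of the structured intermediate JSON step, assuming a hypothetical spec shape (`title`, plus `slides` with `heading` and `bullets`). The field names and the `parse_deck_spec` helper are illustrative assumptions, not TeachClaw's actual schema. The idea it shows is that model output is validated as data before any builder writes a file.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Slide:
    heading: str
    bullets: tuple


@dataclass(frozen=True)
class DeckSpec:
    title: str
    slides: tuple


def parse_deck_spec(raw: str) -> DeckSpec:
    """Parse and validate a hypothetical intermediate deck spec.

    Raises ValueError on malformed input so a bad spec never reaches the builder.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"spec is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("spec must be a JSON object")

    title = data.get("title")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("spec needs a non-empty 'title'")

    raw_slides = data.get("slides")
    if not isinstance(raw_slides, list) or not raw_slides:
        raise ValueError("spec needs at least one slide")

    slides = []
    for index, item in enumerate(raw_slides):
        if not isinstance(item, dict):
            raise ValueError(f"slide {index} must be an object")
        heading = item.get("heading")
        bullets = item.get("bullets", [])
        if not isinstance(heading, str) or not heading.strip():
            raise ValueError(f"slide {index} needs a non-empty 'heading'")
        if not isinstance(bullets, list) or not all(isinstance(b, str) for b in bullets):
            raise ValueError(f"slide {index} 'bullets' must be a list of strings")
        slides.append(Slide(heading=heading, bullets=tuple(bullets)))

    return DeckSpec(title=title, slides=tuple(slides))
```

Keeping the spec separate from rendering means the same validated structure could be handed to a deterministic builder, so layout decisions live in code rather than in model output.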

Worksheets

Generates classroom worksheets and answer keys as real .docx artifacts rather than only chat text.

Marking support

Routes student work through OCR and transcription, judgement, correction, and report-rendering paths.

Teacher memory

Uses scoped memory cards so preferences can improve output without leaking raw memory or cross-teacher context.
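One way to picture scoped memory cards, as a sketch under assumptions: the `MemoryCard` fields and the `cards_for` helper are hypothetical, and the real card format is not shown in this case study. The point is that lookup is filtered by teacher and route, and only a short curated summary is ever returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryCard:
    teacher_id: str
    route: str    # e.g. "worksheet" or "powerpoint" (assumed route names)
    summary: str  # short curated preference, never the raw memory


def cards_for(cards, teacher_id, route):
    """Return only the summaries scoped to this teacher and this route."""
    return [
        card.summary
        for card in cards
        if card.teacher_id == teacher_id and card.route == route
    ]
```

For example, a worksheet request from one teacher would see that teacher's worksheet preferences and nothing belonging to anyone else.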

Workflow

The case study is best read as a deployment loop: the model is only one part of the system.

01 Natural teacher request

Example public-safe scenario: create a Year 11 Macbeth ambition analysis PowerPoint for AQA Paper 1 and return the final file path only.

02 Route and contract

The runtime decides whether this is a PowerPoint, worksheet, marking, email, admin, or chat route. Output expectations are explicit: text reply or artifact path, extension, required terms, and forbidden leakage.
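The output expectations could be expressed as a small contract object. This is a sketch, not the runtime's real contract type: `OutputContract`, its field names, and `contract_violations` are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class OutputContract:
    kind: str                    # "text" or "artifact_path"
    extension: str = ""          # e.g. ".pptx"; only used for artifact paths
    required_terms: tuple = ()
    forbidden_terms: tuple = ()


def contract_violations(contract, reply):
    """Return a list of readable violations; an empty list means the reply passes."""
    problems = []
    text = reply.strip()
    lowered = text.lower()

    if contract.kind == "artifact_path":
        # "Return the final file path only": one line with the right extension.
        if not text or "\n" in text:
            problems.append("expected a single file path")
        elif PurePosixPath(text).suffix.lower() != contract.extension.lower():
            problems.append(f"expected extension {contract.extension}")

    for term in contract.required_terms:
        if term.lower() not in lowered:
            problems.append(f"missing required term: {term}")
    for term in contract.forbidden_terms:
        if term.lower() in lowered:
            problems.append(f"forbidden leakage: {term}")
    return problems
```

Returning a list of violations rather than a boolean keeps the failure reason available for an eval report.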

03 Tool-backed artifact

For generated resources, the agent must call the correct builder and create a file. A route that merely promises a deck without creating one should fail.
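A hedged sketch of what "created a file" could mean mechanically. The helper name `artifact_was_built` is hypothetical. It relies on one general fact: Office Open XML files such as .pptx and .docx are ZIP packages with a `[Content_Types].xml` part at the root. It does not judge whether the deck is any good.

```python
import zipfile
from pathlib import Path


def artifact_was_built(path_text, extension=".pptx"):
    """True only if the path points at a real, non-empty Office Open XML file."""
    path = Path(path_text.strip())
    if path.suffix.lower() != extension or not path.is_file():
        return False
    if path.stat().st_size == 0 or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as package:
        # Every OOXML package carries a content-types part at its root.
        return "[Content_Types].xml" in package.namelist()
```

A reply that only promises a deck, or returns a path to a missing or fake file, fails this check even if the chat text sounds successful.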

04 Evidence before live promotion

The promotion summary records source of truth, commit SHA, built artifact hash, loaded test-runtime hash, scenario outcomes, risks, and whether human approval is still required.
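The fields listed above could be held in a record like the following. This is an illustrative sketch: the class name, field names, the `"pass"` outcome string, and the `blockers` method are assumptions, not the project's real promotion summary format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromotionSummary:
    source_of_truth: str
    commit_sha: str
    built_artifact_hash: str
    loaded_runtime_hash: str
    scenario_outcomes: dict = field(default_factory=dict)  # scenario id -> outcome
    risks: tuple = ()
    human_approval_required: bool = True  # default to the cautious answer

    def blockers(self):
        """List the reasons this summary should not be promoted yet."""
        reasons = []
        if self.built_artifact_hash != self.loaded_runtime_hash:
            reasons.append("loaded test runtime does not match the built artifact")
        for scenario, outcome in self.scenario_outcomes.items():
            if outcome != "pass":
                reasons.append(f"scenario {scenario}: {outcome}")
        if self.human_approval_required:
            reasons.append("human approval still required")
        return reasons
```

Comparing the built hash with the loaded hash is what ties scenario results to the code that was actually running when they were produced.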

Eval snippets

These are sanitized snippets from the shape of the TeachClaw eval packs and promotion summaries. They show the kind of behavior that gets checked.

agentic scenario: artifact path
{
  "id": "ppt-macbeth-ambition-analysis",
  "family": "powerpoint",
  "thread": [{
    "message": "Create a Year 11 Macbeth ambition analysis PowerPoint for AQA Paper 1. Build the actual .pptx file and return the final file path only."
  }],
  "expected_output": {
    "kind": "artifact_path",
    "extension": ".pptx"
  }
}

The eval is product-shaped: the route has to produce the actual artifact, not just a nice response.

trace checks: runtime behavior
{
  "required_route_log_substrings": [
    "route=hybrid_staged"
  ],
  "required_exec_patterns": [
    "build-pptx.py --file ... .pptx"
  ],
  "forbidden_exec_patterns": [
    "ad hoc shell generation"
  ],
  "max_exec_calls": 4
}

This catches false success, wrong-route behavior, and unbounded tool use.
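A sketch of how trace checks of this shape might be evaluated. The key names come from the snippets in this section; everything else is an assumption, including the `check_trace` helper, the inputs (a route log string and a list of executed commands), and the choice to treat exec patterns as regular expressions.

```python
import re


def check_trace(checks, route_log, exec_calls):
    """Return a list of failures for one scenario run; empty means it passed."""
    failures = []

    for needle in checks.get("required_route_log_substrings", []):
        if needle not in route_log:
            failures.append(f"route log missing: {needle}")
    for needle in checks.get("forbidden_route_log_substrings", []):
        if needle in route_log:
            failures.append(f"route log contains forbidden text: {needle}")

    # Assumption: exec patterns are regular expressions matched per command.
    for pattern in checks.get("required_exec_patterns", []):
        if not any(re.search(pattern, call) for call in exec_calls):
            failures.append(f"required exec pattern not seen: {pattern}")
    for pattern in checks.get("forbidden_exec_patterns", []):
        if any(re.search(pattern, call) for call in exec_calls):
            failures.append(f"forbidden exec pattern seen: {pattern}")

    limit = checks.get("max_exec_calls")
    if limit is not None and len(exec_calls) > limit:
        failures.append(f"too many exec calls: {len(exec_calls)} > {limit}")
    return failures
```

The same function covers forbidden route-log substrings, so a memory isolation check can reuse it without a separate code path.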

teacher memory: isolation
{
  "required_route_log_substrings": [
    "memory_card route=worksheet",
    "memory_event route=worksheet status=logged"
  ],
  "forbidden_route_log_substrings": [
    "other_teacher_private",
    "raw_memory_dump"
  ]
}

Memory is useful only if it is scoped. The eval checks that the active teacher gets the right context without leakage.

promotion summary: release decision
worksheet-core-generation:
  status: needs_judgement
  mechanical_result: pass
  quality_result: needs_judgement
  drift_status: pass

recommendation:
  status: not ready
  why: at least one latest scenario is failing

A mechanical pass is not treated as production readiness. Teaching quality and live-surface proof still matter.
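The rule that a mechanical pass alone is not enough can be sketched as a small decision function. The result keys mirror the summary snippet in this section; the `recommend` helper and the `"ready"` status value are assumptions for illustration.

```python
RESULT_KEYS = ("mechanical_result", "quality_result", "drift_status")


def recommend(scenarios):
    """Recommend 'ready' only when every result of every scenario is 'pass'.

    scenarios maps a scenario name to a dict holding the keys in RESULT_KEYS.
    """
    reasons = []
    for name, result in scenarios.items():
        for key in RESULT_KEYS:
            value = result.get(key, "missing")
            if value != "pass":
                reasons.append(f"{name}: {key}={value}")
    return {"status": "ready" if not reasons else "not ready", "why": reasons}
```

Under these assumptions, a scenario with a mechanical pass and an open quality judgement still yields "not ready", and the reason names the result that is holding it back.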

Hard problems

The project became useful because the hard parts were not abstract. They appeared through real usage, runtime drift, and teacher-facing quality constraints.

Runtime drift

TeachClaw has repo source, runtime mirrors, gateway-loaded payloads, and live VPS payloads. A fix does not count until it is proven in the layer that is actually loaded.
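One simple way to detect drift between layers is to hash each layer's payload and compare against the repo source. This is a sketch under assumptions: the layer names and the `drifted_layers` helper are hypothetical, and how payloads are actually collected from each layer is not covered here.

```python
import hashlib


def drifted_layers(payloads, reference="repo_source"):
    """Return the names of layers whose payload differs from the reference layer.

    payloads maps a layer name to the raw bytes loaded at that layer.
    """
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in payloads.items()}
    expected = digests[reference]
    return [name for name in payloads if digests[name] != expected]
```

An empty result means every layer is serving the same bytes as the source; any name in the list marks a layer where a fix has not yet landed.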

Artifact quality

A .pptx can render correctly and still teach poorly. The validation model separates mechanical success from teacher-quality judgement.

Live safety

Risky changes move through local tests, local gateway proof, and agentic evals. A guarded live smoke test runs only when explicitly approved.

The real lesson: production AI is not just model output. It is routing, memory, contracts, evals, artifacts, deployment truth, and knowing when not to ship.

FDE relevance

This is why TeachClaw is a strong career signal for AI Deployment and Forward Deployed Engineering roles.

Discovery to system

Turned teacher pain into product surfaces and concrete workflow routes instead of generic AI chat.

System to deployment

Worked across TypeScript, Python builders, Docker/VPS deployment, browser/Telegram surfaces, and runtime debugging.

Deployment to evidence

Converted repeated failure modes into eval packs, promotion summaries, local/live safety boundaries, and public-safe case studies.