Case Study: TeachClaw Deployment

One-Line Version

I built and deployed TeachClaw, a real AI assistant for UK teachers that turns chat messages into teaching artifacts, feedback, and marking support through a live OpenClaw runtime.

Problem

Teachers do not need another dashboard. They need low-friction help inside the workflow they already use: quick resource generation, feedback, marking, and next-lesson planning without fighting a new product surface.

System

TeachClaw is a chat-native AI teaching system across Telegram and browser/site surfaces. It uses OpenClaw as the runtime layer, with task-routing plugins, deterministic builder scripts, and isolated teacher VPS deployments.

Main user-facing capabilities:

Worksheet generation as .docx
Full lesson deck generation as .pptx
Mark schemes from worksheet prompts
Fast ad hoc marking from student work
Structured marking pipelines for longer scripts
Browser chat and generated file downloads
Teacher memory and preference handling

Architecture

flowchart TD
    Teacher["Teacher via Telegram or Browser"] --> Gateway["OpenClaw Gateway"]
    Gateway --> Context["Agent Context: AGENTS, SOUL, USER, MEMORY"]
    Gateway --> Router["Oak Content Plugin / Task Router"]
    Router --> Lesson["Lesson / Deck Route"]
    Router --> Worksheet["Worksheet Route"]
    Router --> Marking["Marking Router"]
    Router --> Memory["Teacher Memory Adapter"]
    Lesson --> PPT["build-pptx.py"]
    Worksheet --> DOCX["build-docx.py"]
    Marking --> Pipeline["OCR / Transcript / Judgement / Report"]
    Memory --> Audit["Memory Events + Scoped Cards"]
    PPT --> Delivery["File Delivery"]
    DOCX --> Delivery
    Pipeline --> Delivery
    Delivery --> Teacher
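
The routing step in the flowchart can be pictured with a minimal sketch. Everything here is an assumption for illustration: the keyword hints, the `RouteDecision` type, and the `route_message` function are hypothetical and do not describe the actual Oak Content Plugin or any OpenClaw API. Only the route names and builder script names come from the diagram.

```python
from dataclasses import dataclass
from typing import Optional

# Route names and builder scripts mirror the flowchart nodes.
# The mapping itself is a hypothetical stand-in for the real plugin config.
ROUTES = {
    "lesson": "build-pptx.py",
    "worksheet": "build-docx.py",
    "marking": "marking-pipeline",
    "memory": "memory-adapter",
}

# Illustrative keyword hints only; a real router would likely use
# model-based intent classification rather than substring matching.
KEYWORDS = [
    ("marking", ("mark this", "mark these", "grade")),
    ("worksheet", ("worksheet",)),
    ("lesson", ("lesson", "deck", "slides")),
    ("memory", ("remember", "my preference")),
]


@dataclass(frozen=True)
class RouteDecision:
    route: str    # which flowchart branch was chosen
    handler: str  # builder script or pipeline that serves the branch


def route_message(text: str) -> Optional[RouteDecision]:
    """Return the first matching route, or None for ordinary chat."""
    lowered = text.lower()
    for route, hints in KEYWORDS:
        if any(hint in lowered for hint in hints):
            return RouteDecision(route, ROUTES[route])
    return None
```

In this sketch a message that matches no hint returns `None`, so it stays a normal chat turn and no builder is triggered.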

What I Owned

Product workflow discovery with real teacher usage.
Runtime routing and task-family design.
Deployment model around OpenClaw gateway, plugins, skills, and VPSes.
Local-to-live validation discipline.
Debugging production incidents across source, runtime, gateway, and deployed payload layers.
Cost/performance model-routing experiments.

Hard Problems

Runtime Drift

TeachClaw has several truth layers: repo source, runtime contract mirror, local gateway payload, and live teacher VPS payload. A bug can appear fixed locally while the gateway still loads stale code. I handled this by checking the exact loaded layer before calling a fix real.
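
One way to check the loaded layer is to compare content hashes between two layers. This is a sketch under stated assumptions, not the tooling actually used: the function names and the idea of treating each layer as a directory tree are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def layer_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_drift(source: dict[str, str], loaded: dict[str, str]) -> dict[str, str]:
    """Report every file whose presence or content differs between layers."""
    drift = {}
    for name in sorted(set(source) | set(loaded)):
        if name not in loaded:
            drift[name] = "missing from loaded layer"
        elif name not in source:
            drift[name] = "not in source"
        elif source[name] != loaded[name]:
            drift[name] = "content differs"
    return drift
```

An empty result from `find_drift(layer_manifest(repo_dir), layer_manifest(gateway_dir))` would indicate that the gateway payload matches the repo source. Here `repo_dir` and `gateway_dir` are placeholder paths.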

Artifact Quality

For teaching artifacts, schema-valid output is not enough. A .pptx can pass mechanical checks but still be pedagogically weak. The validation model separates runtime success from teacher-facing quality judgement.
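
The separation can be made explicit in a data structure that never lets a mechanical pass stand in for a quality verdict. The sketch below is illustrative only: `ArtifactReview` and its status strings are hypothetical names, not part of the TeachClaw codebase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArtifactReview:
    mechanical_ok: bool                # file builds, opens, and is schema-valid
    quality_ok: Optional[bool] = None  # None until a teacher-facing review exists
    notes: list[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        """Mechanical success alone never yields 'accepted'."""
        if not self.mechanical_ok:
            return "build_failed"
        if self.quality_ok is None:
            return "awaiting_quality_review"
        return "accepted" if self.quality_ok else "pedagogically_weak"
```

With this shape, a deck that passes every mechanical check still reports `awaiting_quality_review` until someone records a quality judgement.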

Deployment Safety

Live teacher lanes are pinned and guarded. Risky changes go through local tests, a local test gateway, and agentic evals, and reach a guarded live smoke test only when explicitly approved.
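
The gate ordering can be sketched as a short pipeline that stops at the first failure and refuses the live step without approval. The stage names follow the paragraph above. The `run_promotion` function and its callable-based checks are assumptions made for illustration.

```python
from typing import Callable

# Pre-live gates, in the order described above.
GATES = ["local_tests", "local_test_gateway", "agentic_evals"]


def run_promotion(
    checks: dict[str, Callable[[], bool]], live_approved: bool
) -> list[str]:
    """Run gates in order; return a log of stage outcomes."""
    log = []
    for stage in GATES:
        if not checks[stage]():
            log.append(f"{stage}: failed")
            return log  # stop at the first failing gate
        log.append(f"{stage}: passed")
    if not live_approved:
        # Passing every local gate is still not permission to touch live.
        log.append("live_smoke: skipped (no explicit approval)")
        return log
    ok = checks["live_smoke"]()
    log.append(f"live_smoke: {'passed' if ok else 'failed'}")
    return log
```

The design choice worth noting is that approval is a separate input rather than a check result, so a green local run cannot promote itself.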

Proof

First confirmed classroom use: a founding teacher used TeachClaw to create a Year 8 poetic techniques lesson for the next school day.
Real teacher review covered 10 active days and 95 teacher messages.
Local 4-image literature essay marking completed through the real pipeline end to end.
The local gateway/eval lane now checks routing, context carry, tool discipline, artifact behavior, and memory behavior.
A long-session routing comparison ran 100 teacher-style turns; the trimmed (current) lane completed all 100 at lower cost than a wider-context control.

Deployment Lessons

Pin runtime versions on live teacher VPSes; do not casually upgrade production OpenClaw.
Do not use global plugin allowlists that accidentally block built-in runtime plugins.
Replaying the exact failed teacher input is stronger proof than checking container health.
Runtime contracts, deployment scripts, and generated bundles need parity checks.
Teacher-facing trust is separate from file generation success; never leak internal JSON, build logs, or retry chatter.
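
The last lesson, keeping internal output away from teachers, could be enforced with a final filter on outgoing replies. This is a rough sketch assuming line-oriented replies. The patterns, function names, and fallback text are hypothetical and would need tuning against real runtime output.

```python
import json
import re

# Hypothetical markers of internal output (log levels, tracebacks, retries).
LOG_PATTERN = re.compile(
    r"(Traceback \(most recent call last\)|\bretry(ing)? \d+|\[(DEBUG|INFO|ERROR)\])",
    re.IGNORECASE,
)


def looks_internal(line: str) -> bool:
    """True if a line parses as a JSON object/array or matches a log marker."""
    stripped = line.strip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return True
        except json.JSONDecodeError:
            pass  # not JSON; fall through to the log-pattern check
    return bool(LOG_PATTERN.search(stripped))


def teacher_safe(reply: str, fallback: str = "Sorry, something went wrong. Please try again.") -> str:
    """Drop internal-looking lines; use the fallback if nothing is left."""
    kept = [ln for ln in reply.splitlines() if not looks_internal(ln)]
    cleaned = "\n".join(kept).strip()
    return cleaned or fallback
```

A filter like this is a backstop rather than the primary control. The cleaner fix is to keep build logs and retry state out of the reply channel in the first place.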

Role Relevance

This maps directly to AI Deployment / Forward Deployed Engineering:

Took an ambiguous user workflow and shipped a working AI system.
Built and deployed model-backed applications under real constraints.
Worked directly from user evidence rather than academic benchmarks.
Added evals, routing, deployment gates, and incident learning.
Balanced product speed with live safety.