Act 3 — Build

Shipyard

Knowledge isn't proof. Build something real — then have it reviewed like a staff engineer would.

Shipyard is a guided build loop you run in your own dev environment: a brief gives you a real problem and milestone tickets, a coach helps when you're stuck, and a grader scores the result.The coaching and grading run in Claude Code — no signup, no server. This page is the catalog.

Pick a brief

A real, job-market-relevant problem with constraints — pre-broken into PR-sized milestone tickets.

Design first

Write a short design doc. It gets reviewed like a staff engineer would — before you write code.

Build in your own repo

Your IDE, your stack, your GitHub. Real engineer mode. The brief defines the target, never the implementation.

Get coached

Stuck, off-track, or want to improve? A coach guides you — hint → direction → how it should be done — and explains the why.

Get graded

Scored on system design, correctness, production readiness, docs, and "would this shine in a portfolio" — with a punch list to ship-grade.

Ship it

End state: a deployed, documented, portfolio-ready project — and a readiness signal you can point at.

Flagship brief · AI / agents

Codebase Q&A + PR-review agent

An agent that answers questions about a real codebase (RAG over a repo) and reviews PR diffs with useful comments.

The spine: The eval harness — curated cases (happy/recoverable/unrecoverable/adversarial), LLM-as-judge gating CI, tracing + cost observability, guardrails. ~88% of agents never ship because the harness is too fragile; this is the part that separates a demo from a product.

M1Design doc + eval plan (the contract)
M2Thin vertical slice (one real answer)
M3The eval harness + golden set
M4LLM-as-judge + calibration
M5CI eval gate (block regressions)
M6Tracing + cost/latency observability
M7Guardrails + abstention
M8Deploy + README + demo

More briefs — planned

A catalog, not one project

The flagship is ready; these are next. Each is deliberately distinct from a typical CRUD app — the durable skill (the "spine") is the point, not the surface.

AI / agentsPlanned

Docs assistant that abstains

A RAG assistant over a real project's docs that answers WITH citations — or says 'I don't know'.

Spine Faithfulness & citation evals + calibrated abstention.

AI / agentsPlanned

Messy-doc extraction service

Turn messy PDFs/emails into validated structured data.

Spine Per-field accuracy evals + schema validation guardrails.

AI / agentsPlanned

Multi-step research analyst

Research a question across sources → a cited report.

Spine Agent-loop + claim verification + trajectory evals.

Realtime / distributedPlanned

Realtime collaborative editor

Multiple users editing shared state live — presence, cursors, conflict resolution.

Spine CRDT/OT + a convergence proof under concurrent edits.

Systems / infraPlanned

Your own queue / API gateway

Build a message queue or API gateway from scratch — your design, not a clone.

Spine Delivery guarantees + metrics + a real load test.