Act 3 — Build
Shipyard
Knowledge isn't proof. Build something real, then have it reviewed the way a staff engineer would review it.
Shipyard is a guided build loop you run in your own dev environment: a brief gives you a real problem and milestone tickets, a coach helps when you're stuck, and a grader scores the result. The coaching and grading run in Claude Code: no signup, no server. This page is the catalog.
Pick a brief
A real, job-market-relevant problem with constraints — pre-broken into PR-sized milestone tickets.
Design first
Write a short design doc. It gets reviewed the way a staff engineer would review it, before you write code.
Build in your own repo
Your IDE, your stack, your GitHub. Real engineer mode. The brief defines the target, never the implementation.
Get coached
Stuck, off-track, or want to improve? A coach guides you — hint → direction → how it should be done — and explains the why.
Get graded
Scored on system design, correctness, production readiness, docs, and "would this shine in a portfolio" — with a punch list to ship-grade.
Ship it
End state: a deployed, documented, portfolio-ready project — and a readiness signal you can point at.
Codebase Q&A + PR-review agent
An agent that answers questions about a real codebase (RAG over a repo) and reviews PR diffs with useful comments.
The spine: The eval harness — curated cases (happy/recoverable/unrecoverable/adversarial), LLM-as-judge gating CI, tracing + cost observability, guardrails. ~88% of agents never ship because the harness is too fragile; this is the part that separates a demo from a product.
- M1: Design doc + eval plan (the contract)
- M2: Thin vertical slice (one real answer)
- M3: The eval harness + golden set
- M4: LLM-as-judge + calibration
- M5: CI eval gate (block regressions)
- M6: Tracing + cost/latency observability
- M7: Guardrails + abstention
- M8: Deploy + README + demo
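To make the harness and the CI gate milestones concrete, here is a minimal sketch of how per-category pass rates could be compared against a stored baseline. Everything in it is an assumption for illustration: the `CaseResult` shape, the `pass_rates` and `gate` helpers, the baseline numbers, and the 0.02 tolerance are hypothetical, and the brief leaves the real design to you.

```python
from dataclasses import dataclass

# Case categories as named in the brief; how they are stored is an assumption.
CATEGORIES = ("happy", "recoverable", "unrecoverable", "adversarial")


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one curated eval case (hypothetical shape)."""
    case_id: str
    category: str
    passed: bool


def pass_rates(results):
    """Return the pass rate per category, skipping categories with no cases."""
    rates = {}
    for category in CATEGORIES:
        in_category = [r for r in results if r.category == category]
        if in_category:
            rates[category] = sum(r.passed for r in in_category) / len(in_category)
    return rates


def gate(current, baseline, tolerance=0.02):
    """Return the categories whose pass rate fell more than `tolerance` below baseline."""
    return [
        category
        for category, base in baseline.items()
        if current.get(category, 0.0) < base - tolerance
    ]


# Illustrative data only; not results from any real run.
results = [
    CaseResult("q1", "happy", True),
    CaseResult("q2", "happy", True),
    CaseResult("q3", "adversarial", False),
    CaseResult("q4", "adversarial", True),
]
baseline = {"happy": 1.0, "adversarial": 0.75}

current = pass_rates(results)
regressions = gate(current, baseline)
print(current, regressions)
```

In a CI job, a non-empty `regressions` list would typically translate into a non-zero exit code so the merge is blocked. Keeping the gate per category means a drop on adversarial cases cannot hide behind a strong happy-path score.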
More briefs — planned
A catalog, not one project
The flagship is ready; these are next. Each is deliberately distinct from a typical CRUD app — the durable skill (the "spine") is the point, not the surface.
Docs assistant that abstains
A RAG assistant over a real project's docs that answers WITH citations — or says 'I don't know'.
Spine: Faithfulness & citation evals + calibrated abstention.
Messy-doc extraction service
Turn messy PDFs/emails into validated structured data.
Spine: Per-field accuracy evals + schema validation guardrails.
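As a rough illustration of what this spine could look like in practice, the sketch below scores extracted records field by field against a gold set and runs a simple type check as a guardrail. The schema, the field names (`invoice_id`, `total`), and the sample records are invented for the example; a real service would likely use a dedicated validation library.

```python
# Hypothetical target schema: field name -> required Python type.
SCHEMA = {"invoice_id": str, "total": float}


def validate(record, schema):
    """Return a list of schema errors: missing fields or wrong types."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors


def per_field_accuracy(predictions, gold):
    """Return exact-match accuracy for each field across aligned records."""
    return {
        field: sum(p.get(field) == g[field] for p, g in zip(predictions, gold)) / len(gold)
        for field in gold[0]
    }


# Illustrative data only.
gold = [
    {"invoice_id": "A1", "total": 10.0},
    {"invoice_id": "B2", "total": 20.5},
]
predictions = [
    {"invoice_id": "A1", "total": 10.0},
    {"invoice_id": "B2", "total": "20.5"},  # right value, wrong type
]

accuracy = per_field_accuracy(predictions, gold)
errors = [validate(p, SCHEMA) for p in predictions]
print(accuracy, errors)
```

Reporting accuracy per field rather than per document shows which fields the extractor gets wrong, and the schema check catches outputs that look right to a human but would break a downstream consumer.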
Multi-step research analyst
Research a question across sources → a cited report.
Spine: Agent-loop + claim verification + trajectory evals.
Realtime collaborative editor
Multiple users editing shared state live — presence, cursors, conflict resolution.
Spine: CRDT/OT + a convergence proof under concurrent edits.
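For a feel of what convergence means here, the sketch below uses a grow-only counter (G-Counter), one of the simplest CRDTs, and checks that three replicas reach the same state whatever order their updates are merged in. This is an exhaustive check over one small example, not a proof, and a collaborative editor would need a much richer sequence CRDT or OT scheme. The replica names and helper functions are assumptions for illustration.

```python
from itertools import permutations


def increment(state, replica, amount=1):
    """Return a new state with `replica`'s own slot increased by `amount`."""
    new_state = dict(state)
    new_state[replica] = new_state.get(replica, 0) + amount
    return new_state


def merge(a, b):
    """Merge two states by taking the per-replica maximum."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(state):
    """The counter's value is the sum of all replica slots."""
    return sum(state.values())


# Three replicas update concurrently, without seeing each other.
a = increment(increment({}, "a"), "a")
b = increment({}, "b")
c = increment({}, "c", 3)

# Merge the three states in every possible order.
merged_states = []
for order in permutations([a, b, c]):
    state = {}
    for replica_state in order:
        state = merge(state, replica_state)
    merged_states.append(state)

converged = all(s == merged_states[0] for s in merged_states)
print(converged, value(merged_states[0]))
```

The merge is a per-slot maximum, which is commutative, associative, and idempotent. Those three properties are what an actual convergence argument would rest on; the loop above only confirms them on one example.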
Your own queue / API gateway
Build a message queue or API gateway from scratch — your design, not a clone.
Spine: Delivery guarantees + metrics + a real load test.