Throughline

Act 3 — Build

Shipyard

Knowledge isn't proof. Build something real, then have it reviewed the way a staff engineer would review it.

Shipyard is a guided build loop you run in your own dev environment: a brief gives you a real problem and milestone tickets, a coach helps when you're stuck, and a grader scores the result. The coaching and grading run in Claude Code: no signup, no server. This page is the catalog.

01

Pick a brief

A real, job-market-relevant problem with constraints — pre-broken into PR-sized milestone tickets.

02

Design first

Write a short design doc. It gets reviewed the way a staff engineer would review it, before you write any code.

03

Build in your own repo

Your IDE, your stack, your GitHub. Real engineer mode. The brief defines the target, never the implementation.

04

Get coached

Stuck, off-track, or want to improve? A coach guides you — hint → direction → how it should be done — and explains the why.

05

Get graded

Scored on system design, correctness, production readiness, docs, and "would this shine in a portfolio" — with a punch list to ship-grade.

06

Ship it

End state: a deployed, documented, portfolio-ready project — and a readiness signal you can point at.

Flagship brief · AI / agents

Codebase Q&A + PR-review agent

An agent that answers questions about a real codebase (RAG over a repo) and reviews PR diffs with useful comments.

The spine: The eval harness — curated cases (happy/recoverable/unrecoverable/adversarial), LLM-as-judge gating CI, tracing + cost observability, guardrails. ~88% of agents never ship because the harness is too fragile; this is the part that separates a demo from a product.

  1. M1: Design doc + eval plan (the contract)
  2. M2: Thin vertical slice (one real answer)
  3. M3: The eval harness + golden set
  4. M4: LLM-as-judge + calibration
  5. M5: CI eval gate (block regressions)
  6. M6: Tracing + cost/latency observability
  7. M7: Guardrails + abstention
  8. M8: Deploy + README + demo
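
To make the M3 and M5 milestones concrete, here is a minimal sketch of a CI eval gate in Python. It is an illustration under assumptions, not part of the brief: the names `CaseResult`, `pass_rates`, and `gate`, the 0.02 tolerance, and the sample data are all hypothetical. Only the four case categories come from the brief's wording.

```python
from dataclasses import dataclass

# Case categories as named in the brief.
CATEGORIES = ("happy", "recoverable", "unrecoverable", "adversarial")


@dataclass
class CaseResult:
    case_id: str
    category: str  # one of CATEGORIES
    passed: bool   # verdict from the judge (LLM-based or rule-based)


def pass_rates(results):
    """Return the pass rate per category, skipping categories with no cases."""
    rates = {}
    for category in CATEGORIES:
        in_category = [r for r in results if r.category == category]
        if in_category:
            rates[category] = sum(r.passed for r in in_category) / len(in_category)
    return rates


def gate(current, baseline, tolerance=0.02):
    """Return categories whose pass rate fell more than `tolerance` below baseline."""
    return [
        category
        for category, base in baseline.items()
        if current.get(category, 0.0) < base - tolerance
    ]


# Hypothetical run: the adversarial pass rate dropped from 0.75 to 0.5.
results = [
    CaseResult("q1", "happy", True),
    CaseResult("q2", "happy", True),
    CaseResult("q3", "adversarial", True),
    CaseResult("q4", "adversarial", False),
]
baseline = {"happy": 1.0, "adversarial": 0.75}
regressions = gate(pass_rates(results), baseline)
print(regressions)
```

In a CI job, a non-empty `regressions` list would typically translate into a non-zero exit code so that the merge is blocked.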

More briefs — planned

A catalog, not one project

The flagship is ready; these are next. Each is deliberately distinct from a typical CRUD app — the durable skill (the "spine") is the point, not the surface.

AI / agents · Planned

Docs assistant that abstains

A RAG assistant over a real project's docs that answers WITH citations — or says 'I don't know'.

Spine: Faithfulness & citation evals + calibrated abstention.
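
One simple way to approach abstention is to answer only above a confidence threshold chosen on a validation set. The sketch below assumes each answer already carries a confidence score (where that score comes from is left open), and the names `pick_threshold` and `respond` plus the sample data are hypothetical. It is a basic selective-answering heuristic, not a full calibration method.

```python
ABSTAIN = "I don't know"


def pick_threshold(scored, target_accuracy):
    """Pick the lowest threshold whose answered set meets the target accuracy.

    `scored` is a list of (confidence, was_correct) pairs from a validation set.
    Returns None when no threshold reaches the target.
    """
    for threshold in sorted({conf for conf, _ in scored}):
        answered = [ok for conf, ok in scored if conf >= threshold]
        if sum(answered) / len(answered) >= target_accuracy:
            return threshold
    return None


def respond(answer, confidence, threshold):
    """Return the answer, or abstain when confidence is below the threshold."""
    if threshold is None or confidence < threshold:
        return ABSTAIN
    return answer


# Hypothetical validation data: (confidence, was the answer correct?)
validation = [
    (0.95, True), (0.9, True), (0.8, True),
    (0.6, False), (0.55, True), (0.3, False),
]
threshold = pick_threshold(validation, target_accuracy=0.75)
coverage = sum(conf >= threshold for conf, _ in validation) / len(validation)
print(threshold, coverage)
```

The trade-off to report is accuracy on answered questions against coverage: a higher target accuracy generally means the assistant abstains more often.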

AI / agents · Planned

Messy-doc extraction service

Turn messy PDFs/emails into validated structured data.

Spine: Per-field accuracy evals + schema validation guardrails.
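
As an illustration of what this spine could look like, the sketch below pairs a schema validation guardrail with a per-field exact-match accuracy score. The invoice-like schema, the field names, and the sample records are assumptions made for the example; a real build would likely use a schema library and fuzzier matching for some fields.

```python
from datetime import date


def _is_iso_date(value):
    """True when `value` is a valid ISO date string such as 2024-03-01."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical schema: field name -> validator function.
SCHEMA = {
    "vendor": lambda v: isinstance(v, str) and v.strip() != "",
    "total": lambda v: isinstance(v, (int, float))
    and not isinstance(v, bool)
    and v >= 0,
    "due_date": _is_iso_date,
}


def validate(record):
    """Return the fields that are missing or fail their validator."""
    return [f for f, ok in SCHEMA.items() if f not in record or not ok(record[f])]


def field_accuracy(predictions, golds):
    """Per-field exact-match accuracy over paired predicted and gold records."""
    return {
        f: sum(p.get(f) == g.get(f) for p, g in zip(predictions, golds)) / len(golds)
        for f in SCHEMA
    }


# Hypothetical extraction output and hand-labeled gold records.
predictions = [
    {"vendor": "Acme", "total": 120.0, "due_date": "2024-03-01"},
    {"vendor": "Globex", "total": -5, "due_date": "2024-13-01"},
]
golds = [
    {"vendor": "Acme", "total": 120.0, "due_date": "2024-03-01"},
    {"vendor": "Globex", "total": 5, "due_date": "2024-12-01"},
]
print([validate(p) for p in predictions])
print(field_accuracy(predictions, golds))
```

Validation catches records that are structurally wrong at runtime, while per-field accuracy measures how often structurally valid output is also correct; the two checks answer different questions.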

AI / agents · Planned

Multi-step research analyst

Research a question across sources → a cited report.

Spine: Agent-loop + claim verification + trajectory evals.

Realtime / distributed · Planned

Realtime collaborative editor

Multiple users editing shared state live — presence, cursors, conflict resolution.

Spine: CRDT/OT + a convergence proof under concurrent edits.
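
To show what convergence means in practice, the sketch below uses a grow-only counter (G-Counter), one of the simplest CRDTs, as a stand-in for a much more involved text CRDT. It checks that three replicas reach the same state regardless of merge order. This is a test over one scenario rather than a proof, and the helper names are hypothetical.

```python
from itertools import permutations


def increment(state, replica, amount=1):
    """Return a new G-Counter state with `replica`'s slot increased."""
    new_state = dict(state)
    new_state[replica] = new_state.get(replica, 0) + amount
    return new_state


def merge(a, b):
    """Join two states by taking the per-replica maximum."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(state):
    """The counter's value is the sum of all replica slots."""
    return sum(state.values())


def merge_all(states):
    """Fold a sequence of states into one with `merge`."""
    merged = {}
    for state in states:
        merged = merge(merged, state)
    return merged


# Three replicas edit concurrently; c has already seen a's edits.
a = increment(increment({}, "a"), "a")
b = increment({}, "b", 3)
c = increment(a, "c")

# Merge in every possible order and compare the resulting states.
outcomes = [merge_all(order) for order in permutations([a, b, c])]
converged = all(outcome == outcomes[0] for outcome in outcomes)
print(converged, value(outcomes[0]))
```

The property being exercised is that `merge` is commutative, associative, and idempotent; an argument that those three properties hold for the chosen data structure is the core of a convergence proof.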

Systems / infra · Planned

Your own queue / API gateway

Build a message queue or API gateway from scratch — your design, not a clone.

Spine: Delivery guarantees + metrics + a real load test.