back to labs
lab self-directed prototype applied llm · rag 2026

greenroom.

A development copilot built to change behavior, not just deliver content.

I spotted where talent development stalls, then scoped, built, and evaluated a working applied-AI fix, solo, before a single engineer was involved. Real RAG, a real eval harness, and a documented path to production. Live prototype below.

the live prototype

see it run.

The actual working app, embedded live. Open it full screen to ask a real question and watch it retrieve, cite, refuse the out-of-scope ones, and follow up.

greenroom-strand.netlify.app open full screen ↗
the opportunity i saw

development stalls in a catalog nobody opens.

Teams have good content and no signal that it changed anything. The opening: an assistant that meets people at the moment of need, and closes the loop on whether the advice actually landed.

most ai tools stop at delivering content. greenroom is built around the part they skip: behavior change.

what i built

a copilot that closes the loop.

An employee asks a real question in plain words. Greenroom answers from a curated people development library using real retrieval (RAG) with citations, or an honest "not covered". It proposes a verb-first next step, then follows up with a "did it land?" check that records the behavior, not just the click. Built solo with Claude Code across six surfaces.

01

ask in plain words

A real question at the moment of need. No course catalog to dig through, no keyword guessing.

02

grounded answer or honest refusal

Answers come from the curated library with citations. Outside it, the answer is "not covered", not a guess.

03

a verb-first next step

Every response ends in one concrete action the person can actually do, not a reading list.

04

a "did it land?" follow-up

It checks back and records the behavior, not just the click. Kirkpatrick Level 3, built into the flow.

how it works

real rag, with its wiring on display.

Hosted Voyage embeddings over a governed library, similarity retrieval with a refusal threshold. An agentic loop runs underneath: a cheaper model triages and routes each question, answer, clarify, or refuse, before a stronger model writes the grounded response. Every answer exposes its own wiring on a run sheet.

01

real retrieval

Hosted Voyage embeddings over a governed library. Similarity retrieval with a refusal threshold, so weak matches get declined instead of faked.

02

an agentic loop

A cheaper model triages and routes each question before a stronger model writes the answer. Right-sized compute per step.

03

a run sheet

Every answer shows how it was produced: what was retrieved, how it routed, what it grounded against. Nothing hidden.

how i proved it works

a real eval harness. 22 of 22 pass.

A labeled evaluation suite runs all 22 cases through the real pipeline: 14 in-scope, 4 out-of-scope, 4 adversarial and prompt-injection. Not vibes, a harness.

22/22 cases pass, full pipeline
100% retrieval hit-rate
100% refusal accuracy · 4/4 red-team declined
92/100 judge score · mean, range 88 to 97

Two grounding checks on purpose: 100% deterministic citation integrity in code, plus an independent judge model that scores how well each answer traces to its sources, a mean of 92/100. So the system never grades its own work.

responsible by design

guardrails from the first commit.

from prototype to production

a documented handoff.

A prototype that knows what production would take.

what it demonstrates

the whole loop, solo.

Spotting the highest-leverage opportunity, scoping it, and building a working prototype solo, before engaging a single engineer. Validating it with a real eval harness. Defining the production handoff: cost and latency per answer, the evals as a release gate, monitoring, a clear buy-vs-build line, and a privacy and fairness review from the start. This is the space between strategy and engineering, owned end to end. Outcomes, not outputs.

claude code rag voyage embeddings agentic routing eval harness netlify
see it run

try greenroom.

A working prototype. Ask it a real question, watch it retrieve, refuse the out-of-scope ones, and follow up on whether the advice landed.

open the live demo