labs public example parallel to airbnb ai101 2026

greenroom.

Greenroom is my public example of the type of AI enablement solution and thinking I bring to Fortune 500 work. I built it in parallel to designing performance-task modules for Airbnb's AI101 program.

A development copilot built to change behavior, not just deliver content. I spotted where talent development stalls, then scoped, built, and evaluated a working applied-AI fix, solo, before a single engineer was involved. Real RAG, a real eval harness, and a documented path to production. Live prototype below.

open the live demo view the code

applied ai
rag
agentic loop
eval harness
behavior change
solo build

the live prototype

see it run.

The actual working app, embedded live. Open it full screen to ask a real question and watch it retrieve, cite, refuse the out-of-scope ones, and follow up.

greenroom-strand.netlify.app open full screen ↗

open the live demo ↗

the opportunity i saw

development stalls in a catalog nobody opens.

Teams have good content and no signal that it changed anything. The opening: an assistant that meets people at the moment of need, and closes the loop on whether the advice actually landed.

most ai tools stop at delivering content. greenroom is built around the part they skip: behavior change.

what i built

a copilot that closes the loop.

An employee asks a real question in plain words. Greenroom answers from a curated people development library using real retrieval (RAG) with citations, or an honest "not covered". It proposes a verb-first next step, then follows up with a "did it land?" check that records the behavior, not just the click. Built solo with Claude Code across six surfaces.

ask in plain words

A real question at the moment of need. No course catalog to dig through, no keyword guessing.

grounded answer or honest refusal

Answers come from the curated library with citations. Outside it, the answer is "not covered", not a guess.

a verb-first next step

Every response ends in one concrete action the person can actually do, not a reading list.

a "did it land?" follow-up

It checks back and records the behavior, not just the click. Kirkpatrick Level 3, built into the flow.

how it works

real rag, with its wiring on display.

Hosted Voyage embeddings over a governed library, similarity retrieval with a refusal threshold. An agentic loop runs underneath: a cheaper model triages and routes each question, answer, clarify, or refuse, before a stronger model writes the grounded response. Every answer exposes its own wiring on a run sheet.

real retrieval

Hosted Voyage embeddings over a governed library. Similarity retrieval with a refusal threshold, so weak matches get declined instead of faked.

an agentic loop

A cheaper model triages and routes each question before a stronger model writes the answer. Right-sized compute per step.

a run sheet

Every answer shows how it was produced: what was retrieved, how it routed, what it grounded against. Nothing hidden.

how i proved it works

a real eval harness. 22 of 22 pass.

A labeled evaluation suite runs all 22 cases through the real pipeline: 14 in-scope, 4 out-of-scope, 4 adversarial and prompt-injection. Not vibes, a harness.

22/22 cases pass, full pipeline

100% retrieval hit-rate

100% refusal accuracy · 4/4 red-team declined

92/100 judge score · mean, range 88 to 97

Two grounding checks on purpose: 100% deterministic citation integrity in code, plus an independent judge model that scores how well each answer traces to its sources, a mean of 92/100. So the system never grades its own work.

responsible by design

guardrails from the first commit.

Stays in scope. Answers come only from inside the governed library. No freelancing.
No judgments about people. It never rates or decides about a real person.
Injection-resistant. It ignores instructions hidden in documents or in the question itself.
A documented review path. A clear route to legal and infosec privacy and fairness review before any real launch.

from prototype to production

a documented handoff.

A prototype that knows what production would take.

Latency and cost measured per answer.
The eval suite as a CI release gate, so quality is checked on every change.
Monitoring on groundedness and refusals once it is live.
A single-team rollout before anything wider.
A clear buy-vs-build line for what to keep and what to replace.

what it demonstrates

the whole loop, solo.

Spotting the highest-leverage opportunity, scoping it, and building a working prototype solo, before engaging a single engineer. Validating it with a real eval harness. Defining the production handoff: cost and latency per answer, the evals as a release gate, monitoring, a clear buy-vs-build line, and a privacy and fairness review from the start. This is the space between strategy and engineering, owned end to end. Outcomes, not outputs.

claude code rag voyage embeddings agentic routing eval harness netlify

see it run

try greenroom.

A working prototype. Ask it a real question, watch it retrieve, refuse the out-of-scope ones, and follow up on whether the advice landed.

open the live demo view the code on github →