Evals
How I keep the AI panels on this site honest: a rubric, a few prompt iterations, and human scores. Open a section to see its history, or run your own prompt and grade the output yourself.
A repeatable way to tell if a prompt is actually getting better
When you tweak a prompt, gut feel can’t tell you whether the new version is genuinely better or just different. An eval is a small process — write a rubric, run prompts on the same inputs, score the outputs — that lets you compare iterations honestly and ship the one that wins on the dimensions you care about.
- 01 Define a rubric
Pick the qualities that matter for this output and assign each a weight, with the weights summing to one.
- 02 Write versions
Iterate on the prompt — each revision is a hypothesis about why the output should be better.
- 03 Run on real tasks
Generate outputs for a fixed set of articles so versions can be compared like-for-like.
- 04 Score against the rubric
Score each output 1–5 on every dimension. Notes capture the reasoning, not just the number.
- 05 Ship the winner
Promote the highest-scoring version to production; keep the history so the why stays legible.
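The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this site: the dimension names, the weights, the version labels, and the choice to average the weighted score across tasks are all assumptions made for the example.

```python
# Hypothetical rubric: dimension names and weights are illustrative only.
RUBRIC = {"accuracy": 0.5, "clarity": 0.3, "concision": 0.2}


def validate_rubric(rubric):
    """Step 01: the weights must sum to one."""
    total = sum(rubric.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"rubric weights sum to {total}, expected 1.0")


def weighted_score(rubric, scores):
    """Step 04: combine 1-5 scores on every dimension into one number."""
    for dimension in rubric:
        value = scores[dimension]  # KeyError if a dimension was not scored
        if not 1 <= value <= 5:
            raise ValueError(f"{dimension} score {value} is outside 1-5")
    return sum(weight * scores[dim] for dim, weight in rubric.items())


def version_score(rubric, task_scores):
    """Assumed aggregation: mean weighted score over the fixed task set."""
    return sum(weighted_score(rubric, s) for s in task_scores) / len(task_scores)


def pick_winner(rubric, results):
    """Step 05: return the version label with the highest mean score."""
    return max(results, key=lambda version: version_score(rubric, results[version]))


# Steps 02-03: two hypothetical prompt versions scored on the same two tasks.
RESULTS = {
    "v1": [
        {"accuracy": 3, "clarity": 4, "concision": 3},
        {"accuracy": 4, "clarity": 3, "concision": 3},
    ],
    "v2": [
        {"accuracy": 4, "clarity": 4, "concision": 3},
        {"accuracy": 5, "clarity": 4, "concision": 2},
    ],
}

validate_rubric(RUBRIC)
winner = pick_winner(RUBRIC, RESULTS)
```

Because every version is scored on the same tasks with the same weights, a difference in `version_score` reflects the prompt change rather than a change in inputs. The notes that accompany each score are not modeled here; in practice they would sit alongside each score dictionary.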
Sections in the panel
Click a section to open the runner with its shipped prompt loaded.
TL;DR
Same neutral two-paragraph register, but with a hard ban on any self-reference to the source (no "this article", "the text", "the piece", etc.) and a requirement that the second paragraph anchor at least one point in something concrete from the article.
Explain Like I'm 5
Keep the analogies; allow up to two well-chosen emojis so the section reads as visually distinct from the rest of the panel.
Counterpoint
Same charitable steel-man framing as v3, but expand the ask from 2–3 to 3–4 opposing positions so the section captures a fuller range of the perspectives the article warrants.
Questions
Pair each question with an answer drawn directly from the article, in the same neutral register as the TL;DR.
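Some of the constraints in these prompts are mechanical enough to check before a human scores the output. As a rough sketch, the TL;DR rules (two paragraphs, no self-reference to the source) could be pre-screened as below. The function names are hypothetical, and the banned list contains only the three phrases the prompt names explicitly; whatever else the prompt's "etc." covers is not guessed at here.

```python
import re

# Only the phrases named in the TL;DR prompt; the full ban list is not known.
BANNED_PHRASES = ("this article", "the text", "the piece")


def find_self_references(output):
    """Return the banned phrases that appear as whole words in the output."""
    lowered = output.lower()
    return [
        phrase
        for phrase in BANNED_PHRASES
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
    ]


def paragraph_count(output):
    """Count non-empty paragraphs separated by one or more blank lines."""
    return len([p for p in re.split(r"\n\s*\n", output.strip()) if p.strip()])


def passes_tldr_format(output):
    """True if the output has exactly two paragraphs and no banned phrase."""
    return paragraph_count(output) == 2 and not find_self_references(output)
```

A check like this only catches format violations. Whether the second paragraph really anchors a point in something concrete from the article still needs a human score against the rubric.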