Evals

How I keep the AI panels on this site honest: a rubric, a few prompt iterations, and human scores. Open a section to see its history, or run your own prompt and grade the output yourself.

What’s an eval?

A repeatable way to tell if a prompt is actually getting better

When you tweak a prompt, gut feel can’t tell you whether the new version is genuinely better or just different. An eval is a small process — write a rubric, run prompts on the same inputs, score the outputs — that lets you compare iterations honestly and ship the one that wins on the dimensions you care about.

01. Define a rubric
Pick the qualities that matter for this output and assign each a weight so that the weights sum to one.
02. Write versions
Iterate on the prompt; each revision is a hypothesis about why the output should be better.
03. Run on real tasks
Generate outputs for a fixed set of articles so versions can be compared like-for-like.
04. Score against the rubric
Score each output 1–5 on every dimension. Notes capture the reasoning, not just the number.
05. Ship the winner
Promote the highest-scoring version to production; keep the history so the why stays legible.
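
The five steps above can be sketched as a small comparison routine. This is only an illustration, not the code behind this site: the function names (`overall`, `pick_winner`), the two-dimension rubric, and every score in the example are made up, and averaging the weighted score across tasks is an assumed way of deciding which version "wins".

```python
from statistics import mean


def overall(scores, weights):
    """Weighted score for one output: sum of weight * score per dimension."""
    return sum(weights[dim] * scores[dim] for dim in weights)


def pick_winner(scores_by_version, weights):
    """Return (best_version, mean_score_per_version).

    scores_by_version maps a version label to a list of per-task score
    dicts, one dict per article in the fixed task set.
    """
    # Like-for-like: every version must be scored on the same number of tasks.
    task_counts = {len(runs) for runs in scores_by_version.values()}
    if len(task_counts) != 1 or 0 in task_counts:
        raise ValueError("every version must be scored on the same non-empty task set")

    means = {
        version: mean(overall(scores, weights) for scores in runs)
        for version, runs in scores_by_version.items()
    }
    # On a tie, max() keeps the first version encountered.
    best = max(means, key=means.get)
    return best, means


# Hypothetical two-dimension rubric and scores, for illustration only.
weights = {"accuracy": 0.6, "tone": 0.4}
example = {
    "v1": [{"accuracy": 3, "tone": 4}, {"accuracy": 4, "tone": 4}],
    "v2": [{"accuracy": 5, "tone": 4}, {"accuracy": 4, "tone": 5}],
}
best, means = pick_winner(example, weights)
print(best)  # v2
```

Returning the per-version means alongside the winner keeps the history inspectable, so a later reader can see how close the runner-up was.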

The rubric for this suite

Tone: 20%
Accuracy: 25%
Format adherence: 20%
Concision: 15%
Voice consistency: 20%
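
As a rough sketch of how these weights could turn five 1–5 dimension scores into one number out of 5: the code below assumes the overall score is a plain weighted average, which the page does not state outright. The snake_case keys, the `weighted_score` helper, and the example scores are all hypothetical.

```python
from math import isclose

# Weights copied from the rubric above; the key names are my own.
RUBRIC = {
    "tone": 0.20,
    "accuracy": 0.25,
    "format_adherence": 0.20,
    "concision": 0.15,
    "voice_consistency": 0.20,
}


def weighted_score(scores, rubric=RUBRIC):
    """Combine per-dimension 1-5 scores into one score on the same 1-5 scale."""
    if not isclose(sum(rubric.values()), 1.0):
        raise ValueError("rubric weights must sum to 1")
    if set(scores) != set(rubric):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside the 1-5 range")
    return sum(rubric[name] * scores[name] for name in rubric)


# Made-up scores for a single output, not taken from any real run.
example_scores = {
    "tone": 5,
    "accuracy": 4,
    "format_adherence": 5,
    "concision": 3,
    "voice_consistency": 4,
}
print(round(weighted_score(example_scores), 2))  # 4.25
```

Because the weights sum to one, the result stays on the 1–5 scale, so a weaker score on a heavily weighted dimension such as accuracy pulls the total down more than the same drop in concision.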

Sections in the panel

Click a section to open the runner with its shipped prompt loaded.

Encyclopaedic

4 versions · shipped v4

TL;DR

Same neutral two-paragraph register, but with a hard ban on any self-reference to the source (no "this article", "the text", "the piece", etc.) and a requirement that the second paragraph anchor at least one point in something concrete from the article.

5.0 / 5

Playful

5 versions · shipped v3

Explain Like I'm 5

Keep the analogies; allow up to two well-chosen emojis so the section reads as visually distinct from the rest of the panel.

3.6 / 5

Steel-man

4 versions · shipped v4

Counterpoint

Same charitable steel-man framing as v3, but expand the ask from 2–3 to 3–4 opposing positions so the section covers more of the range of perspectives the article warrants.

Not scored yet

Q&A

3 versions · shipped v3

Questions

Pair each question with an answer drawn directly from the article, in the same neutral register as the TL;DR.

Not scored yet