Evals

How I keep the AI panels on this site honest: a rubric, a few prompt iterations, and human scores. Open a section to see its history, or run your own prompt and grade the output yourself.

What’s an eval?

A repeatable way to tell if a prompt is actually getting better

When you tweak a prompt, gut feel can’t tell you whether the new version is genuinely better or just different. An eval is a small process (write a rubric, run every prompt version on the same inputs, score the outputs) that lets you compare iterations honestly and ship the one that wins on the dimensions you care about.

  1. Define a rubric

    Pick the qualities that matter for this output and assign each a weight; together the weights must sum to one.

  2. Write versions

    Iterate on the prompt. Each revision is a hypothesis about why the output should be better.

  3. Run on real tasks

    Generate outputs for a fixed set of articles so versions can be compared like-for-like.

  4. Score against the rubric

    Score each output 1–5 on every dimension. Notes capture the reasoning, not just the number.

  5. Ship the winner

    Promote the highest-scoring version to production; keep the history so the why stays legible.
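
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this site: the rubric, version names, article ids, and scores are all hypothetical, and averaging the weighted score across tasks is an assumption about how per-task scores are combined.

```python
from statistics import mean

# Step 1: hypothetical rubric, dimension -> weight (weights sum to one).
RUBRIC = {"accuracy": 0.6, "concision": 0.4}

# Steps 2-4: hypothetical human scores (1-5), keyed by
# prompt version -> task (article) -> rubric dimension.
SCORES = {
    "v1": {
        "article-a": {"accuracy": 3, "concision": 4},
        "article-b": {"accuracy": 4, "concision": 3},
    },
    "v2": {
        "article-a": {"accuracy": 5, "concision": 4},
        "article-b": {"accuracy": 4, "concision": 4},
    },
}


def weighted_score(scores, rubric):
    """Collapse one output's per-dimension scores into a single number."""
    return sum(rubric[dim] * scores[dim] for dim in rubric)


def version_score(per_task, rubric):
    """Average the weighted score over every task a version was run on."""
    return mean(weighted_score(s, rubric) for s in per_task.values())


def pick_winner(all_scores, rubric):
    """Step 5: return the version with the highest average weighted score."""
    # Like-for-like comparison: every version must cover the same task set.
    task_sets = {frozenset(tasks) for tasks in all_scores.values()}
    if len(task_sets) != 1:
        raise ValueError("versions were not scored on the same tasks")
    return max(all_scores, key=lambda v: version_score(all_scores[v], rubric))


if __name__ == "__main__":
    for version, per_task in SCORES.items():
        print(version, round(version_score(per_task, RUBRIC), 2))
    print("ship:", pick_winner(SCORES, RUBRIC))
```

The task-set check is the part that is easy to skip in practice: if one version was only scored on the easy articles, its average is not comparable to the others.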

The rubric for this suite
Tone: 20%
Accuracy: 25%
Format adherence: 20%
Concision: 15%
Voice consistency: 20%
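
As a worked example, the sketch below turns the weights above into a single score out of 5. The weights come from this rubric; the function names, the dictionary keys, the rounding to one decimal place, and the sample scores are assumptions made for illustration.

```python
import math

# Weights from the rubric above, written as fractions of one.
RUBRIC = {
    "tone": 0.20,
    "accuracy": 0.25,
    "format_adherence": 0.20,
    "concision": 0.15,
    "voice_consistency": 0.20,
}


def validate_rubric(rubric):
    """Fail loudly if the weights do not add up to one."""
    total = sum(rubric.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"rubric weights sum to {total}, expected 1.0")


def overall(scores, rubric=RUBRIC):
    """Weighted score out of 5 for a single output, rounded to one decimal."""
    validate_rubric(rubric)
    if set(scores) != set(rubric):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for dimension, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{dimension} score {score} is outside 1-5")
    return round(sum(rubric[d] * scores[d] for d in rubric), 1)


# Hypothetical scores for one output, not taken from this site's history.
example = {
    "tone": 4,
    "accuracy": 5,
    "format_adherence": 4,
    "concision": 3,
    "voice_consistency": 4,
}

if __name__ == "__main__":
    # 0.20*4 + 0.25*5 + 0.20*4 + 0.15*3 + 0.20*4 = 4.1
    print(f"{overall(example)} / 5")
```

Because accuracy carries the largest weight, a one-point drop there costs 0.25 overall, while the same drop in concision costs only 0.15.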

Sections in the panel

Click a section to open the runner with its shipped prompt loaded.
Encyclopaedic
4 versions · shipped v4

TL;DR

Same neutral two-paragraph register, but with a hard ban on any self-reference to the source (no "this article", "the text", "the piece", etc.) and a requirement that the second paragraph anchor at least one point in something concrete from the article.

5.0 / 5

Playful
5 versions · shipped v3

Explain Like I'm 5

Keep the analogies; allow up to two well-chosen emojis so the section reads as visually distinct from the rest of the panel.

3.6 / 5

Steel-man
4 versions · shipped v4

Counterpoint

Same charitable steel-man framing as v3, but expand the ask from 2–3 to 3–4 opposing positions so the section captures a fuller range of perspectives the article warrants.

Not scored yet

Q&A
3 versions · shipped v3

Questions

Pair each question with an answer drawn directly from the article, in the same neutral register as the TL;DR.

Not scored yet