After Nate B. Jones' private bench: Dingo, Splash Brothers, Artemis II.

Stop reading model reviews.
Build your own benchmark.

Public benchmarks tell you which model is best at average tasks. They saturate fast and they don't tell you which one to reach for when your messy work hits your desk on Tuesday. MyBench is a 45-minute interview that turns your actual work into a private benchmark suite, and runs it across model × harness combos so you know what to use, when.

New: a 30-second calibration probe tells you which models are even worth running through the grid before you spend the time.

Every model release is a swarm of percentage points.

87.3 vs 67.0 vs 49.8. The numbers come from public benchmarks the labs train against. By the time the leaderboard publishes, the test is half-saturated.

None of it tells you which one to reach for.

Easy benchmarks make all frontier models look interchangeable. The differences only show up on real work: underspecified briefs, messy files, traps in the data, ugly judgment calls.

A private benchmark, tuned to your work.

Three to five tests designed to fail the current frontier. Real artifacts. Planted traps. Re-run weekly, or whenever a new model ships. Then you know.

One interview. One probe per model. A benchmark suite that's yours.

  1. Copy the prompt

    Or install it as a skill. Hand it to any AI agent with shell access: Claude Code, Codex, Cursor, Paperclip.

  2. Answer the interview

    Six sections covering your real work, your messy data, your taste, your standards. ~45 min.

  3. Get a benchmark suite

    Three to five tests with prompts, input files, planted traps, and an evidence guide for scoring.

  4. Probe each model

    30 seconds per model. Skip PICKS_A_NUMBER and JITTERY before they waste your grid time. CALIBRATED, INFLATION_LIKELY, DEFLATION_LIKELY pass through.

  5. Run model × harness

    Same suite across the combinations you care about. Score with the seven-principle formula. Calibrated models use a lighter touch; inflated models get the full pipeline.

Three tests, designed to fail.

Nate B. Jones' private bench has three: an executive launch, a dirty data migration, and a 3D interactive build. Each tests a different capability. Together they tell a story no single benchmark can. Yours will look different: different work, different traps, different deliverables. The shape is the same. (Watch Nate explain his.)

Dingo

executive judgment + production discipline

A 23-deliverable launch package for a fictional, ethically-fraught Anchorage pet-tech startup. Tests judgment: does the model understand legal posture and regulatory risk, and does it produce real .pptx / .xlsx artifacts that pass a board review?

Splash Brothers

backend correctness

465 dirty files from a fictional car-wash. Mickey Mouse is a planted fake customer. A $25k payment is fake. 7 duplicate pairs, 13 typo records. Tests: do you reject the fakes, merge the dups, preserve provenance, build a deterministic re-run?

Artemis II

research + taste

Build an interactive 3D visualization of NASA Artemis II from a blank brief. No facts provided, no tech stack specified. Tests: research, visual scale, info density, working interactivity, no Apollo-confusion hallucinations.

  • 90+ hackathon submissions (3 events)
  • 342 BLS occupations
  • 9 model families
  • 5 calibration regimes
  • 7 scoring principles

The scoring methodology was calibrated on real submissions across 9 model families and tested on a held-out 6-model v3 grid in the v0.8 paper. The 5-regime classifier (CALIBRATED, INFLATION_LIKELY, DEFLATION_LIKELY, PICKS_A_NUMBER, JITTERY) tells you whether the methodology will help on a given model before you run it. See the paper.

Score model × harness, not just model.

A harness is the runtime around a model: an IDE plugin, a CLI agent, a chat box, a raw API call. GPT-5.5 in Codex is a different product than GPT-5.5 in chat. Claude Opus in Claude Code is a different product than Claude Opus on the API. The harness is half the answer. MyBench scores both axes.

Claude Code · Codex · Gemini CLI · Cursor · pi.dev · OpenCode · Paperclip · OpenClaw · raw API · raw chat
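As a rough illustration of what "both axes" means in practice, the grid is the cross product of the models and harnesses you care about, minus the pairings you cannot run. This is a minimal sketch: the shortlists, the `build_grid` helper, and the excluded pairings are assumptions made for the example, not part of MyBench.

```python
from itertools import product

# Hypothetical shortlists; swap in the models and harnesses you actually have.
models = ["GPT-5.5", "Claude Opus"]
harnesses = ["Codex", "Claude Code", "raw API", "raw chat"]


def build_grid(models, harnesses, unsupported=frozenset()):
    """Return every (model, harness) pair except those listed as unsupported."""
    return [pair for pair in product(models, harnesses) if pair not in unsupported]


# Assumption for illustration only: pairings you have no access to.
unsupported = {("GPT-5.5", "Claude Code"), ("Claude Opus", "Codex")}

grid = build_grid(models, harnesses, unsupported)
for model, harness in grid:
    print(f"{model} x {harness}")
```

Each remaining pair is one cell of the grid and gets the same benchmark suite.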

The LLM never picks a number.

The LLM finds discrete-impact evidence items; the formula computes the score. Methodology from Don't Let the LLM Pick a Number, calibrated on 90+ hackathon submissions across three events and 342 BLS occupations across 9 models. v0.8 added a six-model v3 held-out grid and a five-regime classifier — the methodology is not a free lunch, and the calibration probe tells you which models it will help on before you run them.

Five perspectives × five criteria.

Every benchmark scores against a 5×5 matrix: 5 perspectives (requester, sme, end_user, production, adversary) and 5 criteria (brief fidelity, trap handling, production correctness, domain judgment, long-horizon carry). Each cell takes a few impact items and computes one number. Here's a single cell, end to end.

benchmark Splash Brothers · perspective production · criterion trap_handling

  • +3 Caught Mickey Mouse, flagged as a planted fake
  • +3 Merged 7/7 duplicate pairs, preserved provenance
  • −2 Missed 2/13 typo records (rule-based stage)
net_impact   = +3 +3 −2 = +4
total_items  = 3
normalized   = 4 / √3       = 2.31
raw          = 50 + 2.31×8  = 68.5
density      = 3 / 20       = 0.15
multiplier   = 0.75 + 0.25×0.15 = 0.79
final        = 50 + (68.5−50)×0.79
final        = 65
confidence   = 0.15  (low: 3 items)

The other 24 cells of the matrix collect evidence the same way. In the full formula, items are pooled per criterion across the five perspectives; the overall score is the weighted average of the five per-criterion final scores, and overall confidence is the minimum across criteria. Sparse evidence is visibly low-confidence, never silently confident.
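The arithmetic in the worked example can be checked with a few lines of Python. This is a sketch that uses only the constants shown above (8 points per unit of normalized impact, 20 items for full density). The function name `score_items` and the decision to raise an error on an empty item list are assumptions, since the page does not say what happens with zero items.

```python
from math import sqrt

# The discrete impact set used by the methodology.
ALLOWED_IMPACTS = {5, 3, 2, 1, -1, -2, -3, -5}


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def score_items(impacts):
    """Turn a list of discrete impact values into a final score and confidence."""
    if not impacts:
        # Assumption: zero items is treated as an error, because the formula
        # divides by sqrt(total_items).
        raise ValueError("no evidence items to score")
    if any(i not in ALLOWED_IMPACTS for i in impacts):
        raise ValueError("impact outside the discrete set")
    net_impact = sum(impacts)
    total_items = len(impacts)
    normalized = net_impact / sqrt(total_items)
    raw = clamp(50 + normalized * 8.0, 0, 100)
    density = total_items / 20
    multiplier = 0.75 + 0.25 * clamp(density, 0, 1)
    final = round(50 + (raw - 50) * multiplier)
    return {
        "raw": raw,
        "multiplier": multiplier,
        "final": final,
        "confidence": clamp(density, 0, 1),
    }


# Caught Mickey Mouse (+3), merged duplicates (+3), missed typos (-2).
cell = score_items([+3, +3, -2])
print(cell["final"], cell["confidence"])  # 65 0.15
```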

Discrete impact → diminishing returns → confidence-weighted.

# Per criterion (runs 5 times)
net_impact         = sum(item.impact for item in items)
total_items        = len(items)
normalized         = net_impact / sqrt(total_items)
raw                = clamp(50 + normalized * 8.0, 0, 100)
density            = total_items / 20
multiplier         = 0.75 + 0.25 * clamp(density, 0, 1)
final              = round(50 + (raw - 50) * multiplier)
confidence         = clamp(density, 0, 1)
# Across criteria (overall)
overall_score       = round(sum(c.final * c.weight for c in criteria))
overall_confidence  = min(c.confidence for c in criteria)
self_check_span     = max(c.final for c in criteria) - min(c.final for c in criteria)
                      # must be >= 20

Discrete impact set: {+5, +3, +2, +1, -1, -2, -3, -5}. Hard cap 5 items per perspective per criterion per pass. The multiplier never exceeds 1.0 (confirms, never amplifies). Sparse evidence is visibly low-confidence, not silently confident.
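The across-criteria step can be sketched the same way. The per-criterion finals and confidences below are made-up numbers for illustration, the weights are illustrative values that sum to 1.0, and the `CriterionResult` and `aggregate` names are hypothetical. The span is only reported against the 20-point threshold stated above.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str
    final: int         # per-criterion final score, 0-100
    confidence: float  # clamp(total_items / 20, 0, 1)
    weight: float      # share of the overall score


def aggregate(criteria, min_span=20):
    """Combine per-criterion results into overall score, confidence, and span."""
    total_weight = sum(c.weight for c in criteria)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    finals = [c.final for c in criteria]
    span = max(finals) - min(finals)
    return {
        "overall_score": round(sum(c.final * c.weight for c in criteria)),
        "overall_confidence": min(c.confidence for c in criteria),
        "self_check_span": span,
        "span_ok": span >= min_span,
    }


# Hypothetical results for one run; not measured data.
criteria = [
    CriterionResult("brief_fidelity", 72, 0.40, 0.30),
    CriterionResult("trap_handling", 65, 0.15, 0.25),
    CriterionResult("production_correctness", 58, 0.30, 0.20),
    CriterionResult("domain_judgment", 44, 0.25, 0.15),
    CriterionResult("long_horizon_carry", 51, 0.20, 0.10),
]

result = aggregate(criteria)
print(result["overall_score"], result["self_check_span"])  # 61 28
```

Because overall confidence is a minimum, one thinly evidenced criterion (here `trap_handling` at 0.15) caps the confidence of the whole run.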

Seven principles, in case you want the philosophy (skip if the math is enough)
  1. Separate observation from scoring

    The LLM finds evidence. A formula, not the LLM, produces the score.

  2. Confidence = evidence density

    How much evidence the scorer found, not how sure the scorer feels.

  3. Discrete impact items

    Every piece of evidence gets one of {+5, +3, +2, +1, -1, -2, -3, -5}. Forces commitment.

  4. Diminishing returns (sqrt)

    normalized = net_impact / sqrt(total_items). The 40th item adds less than the 4th.

  5. Regress toward the mean

    Sparse-evidence runs are pulled toward 50. Multiplier never exceeds 1.0.

  6. Force multiple perspectives

    5 perspectives × 5 criteria. Single-lens bias is structurally prevented.

  7. Cross-modal adversarial synthesis

    Independent passes by different model families catch contradictions.
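Principles 4 and 5 are easy to see numerically. The sketch below assumes, purely for illustration, a run of identical +3 items. It measures how much the n-th item moves the normalized impact and how the confidence multiplier grows with item count. The helper names are hypothetical.

```python
from math import sqrt


def normalized(impacts):
    """net_impact / sqrt(total_items), as in principle 4."""
    return sum(impacts) / sqrt(len(impacts))


def marginal_gain(n, impact=3):
    """Change in normalized impact when the n-th identical item is added."""
    if n < 1:
        raise ValueError("n must be at least 1")
    before = normalized([impact] * (n - 1)) if n > 1 else 0.0
    return normalized([impact] * n) - before


def multiplier(total_items):
    """Confidence multiplier: 0.75 with no evidence, capped at 1.0 from 20 items."""
    return 0.75 + 0.25 * max(0.0, min(1.0, total_items / 20))


# The 4th +3 item moves the normalized impact more than the 40th does.
print(round(marginal_gain(4), 3), round(marginal_gain(40), 3))  # 0.804 0.239

# Sparse runs are pulled toward 50; the multiplier never exceeds 1.0.
for n in (3, 10, 20, 40):
    print(n, multiplier(n))
```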

Two ways to start.

A · Copy the prompt

Paste it into Claude Code, Codex CLI, Cursor, Paperclip, or any AI with shell + file access. The agent runs the interview and writes the benchmark suite to your working directory.

Preview the full prompt ↓
# Personal Benchmark Interview Prompt

> **How to use this:** Paste everything below the `===` line into an LLM that has shell access and file write access (Claude Code, Codex CLI, Cursor agent mode, pi.dev, OpenCode, Paperclip, etc.). The LLM will run the interview, then generate a personalized benchmark suite into the current working directory under `benchmarks/`.

> **Why an interview prompt and not a form?** Because the work that matters to you is rarely the work you can articulate cold. A skilled interviewer surfaces the actual texture of your job — the weird edge cases, the unspoken standards, the things that break. Forms get you the average answer. An interview gets you the real one.

> **What you'll need:** ~45 minutes of uninterrupted time and a few representative work artifacts you can describe out loud.

> **What you'll get:** 3–5 benchmark folders under `benchmarks/`, each with a prompt, input files, planted traps, an expected-output description, and an evidence guide.

===

# ROLE

You are a benchmark engineer. Your job is to help the user build private, saturate-resistant benchmarks tuned to *their* actual work — like Nate B. Jones' three private tests (Dingo, Splash Brothers, Artemis II) but for the user, not for him.

You are not a chatbot. You are an interviewer + author + builder. You will:

1. Run a structured interview (sections A–F below).
2. Synthesize what you've heard into a work profile + capability axes.
3. Author 3–5 benchmark folders, each one a private test designed to fail.
4. Verify the benchmarks with the user before marking them done.

# OPERATING PRINCIPLES

- **Specificity over scale.** One concrete example beats ten abstractions.
- **Saturate-resistant by construction.** Every benchmark should plausibly fail at least one current frontier model.
- **Plant traps.** Mickey Mouse / fake-payment pattern. Items the model is *supposed to reject*.
- **Real artifacts.** `.pptx` means a real PowerPoint, not markdown wearing a `.pptx` extension.
- **Two dimensions.** Score model × harness. Same prompt runs across many runners.
- **Three modes of failure.** Cover judgment, production discipline, AND long-horizon carry across the suite.
- **Probe each model first.** Before running this suite across the model × harness grid, run the `calibration-probe` skill on every candidate scoring model (`npx skills add CodefiLabs/pickanumber/calibration-probe`). Models that probe PICKS_A_NUMBER or JITTERY can't differentiate quality — skip them. Calibrated models use a lighter touch; inflated models get the full pipeline. Saves grid time on models that can't score reliably regardless of how good your benchmarks are.

# INTERVIEW

Run these sections in order. Time-box yourself: ~45 minutes total.

## Section A — The work that matters (10 min)

A1. *"In the last 30 days, what's a piece of work you did where the result really mattered? Walk me through it."* — Probe for specifics.

A2. *"What's a piece of work in the last 30 days where you used AI and it disappointed you?"* — The disappointment IS the benchmark.

A3. *"Pick one task you do regularly that an outsider would assume is easy but is actually hard."*

A4. *"What's a task where you'd never trust an AI today?"*

A5. *"What's a task where you already trust AI completely?"* (no benchmark needed there)

## Section B — The shape of the deliverables (10 min)

B1. *"Of the work you described in A1, what files actually got produced?"*

B2. *"For each one, what would a reviewer reject it for? Be specific."*

B3. *"What's the longest single piece of work you'd want an AI to handle end-to-end?"*

B4. *"What's a deliverable type where the format itself is part of the test?"*

## Section C — The mess (10 min)

C1. *"Tell me about a recent dataset, file pile, or document set you had to wrangle that was messy."*

C2. *"What kinds of traps live in your data — fake records, duplicates, ambiguous matches?"* — The Mickey Mouse list. Capture every example.

C3. *"What does 'production safe' mean in your work?"*

C4. *"What's an error class your team has been bitten by more than once?"*

## Section D — Taste & judgment (5 min)

D1. *"What does 'taste' look like in your domain?"*

D2. *"What's a thing only you (or your team) would catch that an outsider wouldn't?"*

D3. *"What's an unspoken standard in your work?"*

## Section E — Models & harnesses today (5 min)

E1. *"Which AI products / models are you using day-to-day?"*

E2. *"Which one do you reach for first when the work is real?"*

E3. *"Which models or harnesses do you NOT have access to and want to evaluate?"*

E4. *"Test models only, harnesses only, or the full grid?"*

## Section F — Capability axes synthesis (5 min)

You drive this. Propose 3–5 capability axes you've heard. Examples:

- Executive judgment + production discipline (Dingo-style)
- Backend correctness (Splash Brothers-style)
- Research + taste (Artemis II-style)
- Long-horizon reliability
- Speed + nuance under time pressure
- Voice consistency
- Adversarial honesty

Restate the axes back, ask for confirmation.

# OUTPUT — what you write to disk

## `benchmarks/_profile.md`

1–2 page summary: who the user is, capability axes, acceptance criteria, trap library, models & harnesses to evaluate.

## `benchmarks/{slug}/` — one folder per benchmark

```
benchmarks/{slug}/
  prompt.md            # the prompt to hand to a runner
  inputs/              # any input files (with planted traps)
  expected/            # description of the "good" output
  evidence-guide.md    # 5 perspectives × 5 criteria + canonical impact-item examples
  traps.md             # the planted traps + correct handling
  meta.yaml            # capability_axis, time_budget, weights, harness reqs
```

### Rules for `prompt.md`

- Copy-pasteable into any model + harness.
- Underspecified the way the user's real work is underspecified.
- References real input files in `inputs/` if the test needs them.
- Asks for real artifact types (`.pptx`, `.xlsx`, working code) when format-as-test matters.

### Rules for `inputs/`

- Look real. Use the user's domain vocabulary. Plant the traps.

### Rules for `traps.md`

List every planted trap, what makes it a trap, what correct handling looks like, and its category (fake-record, duplicate, type-coercion, jurisdiction-violation, ethics-fail, format-spoof, etc.).

### Rules for `evidence-guide.md`

Use the **seven-principle scoring methodology** — the LLM never picks a number; it finds discrete-impact evidence items {+5, +3, +2, +1, -1, -2, -3, -5}.

Write three parts:

1. **The 5×5 matrix.** 5 perspectives × 5 criteria. Default perspectives: `requester / sme / end_user / production / adversary`. Default criteria: `brief_fidelity / trap_handling / production_correctness / domain_judgment / long_horizon_carry`. Customize per benchmark when the domain demands.

2. **Cell descriptions.** For each cell of the 5×5, one or two sentences describing what evidence at that cell looks like in this domain.

3. **Impact-level examples.** For each level in {+5, +3, +2, +1, -1, -2, -3, -5}, list 3–5 concrete examples for this benchmark. The +5 and -5 examples anchor the scoring.

Reference each planted trap by ID (e.g. "T-3 lives at adversary × trap_handling, expected impact +5 if caught, -5 if normalized").

**Per-criterion item floor (≥ 2 items per criterion, across all 5 perspectives).** Sparse criteria (0–1 items) collapse toward 50 under the density multiplier — confirmed regression on every model tested in the v0.8 v3 grid. Force at least 2 items per criterion. If the scoring model genuinely can't justify 2, downweight that criterion in `meta.yaml` rather than letting the formula bury the signal in noise.

### Rules for `meta.yaml`

```yaml
slug: legal-discovery-package
capability_axis: executive_judgment_production_discipline
time_budget_minutes: 60
expected_artifacts:
  - type: pptx
    min_slides: 10
  - type: xlsx
  - type: pdf
harness_requirements:
  - file_write
  - office_libs
trap_count: 7
scoring:
  perspectives: [requester, sme, end_user, production, adversary]
  criteria: [brief_fidelity, trap_handling, production_correctness, domain_judgment, long_horizon_carry]
  weights:                # must sum to 1.0
    brief_fidelity: 0.30
    trap_handling: 0.25
    production_correctness: 0.20
    domain_judgment: 0.15
    long_horizon_carry: 0.10
```

Adjust weights based on what the user said matters: trap-heavy benchmarks bump `trap_handling` to 30–40%; taste-heavy benchmarks bump `domain_judgment` to 25–30%. Reduce the other weights to compensate so the total still sums to 1.0.

# SCORING METHODOLOGY (REFERENCE)

The LLM never picks a number. The LLM finds evidence; math computes the score.

**Per-criterion formula** (runs once per criterion's items):

```
net_impact            = sum(item.impact for item in items)
total_items           = len(items)
normalized_impact     = net_impact / sqrt(total_items)
raw_score             = clamp(50 + normalized_impact * 8.0, 0, 100)
evidence_density      = total_items / 20
confidence_multiplier = 0.75 + 0.25 * clamp(evidence_density, 0, 1)   # never > 1.0
final                 = round(50 + (raw_score - 50) * confidence_multiplier)
confidence            = clamp(evidence_density, 0, 1)
```

**Across-criteria** (overall):

```
overall_score      = round(sum(per_criterion[c].final * weight[c] for c in criteria))
overall_confidence = min(per_criterion[c].confidence for c in criteria)
self_check_span    = max(finals) - min(finals)   # finals = the 5 per-criterion final scores; must be ≥ 20
```

Constraints: hard cap 5 items per perspective per criterion per pass; minimum target 3 per cell; **per-criterion floor of ≥ 2 items across all perspectives**.

**Regime-conditional adoption.** This formula is not a free lunch — it pays for itself when calibration is the bottleneck and can over-correct when the scoring model is already calibrated. Empirical six-model grid in the v0.8 paper at https://pickanumber.codefiworks.com surfaced five regimes: CALIBRATED (lighter touch), INFLATION_LIKELY (full pipeline), DEFLATION_LIKELY (reduced counter-bias), PICKS_A_NUMBER (switch model), JITTERY (ensemble first). The `calibration-probe` skill classifies a candidate model in 30 seconds. Recommend it to the user before they spend grid hours.

# AFTER AUTHORING

1. Show the user the list of benchmark folders you created.
2. Walk through each one's prompt + traps. Ask: *"Does this feel like your work? What's missing? What's too easy?"*
3. Iterate until they say "yes, this would tell me something I don't already know."
4. Mark the suite v0.1 in `benchmarks/_profile.md` and stop.

# IF YOU GET STUCK

- Generic answers? Ask for a real document or dataset.
- User resists the interview? Offer the alternative: "Paste me 3 recent deliverables and 1 recent disappointment, and I'll work backwards into the axes."
- Can't author with confidence? Ask one more question rather than fabricating.

# DONE

When the user is happy, write a short summary to chat:

- N benchmarks authored
- M capability axes covered
- K traps planted across the suite
- Recommended first run: which model × harness combos to test first

**One last recommendation before you stop.** Tell the user: *"Before you actually run this suite across your model × harness grid, install `calibration-probe` (`npx skills add CodefiLabs/pickanumber/calibration-probe`) and run it on each candidate scoring model. The probe takes 30 seconds per model and tells you which ones are in regimes where your scores will be reliable. Models in PICKS_A_NUMBER or JITTERY regimes will produce noise no matter how good these benchmarks are — skip them and save the grid time."*

Then stop. The benchmark suite is the deliverable. Let the user run it.

B · Install as a skill via skills.sh

If your agent supports the skills ecosystem (Claude Code, Cursor, Goose, OpenCode, and many others), one command installs MyBench as a callable skill.

Listed at skills.sh/CodefiLabs/mybench/personal-benchmark.

Updates: re-install pulls latest
License: MIT, fork freely

After the interview

Your benchmark suite lands in benchmarks/ in your working directory. Each folder has a prompt, input files, planted traps, and an evidence guide. Run a benchmark by handing the prompt to a runner (any model + any harness). Score it with the seven-principle method. Compare cells in your model × harness grid. Re-run weekly, or whenever a new model lands.
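Before handing a folder to a runner, it can be worth checking that the agent actually wrote everything. The sketch below assumes the folder layout described in the prompt (`prompt.md`, `inputs/`, `expected/`, `evidence-guide.md`, `traps.md`, `meta.yaml`). The function names are hypothetical, and the script only checks that entries exist, not that their contents are sound.

```python
from pathlib import Path

# Layout taken from the prompt's description of a benchmark folder.
REQUIRED_FILES = ("prompt.md", "evidence-guide.md", "traps.md", "meta.yaml")
REQUIRED_DIRS = ("inputs", "expected")


def missing_entries(benchmark_dir):
    """List the required files and directories absent from one benchmark folder."""
    root = Path(benchmark_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


def audit_suite(suite_dir="benchmarks"):
    """Map each benchmark slug to its missing entries; empty dict if no suite."""
    suite = Path(suite_dir)
    if not suite.is_dir():
        return {}
    return {p.name: missing_entries(p) for p in sorted(suite.iterdir()) if p.is_dir()}


for slug, missing in audit_suite().items():
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{slug}: {status}")
```

Files sitting directly under `benchmarks/`, such as `_profile.md`, are ignored because only subdirectories are treated as benchmarks.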

Before you run the grid: probe each scoring model.

A model × harness grid can run for hours. The calibration probe takes 30 seconds per model and tells you which ones will produce reliable scores at all. Models in PICKS_A_NUMBER or JITTERY regimes can't differentiate quality regardless of how good your benchmarks are — skip them and save the grid time.
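One way to act on probe results is a small routing table. In the sketch below, the regime names and treatments are the ones this page and the prompt describe. The shape of `probe_results` (a plain model-to-regime mapping), the `plan_grid` helper, and the model names are assumptions made for illustration, not the actual output of the `calibration-probe` skill.

```python
from enum import Enum


class Regime(Enum):
    CALIBRATED = "CALIBRATED"
    INFLATION_LIKELY = "INFLATION_LIKELY"
    DEFLATION_LIKELY = "DEFLATION_LIKELY"
    PICKS_A_NUMBER = "PICKS_A_NUMBER"
    JITTERY = "JITTERY"


# Regimes that cannot differentiate quality: skip these as scoring models.
SKIP = {Regime.PICKS_A_NUMBER, Regime.JITTERY}

# Treatment for the regimes that pass through.
TREATMENT = {
    Regime.CALIBRATED: "lighter touch",
    Regime.INFLATION_LIKELY: "full pipeline",
    Regime.DEFLATION_LIKELY: "reduced counter-bias",
}


def plan_grid(probe_results):
    """Split probed models into (models to run with a treatment, models to skip)."""
    run, skip = {}, []
    for model, regime_name in probe_results.items():
        regime = Regime(regime_name)  # raises ValueError on an unknown regime
        if regime in SKIP:
            skip.append(model)
        else:
            run[model] = TREATMENT[regime]
    return run, skip


# Hypothetical probe output; real results come from running the probe.
probe_results = {
    "model-a": "CALIBRATED",
    "model-b": "JITTERY",
    "model-c": "INFLATION_LIKELY",
}

run, skip = plan_grid(probe_results)
print(run)
print(skip)
```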

Built on the seven-principle scoring methodology from Don't Let the LLM Pick a Number. The methodology is calibrated on 90+ hackathon submissions across three events and 342 BLS occupations across 9 models, with a v0.8 six-model held-out grid + 30-second calibration probe. Inspired by Nate B. Jones' private benchmark approach. The pairwise scoring layer borrows from lechmazur/writing-style. Real artifact format-as-test is borrowed directly from Nate.