
After Nate B. Jones' private bench: Dingo, Splash Brothers, Artemis II.

Stop reading model reviews.
Build your own benchmark.

Public benchmarks tell you which model is best at average tasks. They saturate fast, and they don't tell you which one to reach for when your messy work hits your desk on Tuesday. MyBench is a 45-minute interview that turns your actual work into a private benchmark suite, then runs it across model × harness combos so you know what to use, and when.

Every model release is a swarm of percentage points.

87.3 vs 67.0 vs 49.8. The numbers come from public benchmarks the labs train against. By the time the leaderboard publishes, the test is half-saturated.

None of it tells you which one to reach for.

Easy benchmarks make all frontier models look interchangeable. The differences only show up on real work: underspecified briefs, messy files, traps in the data, ugly judgment calls.

A private benchmark, tuned to your work.

Three to five tests designed to fail the current frontier. Real artifacts. Planted traps. Re-run weekly when a new model ships. Then you know.

One prompt. Forty-five minutes. A benchmark suite that's yours.

1. **Copy the prompt.** Or install it as a skill. Hand it to any AI agent with shell access: Claude Code, Codex, Cursor, Paperclip.

2. **Answer the interview.** Six sections covering your real work, your messy data, your taste, your standards. ~45 min.

3. **Get a benchmark suite.** Three to five tests with prompts, input files, planted traps, and an evidence guide for scoring.

4. **Run model × harness.** Same suite across the combinations you care about. Score using the seven-principle formula.

Three tests, designed to fail.

Nate B. Jones' private bench has three: an executive launch, a dirty data migration, and a 3D interactive build. Each tests a different capability. Together they tell a story no single benchmark can. Yours will look different: different work, different traps, different deliverables. The shape is the same. (Watch Nate explain his.)

Dingo

executive judgment + production discipline

A 23-deliverable launch package for a fictional, ethically fraught Anchorage pet-tech startup. Tests judgment: does the model grasp legal posture and regulatory risk, and produce real .pptx / .xlsx artifacts that would pass a board review?

Splash Brothers

backend correctness

465 dirty files from a fictional car wash. Mickey Mouse is a planted fake customer. A $25k payment is fake. 7 duplicate pairs, 13 typo records. Tests: do you reject the fakes, merge the dups, preserve provenance, and build a deterministic re-run?

Artemis II

research + taste

Build an interactive 3D visualization of NASA Artemis II from a blank brief. No facts provided, no tech stack specified. Tests: research, visual scale, info density, working interactivity, no Apollo-confusion hallucinations.

18 hackathon submissions · 342 BLS occupations · 9 model families · 7 scoring principles

The scoring methodology was calibrated on real submissions across 9 model families before being reused here. See the paper.

Score model × harness, not just model.

A harness is the runtime around a model: an IDE plugin, a CLI agent, a chat box, a raw API call. GPT-5.5 in Codex is a different product than GPT-5.5 in chat. Claude Opus in Claude Code is a different product than Claude Opus on the API. The harness is half the answer. MyBench scores both axes.

Claude Code · Codex · Gemini CLI · Cursor · pi.dev · OpenCode · Paperclip · OpenClaw · raw API · raw chat
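The grid itself is nothing fancy. A minimal sketch of enumerating it, where the model and harness names are placeholders for whatever you actually have access to:

```python
from itertools import product

# Placeholder names: substitute the models and harnesses you actually use.
models = ["claude-opus", "gpt", "gemini"]
harnesses = ["claude-code", "codex-cli", "cursor", "raw-api", "raw-chat"]

# Each (model, harness) pair is a distinct product; the same suite runs in every cell.
runs = list(product(models, harnesses))
print(len(runs), "cells to score")  # 3 models x 5 harnesses = 15 cells
```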

The LLM never picks a number.

The LLM finds discrete-impact evidence items; the formula computes the score. Methodology from Don't Let the LLM Pick a Number, calibrated on 18 hackathon submissions and 342 BLS occupations across 9 models. Reused here unchanged.

Five perspectives × five criteria.

Every benchmark scores against a 5×5 matrix: 5 perspectives (requester, sme, end_user, production, adversary) and 5 criteria (brief fidelity, trap handling, production correctness, domain judgment, long-horizon carry). Each cell takes a few impact items and computes one number. Here's a single cell, end to end.

benchmark Splash Brothers · perspective production · criterion trap_handling

  • +3 Caught Mickey Mouse, flagged as a planted fake
  • +3 Merged 7/7 duplicate pairs, preserved provenance
  • −2 Missed 2/13 typo records (rule-based stage)
net_impact   = +3 +3 −2 = +4
total_items  = 3
normalized   = 4 / √3       = 2.31
raw          = 50 + 2.31×8  = 68.5
density      = 3 / 20       = 0.15
multiplier   = 0.75 + 0.25×0.15 = 0.79
final        = 50 + (68.5−50)×0.79
final        = 65
confidence   = 0.15  (low: 3 items)

The other 24 cells of the matrix get the same treatment. The overall score is the weighted average of the five per-criterion finals; overall confidence is the minimum across criteria. Sparse evidence is visibly low-confidence, never silently confident.

Discrete impact → diminishing returns → confidence-weighted.

# Per criterion (runs 5 times)
net_impact         = sum(item.impact for item in items)
total_items        = len(items)
normalized         = net_impact / sqrt(total_items)
raw                = clamp(50 + normalized * 8.0, 0, 100)
density            = total_items / 20
multiplier         = 0.75 + 0.25 * clamp(density, 0, 1)
final              = round(50 + (raw - 50) * multiplier)
confidence         = clamp(density, 0, 1)
# Across criteria (overall)
overall_score       = round(sum(c.final * c.weight for c in criteria))
overall_confidence  = min(c.confidence for c in criteria)
self_check_span     = max(c.final for c in criteria) - min(c.final for c in criteria)
                      # must be >= 20

Discrete impact set: {+5, +3, +2, +1, -1, -2, -3, -5}. Hard cap: 5 items per perspective per criterion per pass. The multiplier never exceeds 1.0 (it confirms, never amplifies).
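If you want to check the arithmetic yourself, here is a minimal runnable Python version of the per-criterion formula (the function name is ours, not part of the method). Fed the three Splash Brothers items above, it reproduces final = 65 and confidence = 0.15:

```python
from math import sqrt

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def score_cell(impacts):
    """Per-criterion formula. impacts: discrete evidence items, e.g. [+3, +3, -2]."""
    net_impact = sum(impacts)
    normalized = net_impact / sqrt(len(impacts))        # diminishing returns
    raw = clamp(50 + normalized * 8.0, 0, 100)
    density = len(impacts) / 20                          # evidence density
    multiplier = 0.75 + 0.25 * clamp(density, 0, 1)      # confirms, never amplifies
    final = round(50 + (raw - 50) * multiplier)          # regress toward 50
    confidence = clamp(density, 0, 1)
    return final, confidence

print(score_cell([+3, +3, -2]))  # -> (65, 0.15)
```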

Seven principles, in case you want the philosophy (skip if the math is enough)
1. **Separate observation from scoring.** The LLM finds evidence. A formula, not the LLM, produces the score.

2. **Confidence = evidence density.** How much evidence the scorer found, not how sure the scorer feels.

3. **Discrete impact items.** Every piece of evidence gets one of {+5, +3, +2, +1, -1, -2, -3, -5}. Forces commitment.

4. **Diminishing returns (sqrt).** normalized = net_impact / sqrt(total_items). The 40th item adds less than the 4th.

5. **Regress toward the mean.** Sparse-evidence runs are pulled toward 50. The multiplier never exceeds 1.0.

6. **Force multiple perspectives.** 5 perspectives × 5 criteria. Single-lens bias is structurally prevented.

7. **Cross-modal adversarial synthesis.** Independent passes by different model families catch contradictions.

Two ways to start.

A · Copy the prompt

Paste it into Claude Code, Codex CLI, Cursor, Paperclip, or any AI with shell + file access. The agent runs the interview and writes the benchmark suite to your working directory.

Preview the full prompt ↓
# Personal Benchmark Interview Prompt

> **How to use this:** Paste everything below the `===` line into an LLM that has shell access and file write access (Claude Code, Codex CLI, Cursor agent mode, pi.dev, OpenCode, Paperclip, etc.). The LLM will run the interview, then generate a personalized benchmark suite into the current working directory under `benchmarks/`.

> **Why an interview prompt and not a form?** Because the work that matters to you is rarely the work you can articulate cold. A skilled interviewer surfaces the actual texture of your job — the weird edge cases, the unspoken standards, the things that break. Forms get you the average answer. An interview gets you the real one.

> **What you'll need:** ~45 quiet minutes and a few representative work artifacts you can describe out loud.

> **What you'll get:** 3–5 benchmark folders under `benchmarks/`, each with a prompt, input files, planted traps, an expected-output description, and an evidence guide.

===

# ROLE

You are a benchmark engineer. Your job is to help the user build private, saturate-resistant benchmarks tuned to *their* actual work — like Nate B. Jones' three private tests (Dingo, Splash Brothers, Artemis II) but for the user, not for him.

You are not a chatbot. You are an interviewer + author + builder. You will:

1. Run a structured interview (sections A–F below).
2. Synthesize what you've heard into a work profile + capability axes.
3. Author 3–5 benchmark folders, each one a private test designed to fail.
4. Verify the benchmarks with the user before marking them done.

# OPERATING PRINCIPLES

- **Specificity over scale.** One concrete example beats ten abstractions.
- **Saturate-resistant by construction.** Every benchmark should plausibly fail at least one current frontier model.
- **Plant traps.** Mickey Mouse / fake-payment pattern. Items the model is *supposed to reject*.
- **Real artifacts.** `.pptx` means a real PowerPoint, not markdown wearing a `.pptx` extension.
- **Two dimensions.** Score model × harness. Same prompt runs across many runners.
- **Three modes of failure.** Cover judgment, production discipline, AND long-horizon carry across the suite.

# INTERVIEW

Run these sections in order. Time-box yourself: ~45 minutes total.

## Section A — The work that matters (10 min)

A1. *"In the last 30 days, what's a piece of work you did where the result really mattered? Walk me through it."* — Probe for specifics.

A2. *"What's a piece of work in the last 30 days where you used AI and it disappointed you?"* — The disappointment IS the benchmark.

A3. *"Pick one task you do regularly that an outsider would assume is easy but is actually hard."*

A4. *"What's a task where you'd never trust an AI today?"*

A5. *"What's a task where you already trust AI completely?"* (no benchmark needed there)

## Section B — The shape of the deliverables (10 min)

B1. *"Of the work you described in A1, what files actually got produced?"*

B2. *"For each one, what would a reviewer reject it for? Be specific."*

B3. *"What's the longest single piece of work you'd want an AI to handle end-to-end?"*

B4. *"What's a deliverable type where the format itself is part of the test?"*

## Section C — The mess (10 min)

C1. *"Tell me about a recent dataset, file pile, or document set you had to wrangle that was messy."*

C2. *"What kinds of traps live in your data — fake records, duplicates, ambiguous matches?"* — The Mickey Mouse list. Capture every example.

C3. *"What does 'production safe' mean in your work?"*

C4. *"What's an error class your team has been bitten by more than once?"*

## Section D — Taste & judgment (5 min)

D1. *"What does 'taste' look like in your domain?"*

D2. *"What's a thing only you (or your team) would catch that an outsider wouldn't?"*

D3. *"What's an unspoken standard in your work?"*

## Section E — Models & harnesses today (5 min)

E1. *"Which AI products / models are you using day-to-day?"*

E2. *"Which one do you reach for first when the work is real?"*

E3. *"Which models or harnesses do you NOT have access to and want to evaluate?"*

E4. *"Test models only, harnesses only, or the full grid?"*

## Section F — Capability axes synthesis (5 min)

You drive this. Propose 3–5 capability axes based on what you've heard. Examples:

- Executive judgment + production discipline (Dingo-style)
- Backend correctness (Splash Brothers-style)
- Research + taste (Artemis II-style)
- Long-horizon reliability
- Speed + nuance under time pressure
- Voice consistency
- Adversarial honesty

Restate the axes back, ask for confirmation.

# OUTPUT — what you write to disk

## `benchmarks/_profile.md`

1–2 page summary: who the user is, capability axes, acceptance criteria, trap library, models & harnesses to evaluate.

## `benchmarks/{slug}/` — one folder per benchmark

```
benchmarks/{slug}/
  prompt.md            # the prompt to hand to a runner
  inputs/              # any input files (with planted traps)
  expected/            # description of the "good" output
  evidence-guide.md    # 5 perspectives × 5 criteria + canonical impact-item examples
  traps.md             # the planted traps + correct handling
  meta.yaml            # capability_axis, time_budget, weights, harness reqs
```

### Rules for `prompt.md`

- Copy-pasteable into any model + harness.
- Underspecified the way the user's real work is underspecified.
- References real input files in `inputs/` if the test needs them.
- Asks for real artifact types (`.pptx`, `.xlsx`, working code) when format-as-test matters.

### Rules for `inputs/`

- Look real. Use the user's domain vocabulary. Plant the traps.

### Rules for `traps.md`

List every planted trap, what makes it a trap, what correct handling looks like, and its category (fake-record, duplicate, type-coercion, jurisdiction-violation, ethics-fail, format-spoof, etc.).

### Rules for `evidence-guide.md`

Use the **seven-principle scoring methodology** — the LLM never picks a number; it finds discrete-impact evidence items {+5, +3, +2, +1, -1, -2, -3, -5}.

Write three parts:

1. **The 5×5 matrix.** 5 perspectives × 5 criteria. Default perspectives: `requester / sme / end_user / production / adversary`. Default criteria: `brief_fidelity / trap_handling / production_correctness / domain_judgment / long_horizon_carry`. Customize per benchmark when the domain demands.

2. **Cell descriptions.** For each cell of the 5×5, one or two sentences describing what evidence at that cell looks like in this domain.

3. **Impact-level examples.** For each level in {+5, +3, +2, +1, -1, -2, -3, -5}, list 3–5 concrete examples for this benchmark. The +5 and -5 examples anchor the scoring.

Reference each planted trap by ID (e.g. "T-3 lives at adversary × trap_handling, expected impact +5 if caught, -5 if normalized").

### Rules for `meta.yaml`

```yaml
slug: legal-discovery-package
capability_axis: executive_judgment_production_discipline
time_budget_minutes: 60
expected_artifacts:
  - type: pptx
    min_slides: 10
  - type: xlsx
  - type: pdf
harness_requirements:
  - file_write
  - office_libs
trap_count: 7
scoring:
  perspectives: [requester, sme, end_user, production, adversary]
  criteria: [brief_fidelity, trap_handling, production_correctness, domain_judgment, long_horizon_carry]
  weights:                # must sum to 1.0
    brief_fidelity: 0.30
    trap_handling: 0.25
    production_correctness: 0.20
    domain_judgment: 0.15
    long_horizon_carry: 0.10
```

Adjust weights based on what the user said matters: trap-heavy benchmarks bump `trap_handling` to 30–40%; taste-heavy benchmarks bump `domain_judgment` to 25–30%.
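A quick sanity check for the weights before marking a benchmark done (a sketch, assuming PyYAML is available; adjust the path to the actual slug):

```python
import yaml  # PyYAML

with open("benchmarks/legal-discovery-package/meta.yaml") as f:
    meta = yaml.safe_load(f)

# Per-criterion weights must sum to 1.0, per the meta.yaml spec above.
weights = meta["scoring"]["weights"]
assert abs(sum(weights.values()) - 1.0) < 1e-9, f"weights sum to {sum(weights.values())}"
```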

# SCORING METHODOLOGY (REFERENCE)

The LLM never picks a number. The LLM finds evidence; math computes the score.

**Per-criterion formula** (runs once per criterion's items):

```
net_impact            = sum(item.impact for item in items)
total_items           = len(items)
normalized_impact     = net_impact / sqrt(total_items)
raw_score             = clamp(50 + normalized_impact * 8.0, 0, 100)
evidence_density      = total_items / 20
confidence_multiplier = 0.75 + 0.25 * clamp(evidence_density, 0, 1)   # never > 1.0
final                 = round(50 + (raw_score - 50) * confidence_multiplier)
confidence            = clamp(evidence_density, 0, 1)
```

**Across-criteria** (overall):

```
overall_score      = round(sum(per_criterion[c].final * weight[c] for c in criteria))
overall_confidence = min(per_criterion[c].confidence for c in criteria)
self_check_span    = max - min across the 5 final scores  (must be ≥ 20)
```

Constraints: hard cap 5 items per perspective per criterion per pass; minimum target 3.
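A runnable sketch of the across-criteria step, in case the evidence guide ships with code (the data shapes here are illustrative, not prescribed):

```python
def overall(finals, confidences, weights):
    """finals, confidences, weights: dicts keyed by criterion name.
    Weights must sum to 1.0 (see meta.yaml)."""
    overall_score = round(sum(finals[c] * weights[c] for c in weights))
    overall_confidence = min(confidences.values())
    self_check_span = max(finals.values()) - min(finals.values())
    if self_check_span < 20:
        raise ValueError("criterion finals too uniform; re-examine the evidence")
    return overall_score, overall_confidence
```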

# AFTER AUTHORING

1. Show the user the list of benchmark folders you created.
2. Walk through each one's prompt + traps. Ask: *"Does this feel like your work? What's missing? What's too easy?"*
3. Iterate until they say "yes, this would tell me something I don't already know."
4. Mark the suite v0.1 in `benchmarks/_profile.md` and stop.

# IF YOU GET STUCK

- Generic answers? Ask for a real document or dataset.
- User resists the interview? Offer the alternative: "Paste me 3 recent deliverables and 1 recent disappointment, and I'll work backwards into the axes."
- Can't author with confidence? Ask one more question rather than fabricating.

# DONE

When the user is happy, write a short summary to chat:

- N benchmarks authored
- M capability axes covered
- K traps planted across the suite
- Recommended first run: which model × harness combos to test first

Then stop. The benchmark suite is the deliverable. Let the user run it.

B · Install as a skill via skills.sh

If your agent supports the skills ecosystem (Claude Code, Cursor, Goose, OpenCode, and many others), one command installs MyBench as a callable skill.

Listed at skills.sh/CodefiLabs/mybench/personal-benchmark.

Updates: re-installing pulls the latest. License: MIT, fork freely.

After the interview

Your benchmark suite lands in benchmarks/ in your working directory. Each folder has a prompt, input files, planted traps, and an evidence guide. Run a benchmark by handing the prompt to a runner (any model + any harness). Score it with the seven-principle method. Compare cells in your model × harness grid. Re-run weekly when a new model lands.
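Nothing here prescribes tooling for the comparison step. One way to eyeball the grid, assuming you log each run to a hypothetical results.csv with benchmark, model, harness, and score columns:

```python
import csv

# Hypothetical log: one row per run of one benchmark in your grid.
best = {}
with open("results.csv") as f:
    for row in csv.DictReader(f):
        score = float(row["score"])
        if row["benchmark"] not in best or score > best[row["benchmark"]][0]:
            best[row["benchmark"]] = (score, row["model"], row["harness"])

for benchmark, (score, model, harness) in sorted(best.items()):
    print(f"{benchmark}: best so far is {model} × {harness} at {score:.0f}")
```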

Built on the seven-principle scoring methodology from Don't Let the LLM Pick a Number. The methodology is calibrated on 18 hackathon submissions and 342 BLS occupations across 9 models. Inspired by Nate B. Jones' private benchmark approach. The pairwise scoring layer borrows from lechmazur/writing-style. Real artifact format-as-test is borrowed directly from Nate.