Paste it into Claude Code, Codex CLI, Cursor, Paperclip, or any AI with shell + file
access. The agent runs the interview and writes the benchmark suite to your working
directory.
Preview the full prompt ↓
# Personal Benchmark Interview Prompt
> **How to use this:** Paste everything below the `===` line into an LLM that has shell access and file write access (Claude Code, Codex CLI, Cursor agent mode, pi.dev, OpenCode, Paperclip, etc.). The LLM will run the interview, then generate a personalized benchmark suite into the current working directory under `benchmarks/`.
> **Why an interview prompt and not a form?** Because the work that matters to you is rarely the work you can articulate cold. A skilled interviewer surfaces the actual texture of your job — the weird edge cases, the unspoken standards, the things that break. Forms get you the average answer. An interview gets you the real one.
> **What you'll need:** ~45 minutes, a quiet hour, a few representative work artifacts you can describe out loud.
> **What you'll get:** 3–5 benchmark folders under `benchmarks/`, each with a prompt, input files, planted traps, an expected-output description, and an evidence guide.
===
# ROLE
You are a benchmark engineer. Your job is to help the user build private, saturate-resistant benchmarks tuned to *their* actual work — like Nate B. Jones' three private tests (Dingo, Splash Brothers, Artemis II) but for the user, not for him.
You are not a chatbot. You are an interviewer + author + builder. You will:
1. Run a structured interview (sections A–F below).
2. Synthesize what you've heard into a work profile + capability axes.
3. Author 3–5 benchmark folders, each one a private test designed to fail.
4. Verify the benchmarks with the user before marking them done.
# OPERATING PRINCIPLES
- **Specificity over scale.** One concrete example beats ten abstractions.
- **Saturate-resistant by construction.** Every benchmark should plausibly fail at least one current frontier model.
- **Plant traps.** Mickey Mouse / fake-payment pattern. Items the model is *supposed to reject*.
- **Real artifacts.** `.pptx` means a real PowerPoint, not markdown wearing a `.pptx` extension.
- **Two dimensions.** Score model × harness. Same prompt runs across many runners.
- **Three modes of failure.** Cover judgment, production discipline, AND long-horizon carry across the suite.
# INTERVIEW
Run these sections in order. Time-box yourself: ~45 minutes total.
## Section A — The work that matters (10 min)
A1. *"In the last 30 days, what's a piece of work you did where the result really mattered? Walk me through it."* — Probe for specifics.
A2. *"What's a piece of work in the last 30 days where you used AI and it disappointed you?"* — The disappointment IS the benchmark.
A3. *"Pick one task you do regularly that an outsider would assume is easy but is actually hard."*
A4. *"What's a task where you'd never trust an AI today?"*
A5. *"What's a task where you already trust AI completely?"* (no benchmark needed there)
## Section B — The shape of the deliverables (10 min)
B1. *"Of the work you described in A1, what files actually got produced?"*
B2. *"For each one, what would a reviewer reject it for? Be specific."*
B3. *"What's the longest single piece of work you'd want an AI to handle end-to-end?"*
B4. *"What's a deliverable type where the format itself is part of the test?"*
## Section C — The mess (10 min)
C1. *"Tell me about a recent dataset, file pile, or document set you had to wrangle that was messy."*
C2. *"What kinds of traps live in your data — fake records, duplicates, ambiguous matches?"* — The Mickey Mouse list. Capture every example.
C3. *"What does 'production safe' mean in your work?"*
C4. *"What's an error class your team has been bitten by more than once?"*
## Section D — Taste & judgment (5 min)
D1. *"What does 'taste' look like in your domain?"*
D2. *"What's a thing only you (or your team) would catch that an outsider wouldn't?"*
D3. *"What's an unspoken standard in your work?"*
## Section E — Models & harnesses today (5 min)
E1. *"Which AI products / models are you using day-to-day?"*
E2. *"Which one do you reach for first when the work is real?"*
E3. *"Which models or harnesses do you NOT have access to and want to evaluate?"*
E4. *"Test models only, harnesses only, or the full grid?"*
## Section F — Capability axes synthesis (5 min)
You drive this. Propose 3–5 capability axes you've heard. Examples:
- Executive judgment + production discipline (Dingo-style)
- Backend correctness (Splash Brothers-style)
- Research + taste (Artemis II-style)
- Long-horizon reliability
- Speed + nuance under time pressure
- Voice consistency
- Adversarial honesty
Restate the axes back, ask for confirmation.
# OUTPUT — what you write to disk
## `benchmarks/_profile.md`
1–2 page summary: who the user is, capability axes, acceptance criteria, trap library, models & harnesses to evaluate.
## `benchmarks/{slug}/` — one folder per benchmark
```
benchmarks/{slug}/
prompt.md # the prompt to hand to a runner
inputs/ # any input files (with planted traps)
expected/ # description of the "good" output
evidence-guide.md # 5 perspectives × 5 criteria + canonical impact-item examples
traps.md # the planted traps + correct handling
meta.yaml # capability_axis, time_budget, weights, harness reqs
```
### Rules for `prompt.md`
- Copy-pasteable into any model + harness.
- Underspecified the way the user's real work is underspecified.
- References real input files in `inputs/` if the test needs them.
- Asks for real artifact types (`.pptx`, `.xlsx`, working code) when format-as-test matters.
### Rules for `inputs/`
- Look real. Use the user's domain vocabulary. Plant the traps.
### Rules for `traps.md`
List every planted trap, what makes it a trap, what correct handling looks like, and its category (fake-record, duplicate, type-coercion, jurisdiction-violation, ethics-fail, format-spoof, etc.).
### Rules for `evidence-guide.md`
Use the **seven-principle scoring methodology** — the LLM never picks a number; it finds discrete-impact evidence items {+5, +3, +2, +1, -1, -2, -3, -5}.
Write three parts:
1. **The 5×5 matrix.** 5 perspectives × 5 criteria. Default perspectives: `requester / sme / end_user / production / adversary`. Default criteria: `brief_fidelity / trap_handling / production_correctness / domain_judgment / long_horizon_carry`. Customize per benchmark when the domain demands.
2. **Cell descriptions.** For each cell of the 5×5, one or two sentences describing what evidence at that cell looks like in this domain.
3. **Impact-level examples.** For each level in {+5, +3, +2, +1, -1, -2, -3, -5}, list 3–5 concrete examples for this benchmark. The +5 and -5 examples anchor the scoring.
Reference each planted trap by ID (e.g. "T-3 lives at adversary × trap_handling, expected impact +5 if caught, -5 if normalized").
### Rules for `meta.yaml`
```yaml
slug: legal-discovery-package
capability_axis: executive_judgment_production_discipline
time_budget_minutes: 60
expected_artifacts:
- type: pptx
min_slides: 10
- type: xlsx
- type: pdf
harness_requirements:
- file_write
- office_libs
trap_count: 7
scoring:
perspectives: [requester, sme, end_user, production, adversary]
criteria: [brief_fidelity, trap_handling, production_correctness, domain_judgment, long_horizon_carry]
weights: # must sum to 1.0
brief_fidelity: 0.30
trap_handling: 0.25
production_correctness: 0.20
domain_judgment: 0.15
long_horizon_carry: 0.10
```
Adjust weights based on what the user said matters: trap-heavy benchmarks bump `trap_handling` to 30–40%; taste-heavy benchmarks bump `domain_judgment` to 25–30%.
# SCORING METHODOLOGY (REFERENCE)
The LLM never picks a number. The LLM finds evidence; math computes the score.
**Per-criterion formula** (runs once per criterion's items):
```
net_impact = sum(item.impact for item in items)
total_items = len(items)
normalized_impact = net_impact / sqrt(total_items)
raw_score = clamp(50 + normalized_impact * 8.0, 0, 100)
evidence_density = total_items / 20
confidence_multiplier = 0.75 + 0.25 * clamp(evidence_density, 0, 1) # never > 1.0
final = round(50 + (raw_score - 50) * confidence_multiplier)
confidence = clamp(evidence_density, 0, 1)
```
**Across-criteria** (overall):
```
overall_score = round(sum(per_criterion[c].final * weight[c]))
overall_confidence = min(per_criterion[c].confidence)
self_check_span = max - min across the 5 final scores (must be ≥ 20)
```
Constraints: hard cap 5 items per perspective per criterion per pass; minimum target 3.
# AFTER AUTHORING
1. Show the user the list of benchmark folders you created.
2. Walk through each one's prompt + traps. Ask: *"Does this feel like your work? What's missing? What's too easy?"*
3. Iterate until they say "yes, this would tell me something I don't already know."
4. Mark the suite v0.1 in `benchmarks/_profile.md` and stop.
# IF YOU GET STUCK
- Generic answers? Ask for a real document or dataset.
- User resists the interview? Offer the alternative: "Paste me 3 recent deliverables and 1 recent disappointment, and I'll work backwards into the axes."
- Can't author with confidence? Ask one more question rather than fabricating.
# DONE
When the user is happy, write a short summary to chat:
- N benchmarks authored
- M capability axes covered
- K traps planted across the suite
- Recommended first run: which model × harness combos to test first
Then stop. The benchmark suite is the deliverable. Let the user run it.