ArchEval calls for challenges.

ArchEval is a benchmark for AI agents doing real computer-architecture research — designing cache replacement policies, branch predictors, prefetchers; reasoning about which CPU design wins for a given workload. We're opening it to the community: contribute a challenge that a language model cannot solve by recalling something it saw during training.

Submit a challenge →

Or scroll down first to "What we look for" or "An existing challenge."

Section 01 / 05

What we look for

Three principles. They're how a maintainer judges an incoming submission at a glance. Behind the scenes each one expands into a short checklist (eleven items in total). You don't need to memorize the checklist: when you click Run automated reviewer on the form below, an LLM rates your draft against every item and tells you, point by point, what's missing.

An agentic design question.

Ideally hard. Ideally unsolved.

A real research question an autonomous agent can attack — not a multiple-choice question, not "implement algorithm X from paper Y." If a graduate student could find the answer in published papers, blogs, or vendor whitepapers, a language model has already read it.

A self-contained workspace.

Everything the agent needs ships with the challenge.

Starter code, simulator binaries (or container image), traces, configs, task description — all bundled in, or linked with a working download URL for things too large to ship inline. The agent starts with a complete working environment; nothing on the host matters.

A clearly specified evaluation.

Simulator, metric, baseline — written down.

One simulation script the agent calls (and cannot edit). One scalar number it produces. A reference baseline to compare against. Reproducible across runs, and short enough that the agent can iterate — target 10–30 minutes per evaluation; one hour is a hard ceiling.

Section 02 / 05

What you get back

Contributing a problem we accept into ArchEval gets you:

(TODO: reward_1_title)

(TODO: reward_1_body — e.g. "Co-author credit on the ArchEval benchmark paper and the public leaderboard.")

(TODO: reward_2_title)

(TODO: reward_2_body — e.g. "A named contributor card on the public site, linked from every run that uses your challenge.")

(TODO: reward_3_title)

(TODO: reward_3_body — e.g. "Early access to the next round of frontier-model results on the challenges you helped design.")

Placeholder — the lab is finalizing the contributor-recognition policy. This section will be filled in before public launch.

Section 03 / 05

An existing challenge, for reference

Below is LLC Replacement: IPC Improvement over LRU — the cache-replacement challenge already running on ArchEval, and the one this submission form was modeled on. The agent has to design a last-level cache replacement policy under a 4 KB metadata budget and beat LRU's IPC on a SPEC CPU2006 trace.

Every ArchEval challenge is defined by one YAML file we call challenge.yaml. The form below builds that file for you, section by section. Inside each form section you'll find a collapsed snippet — that's the matching slice from this challenge's real challenge.yaml, so you can see what your answers turn into. You don't need to read or write any YAML.

At a glance

simulator ChampSim v6 (trace-driven, cycle-level x86 CPU)

workload SPEC CPU2006 482.sphinx3 (speech recognition)

metric IPC (higher is better)

baseline LRU → IPC 0.6968 · theoretical limit at 4 KB ≈ 0.7344

constraint 4 KB metadata budget, no heap allocation, Clang AST bit-width check

latency ≈ 30 min per simulation; 5 simulations per attempt (each submit() call runs one simulation)

View this challenge's challenge.yaml ↓

Public slice of the file. Maintainer-only blocks (which AI runtime is wired up, where its credentials live, image tags) are kept private.

id: cache_replacement
name: "LLC Replacement: IPC Improvement over LRU"
difficulty: hard
simulator: champsim
category: compete


storage_limit_bytes: 4096
storage_check_script: check_storage.py


prompt: |
  <task>
  Design an LLC replacement policy in /workspace/archeval_discovery.{h,cc}
  that maximizes IPC under a {storage_limit_kb} KB metadata budget.
  Iterate via submit() until time runs out.
  </task>

  <environment>
  Code to write: /workspace/archeval_discovery.{h,cc}
  API docs: /api/ (start with reference.md and quickstart.md)
  LRU reference: /workspace/starter/ (example showing the API, may exceed your budget)
  Traces: /traces/decoded/ (tab-separated, ~200K instructions per trace,
    columns: idx, ip, is_branch, branch_taken, dst_regs, src_regs, dst_mem, src_mem)
  </environment>

  <constraints>
  Physical hardware SRAM design — not software.
  1. NO dynamic allocation (vector, map, list, new, malloc FORBIDDEN)
  2. Use std::array and bitfields (e.g. uint32_t state : 3)
  3. Clang AST bit-width check — only declared widths count
  4. Self-audit: `python3 {storage_check_script} .`
  5. Budget: {storage_limit_bytes} bytes ({storage_limit_kb} KB)
     LLC_SETS and LLC_WAYS provided as compile-time macros
  </constraints>

  <evaluation>
  Each submit reports IPC speedup vs LRU. No threshold — maximize IPC.
  You have {max_submissions} simulation submissions. Compilation failures
  and storage check failures do NOT count — only completed simulations.
  </evaluation>


# The agent runtimes that execute this challenge (which AI, which tools,
# how each model is wired up to the simulator) are kept private by
# maintainers and are not shown here.


simulator_config:
  script: simulate.sh
  warmup: 100000000
  simulation: 500000000
  traces:
    - 482.sphinx3-1100B.champsimtrace.xz
  component_dir: replacement
  component_name: archeval_discovery

eval:
  metric: ipc
  direction: higher_is_better
  threshold: 999.0
  type: rank
  max_submissions: 5
  max_code_lines: 1000
  baseline: baseline.json
  reference:
    lru_baseline: 0.6968
    theoretical_limits:
      "4k":  0.7344
      "8k":  0.7344
      "16k": 0.7410
      "32k": 0.8069
  baselines:
    lru:       0.6968
    ship_rrip: 0.7344   # SHiP+RRIP performance at 16k+

input:
  starter_files:
    - archeval_discovery.h
    - archeval_discovery.cc
output:
  files:
    - archeval_discovery.h
    - archeval_discovery.cc

source_blocklist:
  - "/archeval/runtimes/champsim/replacement/*"

Section 04 / 05

Submit a challenge

Five short sections. You're filling in the equivalent of a challenge.yaml file — each form section names the YAML field it maps to in parentheses, in case you've already seen one. You don't need to read or write any YAML; the form does that for you. Optional "Show how this looks in our example" panels reveal the corresponding slice from the reference challenge above, if you want a concrete look. Your draft saves to this browser as you type.

Only fields marked with * are required to submit: title, category, what the agent does, your name, email. The rest is encouraged — fill what you can, leave the rest blank, we'll follow up.

Where each button sends your work.

Submit to ArchEval — POSTs your form (and any uploaded files) directly to the ArchEval maintainers' server. Not to GitHub. You'll get a submission ID back and a maintainer will email you. Files are kept on our server until reviewed and then archived.
Run automated reviewer — private self-check only. Your draft is sent to a Claude Haiku model for rubric feedback. Nothing is stored on our side.

Want to see this filled in? Load the reference cache-replacement challenge into the form below.

New to ArchEval? Start with the interviewer — it'll ask one question at a time and fill in the form for you to review at the end.

ArchEval interviewer

Powered by Claude Haiku via vectorengine.ai. The conversation stays in your browser; the maintainers don't see it. When the interviewer has enough, the form below gets filled in for you.


Section A · maps to prompt:

Task prompt

This is what the agent reads at the start of each run — its instructions and a description of the workspace it lives in. In the agent's prompt, submit() refers to the API call the agent makes to compile its code and run one simulation; you don't write that — we wire it up.

Show this section filled in for the cache-replacement reference challenge ↓

Below is the resulting challenge.yaml excerpt — what the form turns into:

prompt: |
  <task>
  Design an LLC replacement policy in /workspace/archeval_discovery.{h,cc}
  that maximizes IPC under a {storage_limit_kb} KB metadata budget.
  Iterate via submit() until time runs out.
  </task>

  <environment>
  Code to write: /workspace/archeval_discovery.{h,cc}
  API docs: /api/ (start with reference.md and quickstart.md)
  LRU reference: /workspace/starter/ (example showing the API, may exceed your budget)
  Traces: /traces/decoded/ (tab-separated, ~200K instructions per trace,
    columns: idx, ip, is_branch, branch_taken, dst_regs, src_regs, dst_mem, src_mem)
  </environment>

  <constraints>
  Physical hardware SRAM design — not software.
  1. NO dynamic allocation (vector, map, list, new, malloc FORBIDDEN)
  2. Use std::array and bitfields (e.g. uint32_t state : 3)
  3. Clang AST bit-width check — only declared widths count
  4. Self-audit: `python3 {storage_check_script} .`
  5. Budget: {storage_limit_bytes} bytes ({storage_limit_kb} KB)
     LLC_SETS and LLC_WAYS provided as compile-time macros
  </constraints>

  <evaluation>
  Each submit reports IPC speedup vs LRU. No threshold — maximize IPC.
  You have {max_submissions} simulation submissions. Compilation failures
  and storage check failures do NOT count — only completed simulations.
  </evaluation>
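The excerpt above mixes two kinds of braces: placeholders such as {storage_limit_kb} and {max_submissions}, and the literal shell-style path archeval_discovery.{h,cc}. The page does not say how ArchEval fills the placeholders, so the following is only a sketch of one way to do it. The render_prompt helper and the FIELDS dictionary are hypothetical names introduced for illustration. It substitutes known field names and leaves every other brace untouched, which a plain str.format call would not do (it would raise on {h,cc}).

```python
import re

# Hypothetical field values for illustration; in the reference challenge the
# corresponding numbers appear in challenge.yaml.
FIELDS = {
    "storage_limit_bytes": 4096,
    "storage_limit_kb": 4,
    "storage_check_script": "check_storage.py",
    "max_submissions": 5,
}

# Matches {identifier} only, so "{h,cc}" (which contains a comma) never matches.
PLACEHOLDER = re.compile(r"\{([a-z_][a-z0-9_]*)\}")


def render_prompt(template: str, fields: dict) -> str:
    """Fill {name} placeholders for known fields; leave everything else as-is."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(fields[name]) if name in fields else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


template = (
    "Design an LLC replacement policy in /workspace/archeval_discovery.{h,cc}\n"
    "that maximizes IPC under a {storage_limit_kb} KB metadata budget.\n"
    "You have {max_submissions} simulation submissions.\n"
)

if __name__ == "__main__":
    print(render_prompt(template, FIELDS))
```

If you write literal braces in your own task prompt, a quick pass like this is a cheap way to check that they will not be mistaken for placeholders.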

Challenge title *

Section B · maps to input:, output:, source_blocklist:

Workspace setup

What does the agent start with? Code skeleton (if any), simulator configs, helper scripts, task descriptions, workload traces: every file the agent reads or runs to attempt the challenge. Send it to us in one of three ways:

Zip it up and upload here — fine for the small stuff (starter code, configs, task notes; under ~100 MB).
Paste a public download link — Google Drive, Dropbox, S3 bucket, GitHub release, plain HTTP. Best for instruction traces, large datasets, container images. Strongly encouraged when anything is over ~100 MB — most architecture workloads are. We need to fetch it the same way the agent will, so reproducibility is preserved.
A wget / curl one-liner we can run — paste it in the URL field. Useful if the asset lives on a public mirror or behind a known CDN.

Show this section filled in for the cache-replacement reference challenge ↓

Below is the resulting challenge.yaml excerpt — what the form turns into:

input:
  starter_files:
    - archeval_discovery.h
    - archeval_discovery.cc
output:
  files:
    - archeval_discovery.h
    - archeval_discovery.cc
source_blocklist:
  - "/archeval/runtimes/champsim/replacement/*"

Upload a zip (or several small files)

Hard cap ~100 MB total upload. For traces and big datasets, skip this and use the URL field below — much faster and more reproducible.

Public download URL or wget/curl command (strongly encouraged for traces / large files)

Accepts a Drive/Dropbox/S3/GitHub-release link, or any plain HTTP URL, or a one-line shell command we can run. Make sure it doesn't require auth — we should be able to fetch it from a clean machine.

Which files the agent writes as its solution

The output files the agent produces. The agent may be writing from scratch — there does not have to be a starter file to edit.

Which files / paths the agent must NOT be able to edit

This prevents the agent from editing the simulator or its config to inflate its score — only the agent's own solution files are writable.
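To get a feel for what a path pattern like the one in the reference source_blocklist would cover, here is a minimal sketch using Python's standard fnmatch module. It assumes shell-style glob matching; the page does not describe how ArchEval actually interprets or enforces these patterns, and the is_blocked helper and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern copied from the reference challenge's source_blocklist.
SOURCE_BLOCKLIST = ["/archeval/runtimes/champsim/replacement/*"]


def is_blocked(path: str, patterns=SOURCE_BLOCKLIST) -> bool:
    """True if the path matches any blocklist glob.

    Note: in fnmatch, '*' also matches '/', so nested files are covered too.
    """
    return any(fnmatchcase(path, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(is_blocked("/archeval/runtimes/champsim/replacement/lru/lru.cc"))
    print(is_blocked("/workspace/archeval_discovery.cc"))
```

Under these assumptions the first path is blocked and the second is not. Listing a few concrete paths you expect on each side of the line is a useful check before you submit.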

Section C · maps to simulator_config:, eval:

Evaluation

The simulator, the metric, the baseline. You don't need to choose the model the agent runs on or wire up tools; we fill in the agent: block (model choice, tool plumbing, prompt scaffolding) ourselves once your challenge is accepted. Just give us enough simulator and metric detail for us to do so.

Show this section filled in for the cache-replacement reference challenge ↓

Below is the resulting challenge.yaml excerpt — what the form turns into:

simulator_config:
  script: simulate.sh
  warmup: 100000000
  simulation: 500000000
  traces:
    - 482.sphinx3-1100B.champsimtrace.xz
  component_dir: replacement
  component_name: archeval_discovery

eval:
  metric: ipc
  direction: higher_is_better
  max_submissions: 5
  max_code_lines: 1000
  baseline: baseline.json
  reference:
    lru_baseline: 0.6968
    theoretical_limits:
      "4k":  0.7344
      "8k":  0.7344
      "16k": 0.7410
      "32k": 0.8069
  baselines:
    lru:       0.6968
    ship_rrip: 0.7344   # SHiP+RRIP performance at 16k+
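The reference values in the excerpt above are easier to interpret as relative gains over LRU. The sketch below does that arithmetic; it assumes "speedup" means the plain ratio of IPC to the LRU baseline, which this page does not spell out, and the function names are hypothetical.

```python
# Reference values copied from the eval: excerpt above.
LRU_BASELINE_IPC = 0.6968
THEORETICAL_LIMIT_IPC = {"4k": 0.7344, "8k": 0.7344, "16k": 0.7410, "32k": 0.8069}


def speedup_over_baseline(ipc: float, baseline_ipc: float = LRU_BASELINE_IPC) -> float:
    """Return IPC relative to the baseline (1.0 means 'same as LRU')."""
    if baseline_ipc <= 0:
        raise ValueError("baseline IPC must be positive")
    return ipc / baseline_ipc


def headroom_percent(budget: str) -> float:
    """Percent IPC gain over LRU if a policy reached the listed limit for a budget."""
    return (speedup_over_baseline(THEORETICAL_LIMIT_IPC[budget]) - 1.0) * 100.0


if __name__ == "__main__":
    for budget in THEORETICAL_LIMIT_IPC:
        print(f"{budget}: {headroom_percent(budget):.2f}% over LRU")
```

With these numbers the listed limit at 4 KB sits roughly 5.4% above LRU. When you report your own baseline, giving a similar reference point helps reviewers judge how much room the agent has to work with.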

Simulator name and version

If a container image already exists, paste its tag here too.

Workload / trace and where it comes from

Primary metric

Direction

Higher is better / Lower is better

Reference baseline (optional, but strongly encouraged)

Helps both the agent and reviewers tell whether a result is good. Description alone is fine; a measured value is better; working baseline code is best. If you don't have a baseline yet, leave blank — we can build one together after the proposal is accepted.

Estimated wall time per simulation (min)

How long one full simulation takes end-to-end on a single workstation. Target 10–30; hard ceiling 60. (How many simulations the agent gets per attempt is set by maintainers.)
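As a rough aid for this field, the sketch below classifies an estimated wall time against the stated 30-minute target and 60-minute ceiling, and multiplies it out over a number of simulations. The function names and category labels are hypothetical, and treating anything at or under 30 minutes (including runs shorter than 10) as on target is an assumption.

```python
TARGET_MAX_MINUTES = 30     # upper end of the 10-30 minute target
HARD_CEILING_MINUTES = 60   # stated hard ceiling


def classify_wall_time(minutes: float) -> str:
    """Label an estimated per-simulation wall time against the stated limits."""
    if minutes <= 0:
        raise ValueError("wall time must be positive")
    if minutes > HARD_CEILING_MINUTES:
        return "over_ceiling"
    if minutes > TARGET_MAX_MINUTES:
        return "above_target"
    return "on_target"


def total_simulation_minutes(minutes_per_simulation: float, simulations: int) -> float:
    """Simulation time for one attempt, ignoring compile and agent thinking time."""
    return minutes_per_simulation * simulations


if __name__ == "__main__":
    # Reference challenge: about 30 minutes per simulation, 5 simulations.
    print(classify_wall_time(30), total_simulation_minutes(30, 5))
```

For the reference challenge this comes to about 150 minutes of simulation per attempt, which is one reason shorter simulations are encouraged.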

Is it possible to shorten the simulation?

Optional. Shorter iterations let agents try more designs; matters most when wall time is approaching 60 min.

Section D

Constraints & trade-offs

Without an explicit constraint, agents find degenerate solutions.

For example: ask the agent to minimize branch-MPKI on a branch target buffer without capping its size, and it will simply propose an "unlimited" buffer — perfect score, no design happened. A challenge is only interesting when there's a real budget the agent has to spend wisely. Tell us what the budget is and what trade-off it forces.

Show how this looks in our example ↓

storage_limit_bytes: 4096
storage_check_script: check_storage.py

# the user-visible <constraints> block in the prompt above:
#   1. NO dynamic allocation (vector, map, list, new, malloc FORBIDDEN)
#   2. Use std::array and bitfields (e.g. uint32_t state : 3)
#   3. Clang AST bit-width check — only declared widths count
#   4. Self-audit script runs before each simulation
#   5. Budget: 4096 bytes (4 KB)
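A little arithmetic shows why a budget like this forces a trade-off. The sketch below assumes a cache geometry of 2048 sets and 16 ways purely for illustration; the real values reach the agent as the LLC_SETS and LLC_WAYS macros and are not stated on this page. It also says nothing about how the Clang AST check counts bits. The helper names are hypothetical.

```python
STORAGE_LIMIT_BYTES = 4096  # from storage_limit_bytes in the example above

# Hypothetical geometry, for illustration only.
LLC_SETS = 2048
LLC_WAYS = 16


def bits_per_line(budget_bytes: int, sets: int, ways: int) -> float:
    """Metadata bits available per cache line if the budget is spread evenly."""
    return budget_bytes * 8 / (sets * ways)


def fits_budget(budget_bytes: int, *table_bits: int) -> bool:
    """True if the summed bit counts of all metadata tables fit in the budget."""
    return sum(table_bits) <= budget_bytes * 8


if __name__ == "__main__":
    lines = LLC_SETS * LLC_WAYS
    print(bits_per_line(STORAGE_LIMIT_BYTES, LLC_SETS, LLC_WAYS))
    print(fits_budget(STORAGE_LIMIT_BYTES, lines * 1))  # 1 bit per line
    print(fits_budget(STORAGE_LIMIT_BYTES, lines * 2))  # 2 bits per line
```

Under the assumed geometry the budget works out to one bit per line, so even a 2-bit per-line state would not fit and the agent has to decide where to spend its bits. A constraint description that supports this kind of back-of-the-envelope check is usually a sign the budget is tight enough to matter.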

The constraint(s)

Resource budgets: storage bytes, simulated cycles, area, energy, … and how they're enforced.

What trade-off does this constraint force the agent into?

Without this constraint, what's the trivial / degenerate solution?

If you can't name a trivial solution, the constraint isn't biting — go back and tighten it.

Section E

About you

That's everything about the challenge. We just need to know who you are so a maintainer can follow up.

Your name

Affiliation (optional)

Anything you'd like feedback on (optional)

Submit records your draft and opens a public PR with the LLM reviewer's critique attached, so the community can discuss it on GitHub. Self-check is private rubric feedback only — nothing leaves your browser except the text of your draft, which is sent to the reviewing language model and not stored. Submit anyway does the same thing as Submit but adds a "submitter disagrees with the verdict" flag to the PR — use it when you think the LLM got it wrong.

Section 05 / 05

Got feedback on this site, the rubric, or the benchmark?

Drop a note in the form below. We read every entry. Comments on the form itself, on what the automated reviewer got wrong, on missing categories, on rewards we should offer, or just on whether you'd personally contribute — all welcome.


The maintainers are still setting up the Google Form. In the meantime, email chenyu_wang@seas.harvard.edu with the subject "ArchEval feedback" and we'll fold it in.