ArchEval calls for challenges.

ArchEval is a benchmark for AI agents doing real computer-architecture research — designing cache replacement policies, branch predictors, prefetchers; reasoning about which CPU design wins for a given workload. We're opening it to the community: contribute a challenge that a language model cannot solve by recalling something it saw during training.

Section 01 / 05

What we look for

Three principles. They're how a maintainer judges an incoming submission at a glance. Behind the scenes each one expands into a short checklist (eleven items in total). You don't need to memorize the checklist: when you click Run automated reviewer on the form below, an LLM rates your draft against every item and tells you, point by point, what's missing.

An agentic design question.

Ideally hard. Ideally unsolved.

A real research question an autonomous agent can attack — not a multiple-choice question, not "implement algorithm X from paper Y." If a graduate student could find the answer in published papers, blogs, or vendor whitepapers, a language model has already read it.

A self-contained workspace.

Everything the agent needs ships with the challenge.

Starter code, simulator binaries (or container image), traces, configs, task description — all bundled in, or linked with a working download URL for things too large to ship inline. The agent starts with a complete working environment; nothing on the host matters.

A clearly specified evaluation.

Simulator, metric, baseline — written down.

One simulation script the agent calls (and cannot edit). One scalar number it produces. A reference baseline to compare against. Reproducible across runs, and short enough that the agent can iterate — target 10–30 minutes per evaluation; one hour is a hard ceiling.
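As a rough self-check before submitting, a contributor could record a few repeated evaluation runs and test them against the two properties named above: the scalar is reproducible, and the run fits the time budget. The sketch below is only an illustration; the `EvalRun` record, the `review_runs` helper, and the zero default tolerance are hypothetical and not part of ArchEval.

```python
from dataclasses import dataclass

TARGET_MINUTES = 30.0   # upper end of the 10-30 minute target
CEILING_MINUTES = 60.0  # hard ceiling of one hour


@dataclass
class EvalRun:
    """One recorded evaluation run (hypothetical record, for illustration)."""
    metric: float   # the single scalar the evaluation script produced
    minutes: float  # wall-clock duration of that run


def review_runs(runs, tolerance=0.0):
    """Return a list of problems found; an empty list means none were found."""
    if not runs:
        return ["no runs recorded"]
    problems = []
    metrics = [r.metric for r in runs]
    # Reproducible means repeated runs of the same code give the same scalar.
    if max(metrics) - min(metrics) > tolerance:
        problems.append("metric is not reproducible across runs")
    slowest = max(r.minutes for r in runs)
    if slowest > CEILING_MINUTES:
        problems.append("a run exceeds the one-hour ceiling")
    elif slowest > TARGET_MINUTES:
        problems.append("a run is slower than the 30-minute target")
    return problems


if __name__ == "__main__":
    same_code_twice = [EvalRun(0.6968, 29.5), EvalRun(0.6968, 30.0)]
    print(review_runs(same_code_twice))  # prints []
```

If the simulator is known to be slightly noisy, passing a small non-zero `tolerance` is one way to state how much variation the challenge author considers acceptable.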

Section 02 / 05

What you get back

Contributing a problem we accept into ArchEval gets you:

(TODO: reward_1_title)

(TODO: reward_1_body — e.g. "Co-author credit on the ArchEval benchmark paper and the public leaderboard.")

(TODO: reward_2_title)

(TODO: reward_2_body — e.g. "A named contributor card on the public site, linked from every run that uses your challenge.")

(TODO: reward_3_title)

(TODO: reward_3_body — e.g. "Early access to the next round of frontier-model results on the challenges you helped design.")

Placeholder — the lab is finalizing the contributor-recognition policy. This section will be filled in before public launch.

Section 03 / 05

An existing challenge, for reference

Below is LLC Replacement: IPC Improvement over LRU — the cache-replacement challenge already running on ArchEval, and the one this submission form was modeled on. The agent has to design a last-level cache replacement policy under a 4 KB metadata budget and beat LRU's IPC on a SPEC CPU2006 trace.
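To give a feel for how tight a 4 KB metadata budget is, the sketch below converts it to bits per cache line. The cache geometry used here (2048 sets, 16 ways) is an assumption chosen for illustration; the page does not state the values of LLC_SETS and LLC_WAYS, and the per-line rank encoding for LRU is likewise only one plausible way to store recency.

```python
BUDGET_BYTES = 4096             # the challenge's metadata budget (4 KB)
BUDGET_BITS = BUDGET_BYTES * 8  # 32768 bits


def bits_per_line(llc_sets, llc_ways, budget_bits=BUDGET_BITS):
    """Average metadata bits available per cache line under the budget."""
    return budget_bits / (llc_sets * llc_ways)


def rank_lru_bytes(llc_sets, llc_ways):
    """Bytes needed if every line stores its recency rank within the set.

    ASSUMPTION: one rank of ceil(log2(ways)) bits per line; a real LRU
    implementation may encode recency differently.
    """
    rank_bits = (llc_ways - 1).bit_length()
    return llc_sets * llc_ways * rank_bits // 8


if __name__ == "__main__":
    sets, ways = 2048, 16  # ASSUMPTION: illustrative geometry, not from the page
    print(bits_per_line(sets, ways))   # prints 1.0
    print(rank_lru_bytes(sets, ways))  # prints 16384
```

Under those assumed numbers the budget allows about one bit per line, while a rank-per-line LRU would need four times the budget, which is one way to read the starter note that the LRU reference "may exceed your budget".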

Every ArchEval challenge is defined by one YAML file we call challenge.yaml. The form below builds that file for you, section by section. Inside each form section you'll find a collapsed snippet — that's the matching slice from this challenge's real challenge.yaml, so you can see what your answers turn into. You don't need to read or write any YAML.

At a glance

simulator: ChampSim v6 (trace-driven, cycle-level x86 CPU)

workload: SPEC CPU2006 482.sphinx3 (speech recognition)

metric: IPC (higher is better)

baseline: LRU → IPC 0.6968 · theoretical limit at 4 KB ≈ 0.7344

constraint: 4 KB metadata budget, no heap allocation, Clang AST bit-width check

latency: ≈ 30 min per simulation; up to 5 simulation submissions
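The figures listed above imply how much headroom the agent has and how much simulator time it can spend. The short calculation below works that out; the constant and function names are made up for this example, and the 30-minute figure is the approximate value quoted above.

```python
LRU_IPC = 0.6968        # LRU baseline IPC
LIMIT_IPC_4KB = 0.7344  # listed theoretical limit at 4 KB
MINUTES_PER_SIM = 30    # approximate latency of one simulation
MAX_SIMS = 5            # completed simulations allowed


def speedup(ipc, baseline=LRU_IPC):
    """IPC speedup relative to the baseline (1.0 means no change)."""
    return ipc / baseline


if __name__ == "__main__":
    headroom = speedup(LIMIT_IPC_4KB) - 1.0
    print(f"largest possible gain over LRU: {headroom:.1%}")  # prints 5.4%
    print(f"simulator time if every run is used: {MINUTES_PER_SIM * MAX_SIMS} min")  # prints 150 min
```

So the whole contest plays out inside a gain of roughly 5.4% over LRU, with about two and a half hours of simulation available in total.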

View this challenge's challenge.yaml

Public slice of the file. Maintainer-only blocks (which AI runtime is wired up, where its credentials live, image tags) are kept private.

id: cache_replacement
name: "LLC Replacement: IPC Improvement over LRU"
difficulty: hard
simulator: champsim
category: compete


storage_limit_bytes: 4096
storage_check_script: check_storage.py


prompt: |
  <task>
  Design an LLC replacement policy in /workspace/archeval_discovery.{h,cc}
  that maximizes IPC under a {storage_limit_kb} KB metadata budget.
  Iterate via submit() until time runs out.
  </task>

  <environment>
  Code to write: /workspace/archeval_discovery.{h,cc}
  API docs: /api/ (start with reference.md and quickstart.md)
  LRU reference: /workspace/starter/ (example showing the API, may exceed your budget)
  Traces: /traces/decoded/ (tab-separated, ~200K instructions per trace,
    columns: idx, ip, is_branch, branch_taken, dst_regs, src_regs, dst_mem, src_mem)
  </environment>

  <constraints>
  Physical hardware SRAM design — not software.
  1. NO dynamic allocation (vector, map, list, new, malloc FORBIDDEN)
  2. Use std::array and bitfields (e.g. uint32_t state : 3)
  3. Clang AST bit-width check — only declared widths count
  4. Self-audit: `python3 {storage_check_script} .`
  5. Budget: {storage_limit_bytes} bytes ({storage_limit_kb} KB)
     LLC_SETS and LLC_WAYS provided as compile-time macros
  </constraints>

  <evaluation>
  Each submit reports IPC speedup vs LRU. No threshold — maximize IPC.
  You have {max_submissions} simulation submissions. Compilation failures
  and storage check failures do NOT count — only completed simulations.
  </evaluation>


# The agent runtimes that execute this challenge (which AI, which tools,
# how each model is wired up to the simulator) are kept private by
# maintainers and are not shown here.


simulator_config:
  script: simulate.sh
  warmup: 100000000
  simulation: 500000000
  traces:
    - 482.sphinx3-1100B.champsimtrace.xz
  component_dir: replacement
  component_name: archeval_discovery

eval:
  metric: ipc
  direction: higher_is_better
  threshold: 999.0
  type: rank
  max_submissions: 5
  max_code_lines: 1000
  baseline: baseline.json
  reference:
    lru_baseline: 0.6968
    theoretical_limits:
      "4k":  0.7344
      "8k":  0.7344
      "16k": 0.7410
      "32k": 0.8069
  baselines:
    lru:       0.6968
    ship_rrip: 0.7344   # SHiP+RRIP performance at 16k+

input:
  starter_files:
    - archeval_discovery.h
    - archeval_discovery.cc
output:
  files:
    - archeval_discovery.h
    - archeval_discovery.cc

source_blocklist:
  - "/archeval/runtimes/champsim/replacement/*"
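The prompt in the file above contains placeholders such as {storage_limit_kb} and {max_submissions} that correspond to other fields in the same file. How ArchEval actually fills them in is not shown on this page; the sketch below is one possible approach, with a hypothetical `render` helper and the assumption that the KB figure is derived from storage_limit_bytes. It substitutes only known names, because the prompt also contains literal braces like archeval_discovery.{h,cc} that plain str.format would trip over.

```python
import re

# Values copied from the public challenge.yaml slice above.
FIELDS = {
    "storage_limit_bytes": 4096,
    "storage_check_script": "check_storage.py",
    "max_submissions": 5,
}
# ASSUMPTION: the KB figure is derived from the byte budget.
FIELDS["storage_limit_kb"] = FIELDS["storage_limit_bytes"] // 1024

# Matches {name} where name is lowercase letters and underscores only,
# so shell-style braces such as {h,cc} are never candidates.
PLACEHOLDER = re.compile(r"\{([a-z_]+)\}")


def render(template, fields=FIELDS):
    """Replace {name} when name is a known field; leave anything else as-is."""
    def substitute(match):
        name = match.group(1)
        return str(fields[name]) if name in fields else match.group(0)
    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    line = "Budget: {storage_limit_bytes} bytes ({storage_limit_kb} KB) in archeval_discovery.{h,cc}"
    print(render(line))
    # prints: Budget: 4096 bytes (4 KB) in archeval_discovery.{h,cc}
```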

Section 04 / 05

Submit a challenge

Five short sections. You're filling in the equivalent of a challenge.yaml file — each form section names the YAML field it maps to in parentheses, in case you've already seen one. You don't need to read or write any YAML; the form does that for you. Optional "Show how this looks in our example" panels reveal the corresponding slice from the reference challenge above, if you want a concrete look. Your draft saves to this browser as you type.

Only fields marked with * are required to submit: title, category, what the agent does, your name, and your email. The rest is encouraged: fill in what you can, leave the rest blank, and we'll follow up.

Where each button sends your work.
  • Submit to ArchEval: POSTs your form (and any uploaded files) directly to the ArchEval maintainers' server, not to GitHub. You'll get a submission ID back, and a maintainer will email you. Files are kept on our server until reviewed and then archived.
  • Run automated reviewer: private self-check only. Your draft is sent to a Claude Haiku model for rubric feedback. Nothing is stored on our side.

Want to see this filled in? Load the reference cache-replacement challenge into the form below.

New to ArchEval? Start with the interviewer — it'll ask one question at a time and fill in the form for you to review at the end.

ArchEval interviewer

Powered by Claude Haiku via vectorengine.ai. The conversation stays in your browser; the maintainers don't see it. When the interviewer has enough, the form below gets filled in for you.

Submit records your draft and opens a public PR with the LLM reviewer's critique attached, so the community can discuss it on GitHub. Self-check is private rubric feedback only — nothing leaves your browser except the text of your draft, which is sent to the reviewing language model and not stored. Submit anyway does the same thing as Submit but adds a "submitter disagrees with the verdict" flag to the PR — use it when you think the LLM got it wrong.

Section 05 / 05

Got feedback on this site, the rubric, or the benchmark?

Drop a note in the form below. We read every entry. Comments on the form itself, on what the automated reviewer got wrong, on missing categories, on rewards we should offer, or just on whether you'd personally contribute — all welcome.

The maintainers are still publishing the Google Form. In the meantime, email chenyu_wang@seas.harvard.edu with the subject ArchEval feedback, and we'll fold it in.