Read HVM_THREADS at startup: runtime worker count for the C runtime #441

Open
SiliconState wants to merge 1 commit into HigherOrderCO:main from SiliconState:pr/hvm-threads
SiliconState commented Jun 10, 2026

Problem

The parallel C runtime fixes its worker count at compile time (TPC = 1 << TPC_L2). There is no way to choose the thread count at run time, so:

  • You cannot measure parallel scaling (speedup vs N threads), the single most useful number for a runtime whose headline feature is automatic parallelism.
  • Users on shared machines or CI cannot cap the number of cores used.
  • Benchmarking and regression tooling cannot pin a thread count for reproducibility.

Design

An environment variable HVM_THREADS, read by the C runtime at startup. Minimal and fully backward-compatible: unset (or invalid) means exactly the current compile-time TPC behavior, and output stays byte-for-byte identical.

  • HVM_THREADS=N → use clamp(N, 1, TPC) active workers. Capping at the compiled TPC keeps the existing fixed-size thread pool / per-thread net partitioning intact; a follow-up could allow growing past TPC.
  • The Rust interpreter (hvm run) ignores it.
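
A minimal sketch of the clamping and fallback rule above, not the code in this patch. The names parse_thread_count and active_workers are hypothetical, and tpc_max stands in for the compiled TPC. It treats unset, empty, trailing-garbage, and below-1 values as invalid and caps anything larger at tpc_max.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical helper (illustration only): parse a thread-count string.
// Returns the count capped at tpc_max, or 0 when the text is invalid
// (NULL, empty, no digits, trailing garbage, or a value below 1).
long parse_thread_count(const char *text, long tpc_max) {
  assert(tpc_max >= 1);
  if (text == NULL || *text == '\0') return 0;

  char *end = NULL;
  long n = strtol(text, &end, 10);

  // end == text: nothing was parsed; *end != '\0': trailing garbage.
  if (end == text || *end != '\0') return 0;
  if (n < 1) return 0;

  // On overflow strtol returns LONG_MAX, which is capped here as well.
  return n > tpc_max ? tpc_max : n;
}

// Hypothetical wrapper: read HVM_THREADS once and fall back to the
// compiled default when the variable is unset or invalid.
long active_workers(long tpc_max) {
  long n = parse_thread_count(getenv("HVM_THREADS"), tpc_max);
  return n > 0 ? n : tpc_max;
}
```

With tpc_max = 16, parse_thread_count("32", 16) returns 16 and parse_thread_count("8x", 16) returns 0, so active_workers falls back to 16. Note that strtol itself skips leading whitespace and accepts a leading +, so this sketch accepts " 4" and "+4" as 4.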

Patch (one file: src/hvm.c, +37/−7)

Because run.c does #include "hvm.c" and gen-c embeds the hvm.c source text, this single file covers both hvm run-c and standalone gen-c programs.

  1. New globals next to the TPC define: hvm_tpc (active worker count) and hvm_tpc_from_env. hvm_tpc_init() reads getenv("HVM_THREADS") via strtol. Unset, empty, trailing-garbage, and < 1 values are rejected and keep the compiled TPC; accepted values are clamped to [1, TPC]. It is called at the top of hvm_c(), before any thread exists.
  2. The active count replaces TPC at the four places that mean "worker count": the sync_threads() spin-barrier target, the evaluator() idle-counter init and halt check, and the normalize() spawn/join loops. Per-thread memory partitioning keeps the compiled TPC stride, so idle workers simply never start.
  3. Steal-ring fix for non-power-of-two counts: (tm->tid - 1) % TPC relies on unsigned wrap and is only correct for power-of-two counts; it becomes (tm->tid + hvm_tpc - 1) % hvm_tpc (verified at 3/5/7 threads).
  4. When the env var was set and valid, an extra - THREADS: N stats line is printed after MIPS, so tools can confirm the count took effect. When unset, output is unchanged.
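
The steal-ring change in item 3 can be checked in isolation. The sketch below assumes the thread id is an unsigned 32-bit value (an assumption about the runtime's types) and uses hypothetical function names; it only reproduces the two index expressions quoted above.

```c
#include <assert.h>
#include <stdint.h>

// Old form: for tid == 0 the subtraction wraps to 4294967295, and the
// result is n - 1 only when n divides 2^32, i.e. when n is a power of two.
uint32_t steal_victim_wrap(uint32_t tid, uint32_t n) {
  return (tid - 1) % n;
}

// New form: add n before subtracting so the value never wraps.
// Requires tid < n, which holds for worker ids 0 .. n-1.
uint32_t steal_victim_ring(uint32_t tid, uint32_t n) {
  assert(n >= 1 && tid < n);
  return (tid + n - 1) % n;
}
```

For worker 0 with 4 workers both forms give 3. With 3 workers the old form gives 0, so worker 0 would target itself, while the new form gives 2. For every tid >= 1 the two forms agree.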

Measurements

16 logical cores, WSL2, parallel_sum depth-18; ITRS = 5,898,185 at every count (same work, true scaling):

HVM_THREADS          TIME     MIPS
1                    0.16s     35.8
2                    0.11s     54.0
4                    0.07s     87.0
8                    0.05s    120.8
16                   0.06s    100.6
32 (clamps to 16)    0.05s    110.2

Re-verified on this branch with examples/sum_rec at 1/3/7/16 threads: identical ITRS at every count, THREADS: N readback present when set, output unchanged when unset.

cargo test results are identical before and after the patch. The two snapshot failures on my machine are a pre-existing rustc panic-format artifact (newer rustc prints thread IDs in panic messages) and are also present on pristine main.

Why this matters downstream

This unlocks a true parallel-efficiency curve for any benchmarking tool: run the same program with HVM_THREADS in {1,2,4,8,…} and compare TIME at identical ITRS. A companion Bend PR adds a --threads N convenience flag that just sets this variable on the spawned hvm process.
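
As a rough illustration of how such a curve could be derived, the sketch below computes speedup and parallel efficiency from MIPS readings. The function names are hypothetical, and the comparison is only meaningful when ITRS is identical across runs, as stated above.

```c
#include <assert.h>

// Speedup of an n-thread run over the 1-thread run, from throughput.
// With identical ITRS, the MIPS ratio equals the inverse TIME ratio.
double speedup(double mips_1, double mips_n) {
  assert(mips_1 > 0.0);
  return mips_n / mips_1;
}

// Parallel efficiency: speedup divided by thread count (1.0 is ideal).
double efficiency(double mips_1, double mips_n, int threads) {
  assert(threads >= 1);
  return speedup(mips_1, mips_n) / threads;
}
```

Using the MIPS figures reported under Measurements, speedup(35.8, 120.8) is about 3.37 at 8 threads, which gives an efficiency of about 0.42.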

The parallel C runtime fixed its worker count at compile time (TPC =
1 << TPC_L2), so parallel scaling could not be measured and cores could
not be capped on shared machines. This reads the HVM_THREADS environment
variable once at startup and clamps it to [1, TPC]; the thread pool,
spin barrier, idle halt check and work-stealing ring all use the active
count, while per-thread memory partitioning keeps the compiled TPC
stride, so idle workers simply never start.

Unset or invalid values keep the compiled TPC and byte-identical output.
When the variable is set, a '- THREADS: N' stats line is printed so
tools can confirm the count took effect. The steal ring uses
(tid + n - 1) % n instead of the (tid - 1) % TPC unsigned-wrap trick,
which only works for power-of-two counts.

Measured on 16 logical cores (parallel_sum, ITRS 5898185 at every
count): 1T 35.8 MIPS, 2T 54.0, 4T 87.0, 8T 120.8; HVM_THREADS=32
clamps to 16. Covers both hvm run-c and standalone gen-c output.