Read HVM_THREADS at startup: runtime worker count for the C runtime #441
Open
SiliconState wants to merge 1 commit into
The parallel C runtime fixed its worker count at compile time (TPC = 1 << TPC_L2), so parallel scaling could not be measured and cores could not be capped on shared machines. This reads the HVM_THREADS environment variable once at startup and clamps it to [1, TPC]. The thread pool, spin barrier, idle halt check and work-stealing ring all use the active count, while per-thread memory partitioning keeps the compiled TPC stride, so idle workers simply never start. Unset or invalid values keep the compiled TPC and byte-identical output. When the variable is set, a '- THREADS: N' stats line is printed so tools can confirm the count took effect.

The steal ring uses (tid + n - 1) % n instead of the (tid - 1) % TPC unsigned-wrap trick, which only works for power-of-two counts.

Measured on 16 logical cores (parallel_sum, ITRS 5898185 at every count): 1T 35.8 MIPS, 2T 54.0, 4T 87.0, 8T 120.8; HVM_THREADS=32 clamps to 16. Covers both hvm run-c and standalone gen-c output.
Problem
The parallel C runtime fixes its worker count at compile time (`TPC = 1 << TPC_L2`). There is no way to choose the thread count at run time, so:
- parallel scaling cannot be measured;
- cores cannot be capped on shared machines.

Design
An environment variable `HVM_THREADS`, read by the C runtime at startup. Minimal and fully backward-compatible: unset (or invalid) means exactly the current compile-time `TPC` behavior, and output stays byte-for-byte identical.
- `HVM_THREADS=N` → use `clamp(N, 1, TPC)` active workers. Capping at the compiled `TPC` keeps the existing fixed-size thread pool / per-thread net partitioning intact; a follow-up could allow growing past `TPC`.
- … (`hvm run`) ignores it.

Patch (one file: `src/hvm.c`, +37/−7)
Because `run.c` does `#include "hvm.c"` and `gen-c` embeds the `hvm.c` source text, this single file covers both `hvm run-c` and standalone gen-c programs.
- … `TPC` define: `hvm_tpc` (active worker count) and `hvm_tpc_from_env`; `hvm_tpc_init()` reads `getenv("HVM_THREADS")` via `strtol`, rejecting unset/empty/trailing-garbage/`< 1` values and clamping to `[1, TPC]`. Called at the top of `hvm_c()`, before any thread exists.
- The active count (`hvm_tpc`) replaces `TPC` at the four places that mean "worker count": the `sync_threads()` spin-barrier target, the `evaluator()` idle-counter init and halt check, and the `normalize()` spawn/join loops. Per-thread memory partitioning keeps the compiled `TPC` stride, so idle workers simply never start.
- The steal-ring index `(tm->tid - 1) % TPC` relies on unsigned wrap and is only correct for power-of-two counts; it becomes `(tm->tid + hvm_tpc - 1) % hvm_tpc` (verified at 3/5/7 threads).
- When the variable is set, a `- THREADS: N` stats line is printed after `MIPS`, so tools can confirm the count took effect. When unset, output is unchanged.

Measurements
16 logical cores, WSL2, `parallel_sum` depth-18; ITRS = 5,898,185 at every count (same work, true scaling):

| HVM_THREADS | MIPS  |
|-------------|-------|
| 1           | 35.8  |
| 2           | 54.0  |
| 4           | 87.0  |
| 8           | 120.8 |

`HVM_THREADS=32` clamps to 16.

Re-verified on this branch with `examples/sum_rec` at 1/3/7/16 threads: identical ITRS at every count, `THREADS: N` readback present when set, output unchanged when unset. `cargo test` passes identically before and after the patch. The two snapshot failures on my machine are a pre-existing rustc panic-format artifact (newer rustc prints thread IDs in panic messages) and are also present on pristine `main`.

Why this matters downstream
This unlocks a true parallel-efficiency curve for any benchmarking tool: run the same program with `HVM_THREADS` in {1,2,4,8,…} and compare TIME at identical ITRS. A companion Bend PR adds a `--threads N` convenience flag that just sets this variable on the spawned `hvm` process.