internal/bytealg: use NEON for compare_arm64.s large-input path by mauri870 · Pull Request #79800 · golang/go

mauri870 · 2026-06-03T12:31:30Z

Replace the scalar chunk loop with a NEON 64-byte/iter loop.

For inputs >= 64 bytes, each iteration loads 64 bytes per side with VLD1.P into four 128-bit registers and compares them with VCMEQ. The .D2 arrangement treats each 8-byte lane as a unit, so any byte difference zeroes the whole lane. The four results are chained with VAND and reduced with VUMINV to a single byte, which CBZ tests to detect a mismatch.
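
As an illustration of the check described above, here is a portable Go model of one loop iteration. It is a sketch, not the assembly from this change: `blockEqual64` and `minByte` are hypothetical names, and the model folds all eight 8-byte lanes into a single 64-bit mask, whereas the NEON code keeps two lanes per 128-bit register.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blockEqual64 reports whether the first 64 bytes of a and b are equal.
// It mimics the vector steps with plain integer operations.
func blockEqual64(a, b []byte) bool {
	_, _ = a[63], b[63] // both slices must hold at least 64 bytes
	all := ^uint64(0)   // running AND of the lane masks (the VAND chain)
	for off := 0; off < 64; off += 8 {
		x := binary.LittleEndian.Uint64(a[off:])
		y := binary.LittleEndian.Uint64(b[off:])
		// Lane compare: all ones when the 8-byte lanes match,
		// all zeros when any byte in the lane differs.
		var mask uint64
		if x == y {
			mask = ^uint64(0)
		}
		all &= mask
	}
	// Reduction plus zero test: the minimum byte of the combined
	// mask is zero exactly when at least one lane differed.
	return minByte(all) != 0
}

// minByte returns the smallest of the eight bytes in v.
func minByte(v uint64) byte {
	m := byte(0xff)
	for i := 0; i < 8; i++ {
		if c := byte(v >> (8 * i)); c < m {
			m = c
		}
	}
	return m
}

func main() {
	a := make([]byte, 64)
	b := make([]byte, 64)
	fmt.Println(blockEqual64(a, b)) // true
	b[37] = 1
	fmt.Println(blockEqual64(a, b)) // false
}
```

The model only answers "equal or not" for a block, which is all the wide loop needs; ordering is decided later by the scalar path.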

On a mismatch, both pointers are rewound by 64 bytes and the existing scalar chunk16 path locates the first differing byte-pair, which REV+CMP converts to a lexicographic result. The loop exits on the first mismatch, so this overhead is paid at most once per call.

For inputs < 64 bytes the scalar chunk16 and byte-level tail are unchanged.

 goos: darwin
 goarch: arm64
 pkg: bytes
 cpu: Apple M3 Pro
                                          │   old.txt   │                new.txt                │
                                          │   sec/op    │   sec/op     vs base                  │
 CompareBytesBigUnaligned/offset=1-11       26.55µ ± 1%   19.21µ ± 1%  -27.63% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=2-11       26.83µ ± 3%   19.11µ ± 1%  -28.78% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=3-11       26.59µ ± 1%   19.28µ ± 2%  -27.50% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=4-11       26.49µ ± 1%   19.11µ ± 1%  -27.84% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=5-11       27.04µ ± 3%   19.05µ ± 1%  -29.54% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=6-11       27.12µ ± 1%   18.92µ ± 1%  -30.24% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=7-11       27.00µ ± 0%   18.90µ ± 0%  -30.01% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=0-11   26.56µ ± 1%   18.01µ ± 1%  -32.20% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=1-11   27.27µ ± 1%   19.43µ ± 0%  -28.77% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=2-11   26.83µ ± 2%   18.68µ ± 1%  -30.38% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=3-11   27.24µ ± 0%   19.45µ ± 0%  -28.62% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=4-11   27.07µ ± 1%   18.67µ ± 1%  -31.03% (p=0.000 n=7+10)
 geomean                                    26.88µ        10.82µ       -29.39%

                                          │   old.txt    │                new.txt                 │
                                          │     B/s      │     B/s       vs base                  │
 CompareBytesBigUnaligned/offset=1-11       36.78Gi ± 1%   50.82Gi ± 1%  +38.18% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=2-11       36.40Gi ± 3%   51.11Gi ± 1%  +40.41% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=3-11       36.73Gi ± 1%   50.66Gi ± 2%  +37.93% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=4-11       36.87Gi ± 1%   51.09Gi ± 1%  +38.59% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=5-11       36.12Gi ± 3%   51.27Gi ± 1%  +41.92% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=6-11       36.00Gi ± 1%   51.61Gi ± 1%  +43.35% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=7-11       36.17Gi ± 0%   51.68Gi ± 0%  +42.88% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=0-11   36.77Gi ± 1%   54.23Gi ± 1%  +47.48% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=1-11   35.81Gi ± 1%   50.27Gi ± 0%  +40.39% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=2-11   36.40Gi ± 2%   52.29Gi ± 1%  +43.64% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=3-11   35.85Gi ± 0%   50.22Gi ± 0%  +40.09% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=4-11   36.07Gi ± 1%   52.30Gi ± 1%  +45.00% (p=0.000 n=7+10)
 geomean                                    36.33Gi        51.45Gi       +41.63%
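
The two tables describe the same runs, so the percentage columns can be cross-checked: for a fixed input size, throughput is inversely proportional to time per operation. A small Go sketch of that conversion follows; `throughputGain` is a hypothetical helper written for this note and is not part of benchstat.

```go
package main

import "fmt"

// throughputGain converts a relative change in time per operation
// (for example -0.2939 for -29.39%) into the matching relative
// change in bytes per second, assuming the input size is fixed.
func throughputGain(timeDelta float64) float64 {
	return 1/(1+timeDelta) - 1
}

func main() {
	// Geomean sec/op delta taken from the first table.
	fmt.Printf("%+.2f%%\n", 100*throughputGain(-0.2939))
}
```

The result lands within a few hundredths of a percent of the +41.63% B/s geomean; the small gap comes from rounding in the printed percentages.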

VCMEQ operates on .D2, so each 8-byte lane is compared as a unit. Structure now matches equal_arm64.s directly.

gopherbot · 2026-06-03T12:44:45Z

This PR (HEAD: a9950ce) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/786560.

Important tips:

- Don't comment on this PR. All discussion takes place in Gerrit.
- You need a Gmail or other Google account to log in to Gerrit.
- To change your code in response to feedback:
  - Push a new commit to the branch used by your GitHub PR.
  - A new "patch set" will then appear in Gerrit.
  - Respond to each comment by marking as Done in Gerrit if implemented as suggested. You can alternatively write a reply.
  - Critical: you must click the blue Reply button near the top to publish your Gerrit responses.
  - Multiple commits in the PR will be squashed by GerritBot.
- The title and description of the GitHub PR are used to construct the final commit message.
  - Edit these as needed via the GitHub web interface (not via Gerrit or git).
  - You should word wrap the PR description at ~76 characters unless you need longer lines (e.g., for tables or URLs).

See the Sending a change via GitHub and Reviews sections of the Contribution Guide as well as the FAQ for details.

gopherbot · 2026-06-03T13:19:17Z

Message from Gopher Robot:

Patch Set 1:

(1 comment)