internal/bytealg: use NEON for compare_arm64.s large-input path #79800
mauri870 wants to merge 1 commit
Replace the scalar chunk loop with a NEON 64-byte/iter loop.
For inputs >= 64 bytes, each iteration loads 64 bytes per side with
VLD1.P into four 128-bit registers, compares with VCMEQ (.D2 treats
each 8-byte lane as a unit, so any byte difference zeroes the whole
lane), chains the results with VAND, and reduces with VUMINV to a
single byte. CBZ branches to the mismatch path when that byte is zero.
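
For readers who do not read arm64 assembly, the per-block check can be modelled in plain Go. This is only a scalar sketch of the idea, not the code in this CL: the names laneEq and block64Equal are made up for illustration, and the eight lane results are folded into one 64-bit accumulator instead of four 128-bit registers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// laneEq models VCMEQ on one 64-bit lane: all ones when the lanes are
// equal, zero when any byte differs. (Hypothetical helper.)
func laneEq(x, y uint64) uint64 {
	if x == y {
		return ^uint64(0)
	}
	return 0
}

// block64Equal reports whether the first 64 bytes of a and b match.
// It follows the shape of the vector loop: eight lane compares, an AND
// chain, and a single final test. (Hypothetical helper.)
func block64Equal(a, b []byte) bool {
	_, _ = a[63], b[63] // both slices must hold at least 64 bytes
	acc := ^uint64(0)
	for i := 0; i < 64; i += 8 {
		x := binary.LittleEndian.Uint64(a[i:])
		y := binary.LittleEndian.Uint64(b[i:])
		acc &= laneEq(x, y) // stands in for the VAND chain
	}
	// acc is either all ones or zero here, so one zero test stands in
	// for the VUMINV reduction followed by CBZ.
	return acc != 0
}

func main() {
	a := make([]byte, 64)
	b := make([]byte, 64)
	fmt.Println(block64Equal(a, b)) // true
	b[37] = 1
	fmt.Println(block64Equal(a, b)) // false
}
```

The model only answers "does this block differ anywhere"; it deliberately does not say where, which is why a separate step is needed on a mismatch.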
On a mismatch the two pointers are rewound by 64 bytes and the existing
scalar chunk16 path locates the first differing byte-pair, which REV+CMP
converts to a lexicographic result. This overhead is paid at most once.
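
The REV+CMP step can be sketched in Go as follows. This is an illustrative model under assumptions, not the assembly itself: cmp8 and compareBlock are hypothetical names, and the sketch walks the block 8 bytes at a time rather than reproducing the exact chunk16 register layout.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// cmp8 compares two 8-byte chunks lexicographically and returns
// -1, 0 or +1. (Hypothetical helper.)
func cmp8(a, b []byte) int {
	// A little-endian load puts the byte at the lowest address in the
	// least significant position.
	x := binary.LittleEndian.Uint64(a)
	y := binary.LittleEndian.Uint64(b)
	if x == y {
		return 0
	}
	// Byte-reversing (REV) moves the lowest-address byte to the most
	// significant position, so an unsigned compare (CMP) orders the
	// chunks the same way a byte-by-byte comparison would.
	x, y = bits.ReverseBytes64(x), bits.ReverseBytes64(y)
	if x < y {
		return -1
	}
	return 1
}

// compareBlock rescans a block 8 bytes at a time and returns the
// result for the first differing chunk. (Hypothetical helper; any
// tail shorter than 8 bytes is ignored in this sketch.)
func compareBlock(a, b []byte) int {
	for i := 0; i+8 <= len(a) && i+8 <= len(b); i += 8 {
		if r := cmp8(a[i:i+8], b[i:i+8]); r != 0 {
			return r
		}
	}
	return 0
}

func main() {
	a := []byte{1, 2, 0, 0, 0, 0, 0, 0}
	b := []byte{2, 1, 0, 0, 0, 0, 0, 0}
	// Without the byte reversal the raw little-endian values would
	// order these two chunks the wrong way round.
	fmt.Println(compareBlock(a, b)) // -1
}
```

Because the rescan only runs for the one block that is already known to differ, its cost does not scale with input length.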
For inputs < 64 bytes the scalar chunk16 and byte-level tail are
unchanged. Structure now matches equal_arm64.s directly.
goos: darwin
goarch: arm64
pkg: bytes
cpu: Apple M3 Pro
                                         │  old.txt   │               new.txt               │
                                         │   sec/op   │   sec/op     vs base                │
CompareBytesBigUnaligned/offset=1-11      26.55µ ± 1%  19.21µ ± 1%  -27.63% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=2-11      26.83µ ± 3%  19.11µ ± 1%  -28.78% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=3-11      26.59µ ± 1%  19.28µ ± 2%  -27.50% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=4-11      26.49µ ± 1%  19.11µ ± 1%  -27.84% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=5-11      27.04µ ± 3%  19.05µ ± 1%  -29.54% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=6-11      27.12µ ± 1%  18.92µ ± 1%  -30.24% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=7-11      27.00µ ± 0%  18.90µ ± 0%  -30.01% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=0-11  26.56µ ± 1%  18.01µ ± 1%  -32.20% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=1-11  27.27µ ± 1%  19.43µ ± 0%  -28.77% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=2-11  26.83µ ± 2%  18.68µ ± 1%  -30.38% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=3-11  27.24µ ± 0%  19.45µ ± 0%  -28.62% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=4-11  27.07µ ± 1%  18.67µ ± 1%  -31.03% (p=0.000 n=7+10)
geomean 26.88µ 10.82µ -29.39%
                                         │   old.txt   │               new.txt                │
                                         │     B/s     │    B/s        vs base                │
CompareBytesBigUnaligned/offset=1-11      36.78Gi ± 1%  50.82Gi ± 1%  +38.18% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=2-11      36.40Gi ± 3%  51.11Gi ± 1%  +40.41% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=3-11      36.73Gi ± 1%  50.66Gi ± 2%  +37.93% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=4-11      36.87Gi ± 1%  51.09Gi ± 1%  +38.59% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=5-11      36.12Gi ± 3%  51.27Gi ± 1%  +41.92% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=6-11      36.00Gi ± 1%  51.61Gi ± 1%  +43.35% (p=0.000 n=10)
CompareBytesBigUnaligned/offset=7-11      36.17Gi ± 0%  51.68Gi ± 0%  +42.88% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=0-11  36.77Gi ± 1%  54.23Gi ± 1%  +47.48% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=1-11  35.81Gi ± 1%  50.27Gi ± 0%  +40.39% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=2-11  36.40Gi ± 2%  52.29Gi ± 1%  +43.64% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=3-11  35.85Gi ± 0%  50.22Gi ± 0%  +40.09% (p=0.000 n=10)
CompareBytesBigBothUnaligned/offset=4-11  36.07Gi ± 1%  52.30Gi ± 1%  +45.00% (p=0.000 n=7+10)
geomean                                   36.33Gi       51.45Gi       +41.63%
This PR (HEAD: a9950ce) has been imported to Gerrit for code review. Please visit Gerrit at https://go-review.googlesource.com/c/go/+/786560.
Message from Gopher Robot: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/786560.
Message from Mauri de Souza Meneguzzo: Patch Set 3: Commit-Queue+1 (1 comment)
Message from golang-scoped@luci-project-accounts.iam.gserviceaccount.com: Patch Set 3: Dry run: CV is trying the patch. Bot data: {"action":"start","triggered_at":"2026-06-06T03:01:47Z","revision":"da47cc8cd84fa89e0f686aae5a6e9415b0abe880"}
Message from Mauri de Souza Meneguzzo: Patch Set 3: -Commit-Queue (Performed by <GERRIT_ACCOUNT_60063> on behalf of <GERRIT_ACCOUNT_63983>)
Message from golang-scoped@luci-project-accounts.iam.gserviceaccount.com: Patch Set 3: This CL has passed the run
Message from golang-scoped@luci-project-accounts.iam.gserviceaccount.com: Patch Set 3: LUCI-TryBot-Result+1