internal/bytealg: use NEON for compare_arm64.s large-input path #79800

Open
mauri870 wants to merge 1 commit into golang:master from mauri870:neon-compare

@mauri870 (Member) commented Jun 3, 2026

Replace the scalar chunk loop with a NEON 64-byte/iter loop.

For inputs >= 64 bytes, each iteration loads 64 bytes per side with VLD1.P into four 128-bit registers and compares them with VCMEQ. With the .D2 arrangement each 8-byte lane is compared as a unit, so any byte difference zeroes the whole lane. The results are chained with VAND and reduced with VUMINV to a single byte, which CBZ checks for a mismatch.
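
As a rough illustration of one loop iteration, here is a minimal Go model of the compare, AND and min-reduce steps. It is a sketch, not the assembly in this change: `lanesEqual` and `blockEqual64` are hypothetical names, and the vector registers are modelled as pairs of `uint64` lanes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// lanesEqual models VCMEQ with the .D2 arrangement on one 16-byte vector:
// each 8-byte lane becomes all ones if the lanes match, all zeros otherwise.
// (Hypothetical helper; the real code uses NEON registers.)
func lanesEqual(a, b []byte) [2]uint64 {
	var m [2]uint64
	for i := 0; i < 2; i++ {
		x := binary.LittleEndian.Uint64(a[8*i:])
		y := binary.LittleEndian.Uint64(b[8*i:])
		if x == y {
			m[i] = ^uint64(0)
		}
	}
	return m
}

// blockEqual64 models one 64-byte iteration: four vector compares,
// a VAND chain, and a VUMINV-style minimum over the 16 result bytes.
func blockEqual64(a, b []byte) bool {
	acc := [2]uint64{^uint64(0), ^uint64(0)}
	for v := 0; v < 4; v++ {
		m := lanesEqual(a[16*v:16*v+16], b[16*v:16*v+16])
		acc[0] &= m[0]
		acc[1] &= m[1]
	}
	lowest := byte(0xFF)
	for _, lane := range acc {
		for s := 0; s < 64; s += 8 {
			if c := byte(lane >> s); c < lowest {
				lowest = c
			}
		}
	}
	// A zero minimum means at least one lane differed (the CBZ branch).
	return lowest != 0
}

func main() {
	a := make([]byte, 64)
	b := make([]byte, 64)
	fmt.Println(blockEqual64(a, b)) // true
	b[37] = 1
	fmt.Println(blockEqual64(a, b)) // false
}
```

The model only answers "is this block equal?"; it deliberately does not say where the difference is, which is why a separate step is needed to locate it.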

On a mismatch, both pointers are rewound by 64 bytes and the existing scalar chunk16 path locates the first differing byte pair, which REV+CMP converts to a lexicographic result. This overhead is paid at most once, because the function returns as soon as the mismatch is resolved.
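
The REV+CMP step can be sketched in Go on a single 8-byte chunk. This is an illustrative model under the assumption of little-endian loads, with a hypothetical helper name `cmpChunk8`; it is not the code from this change.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// cmpChunk8 models the scalar step on one 8-byte chunk. A little-endian
// load followed by a byte reversal (the REV instruction) moves the first
// byte in memory into the most significant position, so one unsigned
// compare orders the chunks lexicographically.
func cmpChunk8(a, b []byte) int {
	x := bits.ReverseBytes64(binary.LittleEndian.Uint64(a))
	y := bits.ReverseBytes64(binary.LittleEndian.Uint64(b))
	switch {
	case x < y:
		return -1
	case x > y:
		return 1
	}
	return 0
}

func main() {
	a := []byte("abcdefgh")
	b := []byte("abcdefgz")
	fmt.Println(cmpChunk8(a, b), cmpChunk8(b, a), cmpChunk8(a, a)) // -1 1 0
}
```

Without the reversal, the last byte in memory would be the most significant one and the comparison would order the chunks by their final bytes first.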

For inputs < 64 bytes the scalar chunk16 and byte-level tail are unchanged.
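
The overall control flow can be sketched in Go as follows. This is a simplified model rather than the assembly: `compareSketch` is a hypothetical name, `bytes.Equal` stands in for the NEON block check, and a plain byte loop stands in for the chunk16 path and the byte-level tail.

```go
package main

import (
	"bytes"
	"fmt"
)

const block = 64 // bytes consumed per large-path iteration

// compareSketch mirrors the structure described above.
func compareSketch(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	// Large-input path: skip whole 64-byte blocks while they are equal.
	// On a mismatch i is left at the start of the block, which plays the
	// role of rewinding the pointers by 64 bytes.
	for n-i >= block && bytes.Equal(a[i:i+block], b[i:i+block]) {
		i += block
	}
	// Small path (stand-in for chunk16 and the tail): locate the first
	// differing byte, if any, and order by it.
	for ; i < n; i++ {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	// Common prefix is equal: the shorter input sorts first.
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return 0
}

func main() {
	a := bytes.Repeat([]byte{7}, 200)
	b := bytes.Repeat([]byte{7}, 200)
	b[130] = 9
	fmt.Println(compareSketch(a, b) == bytes.Compare(a, b))                     // true
	fmt.Println(compareSketch(a[:40], b[:40]) == bytes.Compare(a[:40], b[:40])) // true
}
```

In this model the small path scans at most one mismatching block before returning, which corresponds to the fallback cost being paid at most once.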

 goos: darwin
 goarch: arm64
 pkg: bytes
 cpu: Apple M3 Pro
                                          │   old.txt   │                new.txt                │
                                          │   sec/op    │   sec/op     vs base                  │
 CompareBytesBigUnaligned/offset=1-11       26.55µ ± 1%   19.21µ ± 1%  -27.63% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=2-11       26.83µ ± 3%   19.11µ ± 1%  -28.78% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=3-11       26.59µ ± 1%   19.28µ ± 2%  -27.50% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=4-11       26.49µ ± 1%   19.11µ ± 1%  -27.84% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=5-11       27.04µ ± 3%   19.05µ ± 1%  -29.54% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=6-11       27.12µ ± 1%   18.92µ ± 1%  -30.24% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=7-11       27.00µ ± 0%   18.90µ ± 0%  -30.01% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=0-11   26.56µ ± 1%   18.01µ ± 1%  -32.20% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=1-11   27.27µ ± 1%   19.43µ ± 0%  -28.77% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=2-11   26.83µ ± 2%   18.68µ ± 1%  -30.38% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=3-11   27.24µ ± 0%   19.45µ ± 0%  -28.62% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=4-11   27.07µ ± 1%   18.67µ ± 1%  -31.03% (p=0.000 n=7+10)
 geomean                                    26.88µ        10.82µ       -29.39%

                                          │   old.txt    │                new.txt                 │
                                          │     B/s      │     B/s       vs base                  │
 CompareBytesBigUnaligned/offset=1-11       36.78Gi ± 1%   50.82Gi ± 1%  +38.18% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=2-11       36.40Gi ± 3%   51.11Gi ± 1%  +40.41% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=3-11       36.73Gi ± 1%   50.66Gi ± 2%  +37.93% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=4-11       36.87Gi ± 1%   51.09Gi ± 1%  +38.59% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=5-11       36.12Gi ± 3%   51.27Gi ± 1%  +41.92% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=6-11       36.00Gi ± 1%   51.61Gi ± 1%  +43.35% (p=0.000 n=10)
 CompareBytesBigUnaligned/offset=7-11       36.17Gi ± 0%   51.68Gi ± 0%  +42.88% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=0-11   36.77Gi ± 1%   54.23Gi ± 1%  +47.48% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=1-11   35.81Gi ± 1%   50.27Gi ± 0%  +40.39% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=2-11   36.40Gi ± 2%   52.29Gi ± 1%  +43.64% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=3-11   35.85Gi ± 0%   50.22Gi ± 0%  +40.09% (p=0.000 n=10)
 CompareBytesBigBothUnaligned/offset=4-11   36.07Gi ± 1%   52.30Gi ± 1%  +45.00% (p=0.000 n=7+10)
 geomean                                    36.33Gi        51.45Gi       +41.63%
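
As a consistency check on the two tables, the time and throughput deltas should be reciprocal for a fixed input size. A small Go sketch, with the hypothetical helper `throughputChange`, shows that the reported -29.39% geomean time change corresponds to a throughput gain of roughly +41.6%, in line with the B/s table.

```go
package main

import "fmt"

// throughputChange converts a relative change in time per operation
// (for example -0.2939 for -29.39%) into the matching relative change
// in throughput, assuming the input size is the same in both runs.
func throughputChange(timeChange float64) float64 {
	return 1/(1+timeChange) - 1
}

func main() {
	fmt.Printf("%+.2f%%\n", 100*throughputChange(-0.2939))
}
```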

Replace the scalar chunk loop with a NEON 64-byte/iter loop.

For inputs >= 64 bytes, each iteration loads 64 bytes per side with
VLD1.P into four 128-bit registers, compares with VCMEQ (.D2 treats
each 8-byte lane as a unit — any byte difference zeroes the whole lane),
chains the results with VAND, and reduces with VUMINV to a single byte
checked via CBZ for any mismatch.

On a mismatch the two pointers are rewound by 64 bytes and the existing
scalar chunk16 path locates the first differing byte-pair, which REV+CMP
converts to a lexicographic result. This overhead is paid at most once.

For inputs < 64 bytes the scalar chunk16 and byte-level tail are
unchanged. Structure now matches equal_arm64.s directly.

@gopherbot
Contributor

This PR (HEAD: a9950ce) has been imported to Gerrit for code review.

Please visit Gerrit at https://go-review.googlesource.com/c/go/+/786560.

Important tips:

  • Don't comment on this PR. All discussion takes place in Gerrit.
  • You need a Gmail or other Google account to log in to Gerrit.
  • To change your code in response to feedback:
    • Push a new commit to the branch used by your GitHub PR.
    • A new "patch set" will then appear in Gerrit.
    • Respond to each comment by marking as Done in Gerrit if implemented as suggested. You can alternatively write a reply.
    • Critical: you must click the blue Reply button near the top to publish your Gerrit responses.
    • Multiple commits in the PR will be squashed by GerritBot.
  • The title and description of the GitHub PR are used to construct the final commit message.
    • Edit these as needed via the GitHub web interface (not via Gerrit or git).
    • You should word wrap the PR description at ~76 characters unless you need longer lines (e.g., for tables or URLs).
  • See the Sending a change via GitHub and Reviews sections of the Contribution Guide as well as the FAQ for details.

@gopherbot
Contributor

Message from Gopher Robot:

Patch Set 1:

(1 comment)


Please don’t reply on this GitHub thread. Visit golang.org/cl/786560.
After addressing review feedback, remember to publish your drafts!

@gopherbot
Contributor

Message from Mauri de Souza Meneguzzo:

Patch Set 3: Commit-Queue+1

(1 comment)


@gopherbot
Contributor

Message from golang-scoped@luci-project-accounts.iam.gserviceaccount.com:

Patch Set 3:

Dry run: CV is trying the patch.

Bot data: {"action":"start","triggered_at":"2026-06-06T03:01:47Z","revision":"da47cc8cd84fa89e0f686aae5a6e9415b0abe880"}


@gopherbot
Contributor

Message from Mauri de Souza Meneguzzo:

Patch Set 3: -Commit-Queue

(Performed by <GERRIT_ACCOUNT_60063> on behalf of <GERRIT_ACCOUNT_63983>)


@gopherbot
Contributor

Message from golang-scoped@luci-project-accounts.iam.gserviceaccount.com:

Patch Set 3:

This CL has passed the run


@gopherbot
Contributor

Message from golang-scoped@luci-project-accounts.iam.gserviceaccount.com:

Patch Set 3: LUCI-TryBot-Result+1


