
Fix 504 timeout: Add K8s client caching + QPS increase + benchmarks#5430

Open
shovan-mondal wants to merge 3 commits into litmuschaos:master from shovan-mondal:singleton-client

Conversation

@shovan-mondal (Contributor)

Proposed changes

Fixes #5079 (504 Gateway Timeouts).

This PR addresses a critical concurrency anti-pattern in the subscriber's GetGenericK8sClient, where a new kubernetes.Clientset was initialized for every single request (sketched after the list below).

The Issue:

  1. Rate Limit Bypass: Creating a new client for every request hands each call a fresh, full rate-limiter token bucket, so the client-side rate limit is never enforced across requests and the subscriber can flood the API server.
  2. Socket Exhaustion: In production environments, this causes high TCP connection churn (new TLS handshake per request), leading to latency accumulation and eventually 504 Gateway Timeouts.
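
For illustration, here is a minimal sketch of the anti-pattern; the function shape is assumed from the description above, not copied from the upstream code:

```go
package k8s

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Anti-pattern: every call builds a fresh rest.Config and Clientset.
// Each request therefore pays a new TLS handshake and gets a brand-new
// rate limiter (full token bucket), so client-side limits never
// accumulate across requests.
func GetGenericK8sClient() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(config) // new transport per request
}
```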

The Fix:

  • Implemented sync.Once to enforce a singleton Kubernetes client, reusing the TCP connection (see the sketch after this list).
  • Optimized rest.Config with QPS=50 and Burst=100 to handle concurrent UI requests without client-side throttling.
  • Added client_perf_test.go to validate the performance improvement.
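
A minimal sketch of the fix, assuming an in-cluster config; the variable names are illustrative, not necessarily the exact ones in the diff:

```go
package k8s

import (
	"sync"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

var (
	clientOnce sync.Once
	clientset  *kubernetes.Clientset
	clientErr  error
)

// GetGenericK8sClient builds the Clientset exactly once; all callers
// then share one transport (a keep-alive TCP connection) and one
// rate-limiter token bucket.
func GetGenericK8sClient() (*kubernetes.Clientset, error) {
	clientOnce.Do(func() {
		var config *rest.Config
		config, clientErr = rest.InClusterConfig()
		if clientErr != nil {
			return
		}
		// Raise client-side throttling above client-go's defaults
		// (QPS=5, Burst=10) so bursts of concurrent UI requests are
		// not queued behind the limiter.
		config.QPS = 50
		config.Burst = 100
		clientset, clientErr = kubernetes.NewForConfig(config)
	})
	return clientset, clientErr
}
```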

Benchmark Results:
I ran a parallel benchmark simulating 20 concurrent requests.

| Scenario | Total Time | Avg/Request | Connections | Status |
| --- | --- | --- | --- | --- |
| Current Bug (New Client/Request) | 40.15ms | 2.00ms | Multiple (20) | Connection Waste |
| Naive Fix (Reuse + Low QPS) | 7.51s | 375.61ms | Single (1) | Throttled |
| This PR (Reuse + High QPS) | 38.55ms | 1.92ms | Single (1) | Optimal |

Types of changes

What types of changes does your code introduce to Litmus? Put an x in the boxes that apply

  • New feature (non-breaking change which adds functionality)
  • Bugfix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices applies)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed my commits so the DCO check passes.
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added necessary documentation (if appropriate)

Dependency

  • None

Special notes for your reviewer:

I have included a new test file pkg/k8s/parallel_benchmark_test.go which runs the benchmark scenarios shown above. You can verify the fix locally by running:

```sh
cd chaoscenter/subscriber/pkg/k8s
go test -v -run "TestParallelBenchmark_PRResults" -timeout 120s
```

This test proves that the singleton implementation matches the speed of the unthrottled code (~38ms) but maintains a single persistent connection, eliminating the TLS handshake overhead that causes the 504s in production.
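
For flavor, here is a simplified, self-contained sketch of how connection reuse can be asserted against a stub server; this is not the test file from the PR:

```go
package k8s

import (
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"testing"
)

// Counts TCP connections opened against a stub server while one shared
// http.Client issues 20 requests. With keep-alives, they all ride a
// single connection; building a new client per request would open 20.
func TestConnectionReuse(t *testing.T) {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }))
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := srv.Client() // one client, one shared transport
	for i := 0; i < 20; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			t.Fatal(err)
		}
		resp.Body.Close() // closing the body frees the conn for reuse
	}
	if n := atomic.LoadInt64(&opened); n != 1 {
		t.Fatalf("expected a single reused connection, got %d", n)
	}
}
```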

@amityt (Contributor) left a comment


Thanks for the changes, @shovan-mondal. This will definitely help with optimization and scaling. 🚀

@shovan-mondal (Contributor, Author)

Great suggestion, @amityt. You're completely right: different cluster sizes will need different limits. I will move these to environment variables while keeping 50 and 100 as the sensible defaults, and push the updated commit shortly!
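
A sketch of the configurable version; the variable names K8S_CLIENT_QPS and K8S_CLIENT_BURST are hypothetical placeholders, not necessarily what the pushed commit uses:

```go
package k8s

import (
	"os"
	"strconv"
)

// envInt reads an integer from the environment, falling back to a
// default when the variable is unset or malformed.
func envInt(key string, def int) int {
	if v, ok := os.LookupEnv(key); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// Inside the sync.Once block shown earlier:
//   config.QPS = float32(envInt("K8S_CLIENT_QPS", 50))  // hypothetical name
//   config.Burst = envInt("K8S_CLIENT_BURST", 100)      // hypothetical name
```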

@shovan-mondal (Contributor, Author)

@amityt I have moved the settings to environment variables and pushed the update. Ready for review!

@shovan-mondal shovan-mondal marked this pull request as ready for review February 19, 2026 12:07


Development

Successfully merging this pull request may close these issues.

Random timeouts for GraphQL getKubeNamespace/getKubeObject
