
Fix 504 timeout: Add K8s client caching + QPS increase + benchmarks#5430

Open
shovan-mondal wants to merge 3 commits into litmuschaos:master from shovan-mondal:singleton-client

Conversation

@shovan-mondal (Contributor)

Proposed changes

Fixes #5079 (504 Gateway Timeouts).

This PR addresses a critical concurrency anti-pattern in the subscriber's GetGenericK8sClient, where a new kubernetes.Clientset was initialized for every single request (sketched after the list below).

The Issue:

  1. Rate Limit Bypass: Creating a new client for every request hands each call a fresh, full rate-limiter token bucket, so the client-side rate limit is never enforced across requests and the subscriber can flood the API server.
  2. Socket Exhaustion: In production environments, this causes high TCP connection churn (new TLS handshake per request), leading to latency accumulation and eventually 504 Gateway Timeouts.
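
For illustration, here is a minimal sketch of the anti-pattern; the function shape is assumed from the description above, not copied from the upstream code:

```go
package k8s

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Anti-pattern: every call builds a fresh rest.Config and Clientset.
// Each request therefore pays a new TLS handshake and gets a brand-new
// rate limiter (full token bucket), so client-side limits never
// accumulate across requests.
func GetGenericK8sClient() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(config) // new transport per request
}
```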

The Fix:

  • Implemented sync.Once to enforce a singleton Kubernetes client, reusing the TCP connection (see the sketch after this list).
  • Optimized rest.Config with QPS=50 and Burst=100 to handle concurrent UI requests without client-side throttling.
  • Added client_perf_test.go to validate the performance improvement.
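
A minimal sketch of the fix, assuming an in-cluster config; the variable names are illustrative, not necessarily the exact ones in the diff:

```go
package k8s

import (
	"sync"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

var (
	clientOnce sync.Once
	clientset  *kubernetes.Clientset
	clientErr  error
)

// GetGenericK8sClient builds the Clientset exactly once; all callers
// then share one transport (a keep-alive TCP connection) and one
// rate-limiter token bucket.
func GetGenericK8sClient() (*kubernetes.Clientset, error) {
	clientOnce.Do(func() {
		var config *rest.Config
		config, clientErr = rest.InClusterConfig()
		if clientErr != nil {
			return
		}
		// Raise client-side throttling above client-go's defaults
		// (QPS=5, Burst=10) so bursts of concurrent UI requests are
		// not queued behind the limiter.
		config.QPS = 50
		config.Burst = 100
		clientset, clientErr = kubernetes.NewForConfig(config)
	})
	return clientset, clientErr
}
```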

Benchmark Results:
I ran a parallel benchmark simulating 20 concurrent requests.

| Scenario | Total Time | Avg/Request | Connections | Status |
| --- | --- | --- | --- | --- |
| Current Bug (New Client/Request) | 40.15ms | 2.00ms | Multiple (20) | Connection Waste |
| Naive Fix (Reuse + Low QPS) | 7.51s | 375.61ms | Single (1) | Throttled |
| This PR (Reuse + High QPS) | 38.55ms | 1.92ms | Single (1) | Optimal |

Types of changes

What types of changes does your code introduce to Litmus? Put an x in the boxes that apply

  • New feature (non-breaking change which adds functionality)
  • Bugfix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices applies)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed my commits so the DCO check passes.
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added necessary documentation (if appropriate)

Dependency

  • None

Special notes for your reviewer:

I have included a new test file pkg/k8s/parallel_benchmark_test.go which runs the benchmark scenarios shown above. You can verify the fix locally by running:

```sh
cd chaoscenter/subscriber/pkg/k8s
go test -v -run "TestParallelBenchmark_PRResults" -timeout 120s
```

This test proves that the singleton implementation matches the speed of the unthrottled code (~38ms) but maintains a single persistent connection, eliminating the TLS handshake overhead that causes the 504s in production.
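
For flavor, here is a simplified, self-contained sketch of how connection reuse can be asserted against a stub server; this is not the test file from the PR:

```go
package k8s

import (
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"testing"
)

// Counts TCP connections opened against a stub server while one shared
// http.Client issues 20 requests. With keep-alives, they all ride a
// single connection; building a new client per request would open 20.
func TestConnectionReuse(t *testing.T) {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }))
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := srv.Client() // one client, one shared transport
	for i := 0; i < 20; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			t.Fatal(err)
		}
		resp.Body.Close() // closing the body frees the conn for reuse
	}
	if n := atomic.LoadInt64(&opened); n != 1 {
		t.Fatalf("expected a single reused connection, got %d", n)
	}
}
```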

@amityt (Contributor) left a comment


Thanks for the changes, @shovan-mondal. This will definitely help with optimization and scaling. 🚀

@shovan-mondal (Contributor, Author)

Great suggestion, @amityt. You're completely right: different cluster sizes will need different limits. I will move these to environment variables while keeping 50 and 100 as the sensible defaults, and push the updated commit shortly!
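
A sketch of the configurable version; the variable names K8S_CLIENT_QPS and K8S_CLIENT_BURST are hypothetical placeholders, not necessarily what the pushed commit uses:

```go
package k8s

import (
	"os"
	"strconv"
)

// envInt reads an integer from the environment, falling back to a
// default when the variable is unset or malformed.
func envInt(key string, def int) int {
	if v, ok := os.LookupEnv(key); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// Inside the sync.Once block shown earlier:
//   config.QPS = float32(envInt("K8S_CLIENT_QPS", 50))  // hypothetical name
//   config.Burst = envInt("K8S_CLIENT_BURST", 100)      // hypothetical name
```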

@shovan-mondal (Contributor, Author)

@amityt I have moved the settings to environment variables and pushed the update. Ready for review!

@shovan-mondal shovan-mondal marked this pull request as ready for review February 19, 2026 12:07


Development

Successfully merging this pull request may close these issues.

Random timeouts for GraphQL getKubeNamespace/getKubeObject
