
fix(graphql): prevent memory leak and deadlock in subscription resolvers #5397

Open
Sanchit2662 wants to merge 1 commit into litmuschaos:master from Sanchit2662:fix/subscription-memory-leak-deadlock


Sanchit2662 commented Jan 14, 2026

Summary

This PR fixes a critical concurrency issue in the ChaosCenter GraphQL subscription layer that could lead to unbounded memory growth and a process-wide deadlock under normal UI usage.

Specifically, GetInfraEvents subscriptions were leaking channels after client disconnects, and SendInfraEvent could block indefinitely while holding a shared mutex. Over time, this caused the GraphQL server to become unresponsive with no crash logs or clear error signals.
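The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the ChaosCenter code: `sendHoldingLock` and `lockIsStuck` are hypothetical names, and the unbuffered channel with no reader stands in for a subscription channel left behind after a client disconnect. It shows that while one goroutine is blocked on a send with the mutex held, no other caller can acquire that mutex. It assumes Go 1.18 or newer for `sync.Mutex.TryLock`.

```go
package main

import (
	"fmt"
	"sync"
)

// sendHoldingLock mimics the failure mode: it takes the shared mutex and then
// performs a blocking send on a channel that nobody is reading.
func sendHoldingLock(mu *sync.Mutex, ch chan<- string, locked chan<- struct{}) {
	mu.Lock()
	defer mu.Unlock()
	close(locked) // signal that the mutex is now held
	ch <- "event" // blocks until some goroutine receives
}

// lockIsStuck reports whether the mutex is unavailable to other callers
// while the sender is blocked on the abandoned channel.
func lockIsStuck() bool {
	var mu sync.Mutex
	abandoned := make(chan string) // unbuffered, no reader: a leaked subscriber
	locked := make(chan struct{})

	go sendHoldingLock(&mu, abandoned, locked)
	<-locked // wait until the sender owns the mutex

	stuck := !mu.TryLock() // any other resolver would block here
	if !stuck {
		mu.Unlock()
	}

	<-abandoned // receive once so the sender goroutine can exit
	return stuck
}

func main() {
	fmt.Println("mutex stuck behind blocked send:", lockIsStuck())
}
```

In the sketch the sender is eventually released by a receive. In the scenario this PR describes there is no reader left, so the mutex would stay held indefinitely.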

The fix ensures proper subscription cleanup, prevents blocking sends, and hardens related cleanup paths against concurrent map access.


Fix

1. Proper subscription cleanup on disconnect

Channels are now removed from the publisher slice when the subscription context is cancelled:

go func() {
    // Wait until the client disconnects or the subscription is cancelled.
    <-ctx.Done()
    data_store.Store.Mutex.Lock()
    // Remove this subscription's channel from the project's publisher slice.
    channels := data_store.Store.InfraEventPublish[projectID]
    for i, ch := range channels {
        if ch == infraEvent {
            data_store.Store.InfraEventPublish[projectID] =
                append(channels[:i], channels[i+1:]...)
            break
        }
    }
    data_store.Store.Mutex.Unlock()
}()

2. Non-blocking event delivery to prevent deadlocks

Event publishing no longer blocks on slow or disconnected subscribers:

for _, observer := range r.InfraEventPublish[infra.ProjectID] {
    select {
    case observer <- &newEvent:
    default:
        // skip slow/dead subscriber
    }
}

This ensures one stalled subscription cannot block the entire system.
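The behaviour of the `select` with a `default` branch can be checked with a small standalone sketch. The `publish` helper and its delivered/dropped counters are illustrative assumptions, not part of the PR; the event payload is simplified to a string.

```go
package main

import "fmt"

// publish offers event to every subscriber without blocking. It returns how
// many subscribers accepted the event and how many were skipped.
func publish(subscribers []chan string, event string) (delivered, dropped int) {
	for _, sub := range subscribers {
		select {
		case sub <- event:
			delivered++
		default:
			// Subscriber is not ready to receive: skip it rather than block.
			dropped++
		}
	}
	return delivered, dropped
}

func main() {
	healthy := make(chan string, 1) // buffered, has room for one event
	stalled := make(chan string)    // unbuffered with no reader

	delivered, dropped := publish([]chan string{healthy, stalled}, "infra-updated")
	fmt.Println(delivered, dropped, <-healthy)
}
```

One trade-off worth noting: with a non-blocking send, an event is dropped whenever the receiver is not ready at that instant. On an unbuffered channel that means the subscriber must already be waiting in a receive, so a small buffer gives live subscribers some slack. Whether the PR's channels are buffered is not stated here.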


3. Thread-safe cleanup in related subscriptions

Cleanup paths in GetPodLog, GetKubeObject, and GetKubeNamespace now properly guard map deletes with the shared mutex, preventing concurrent map access panics.


Impact

  • Memory leak eliminated: subscription channels are no longer leaked.
  • Deadlock prevented: event publishing cannot block while holding the mutex.
  • Improved resilience: slow or disconnected clients degrade gracefully.
  • Stability improved: prevents rare but severe production outages in ChaosCenter.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)

Checklist

- Add proper cleanup in GetInfraEvents to remove channels on disconnect
- Use non-blocking sends in SendInfraEvent to prevent mutex deadlock
- Add mutex protection to map deletes in GetPodLog, GetKubeObject, GetKubeNamespace

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>

Sanchit2662 commented Jan 15, 2026

Hi @PriteshKiri, @amityt, @SarthakJain26,
I’ve updated the PR to address the issue and adjusted the implementation accordingly. This helps avoid a potential memory leak and deadlock in the GraphQL subscription flow by improving how subscriptions are cleaned up and how events are delivered.

Whenever you get a chance, I’d really appreciate a review. Thanks!
