Skip to content

reverse_tunnel: proactively replace draining connections instead of GOAWAY#45519

Open
aakugan wants to merge 1 commit into
envoyproxy:mainfrom
aakugan:fix/use-shutdown-notice
Open

reverse_tunnel: proactively replace draining connections instead of GOAWAY#45519
aakugan wants to merge 1 commit into
envoyproxy:mainfrom
aakugan:fix/use-shutdown-notice

Conversation

@aakugan

@aakugan aakugan commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Commit Message

reverse_tunnel: proactively replace draining connections instead of GOAWAY

Description

Reworks the drain-aware HTTP connection manager used by reverse tunnels so that draining a connection no longer creates a window where the cluster has nowhere to route.

Why this change:
The HCM drains an HTTP/2 connection in two phases — shutdownNotice() (a graceful GOAWAY that stops new streams) followed by goAway() after drain_timeout. The problem for reverse tunnels: the moment a GOAWAY is observed, the responder's upstream nghttp2/oghttp2 adapter marks the connection Draining and refuses new streams on it. If no replacement tunnel exists yet, requests fail (503s) until a fresh tunnel is independently established.

What we do now:

  • DrainAwareServerConnection wraps the server codec and swallows the phase-1 shutdownNotice() so no premature GOAWAY is emitted. The only GOAWAY ever put on the wire is the single, guarded final goAway().
  • There are two drain triggers, both handled:
    1. Connection-level drain (max_connection_duration, max_requests, ...): the HCM drives it. We swallow shutdownNotice() and instead call reestablishConnection() to bring up a REPLACEMENT reverse tunnel while the old connection keeps serving
      during drain_timeout. The HCM's own goAway() then closes the old connection, by which point the cluster already has the new tunnel to route to.
    2. Listener-level drain (hot-restart, /drain_listeners, healthcheck fail): the HCM only checks drainClose() while encoding a response, so it never notices a drain on an idle connection. We poll drainClose() on a 100ms timer and, on detection, schedule the final goAway() after drain_timeout (no reconnect as the listener is dying), giving any in-flight request the same graceful window.
  • A bool guard ensures the final GOAWAY is sent at most once across both paths.
    Net effect: earlier, a GOAWAY made both client and server stop using the connection immediately, opening a connectivity gap. Now the server brings up a new tunnel first and only sends the GOAWAY once a replacement is ready, so the cluster transparently shifts to the new connection.

Future

Need to find some way of passing the info of draining to the cluster without preventing adapter from rejecting new stream creation which will allow us to notify the client consumers in advance to not "prefer" this tunnel.

Testing

Unit and Integ tests.

@aakugan aakugan marked this pull request as draft June 9, 2026 10:41
…OAWAY

Signed-off-by: aakugan <aakashganapathy2@gmail.com>
@aakugan aakugan force-pushed the fix/use-shutdown-notice branch from 9ee5bed to 423aa92 Compare June 9, 2026 13:12
@aakugan aakugan marked this pull request as ready for review June 9, 2026 14:13
@KBaichoo

Copy link
Copy Markdown
Contributor

/assign @botengyao

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants