Fix - NotAuthenticated errors after service disruptions #3869
kolluria wants to merge 2 commits into kubernetes-sigs:master
/cc @divyenpatel @deepakkinni @xing-yang
FAILED --- Jenkins Build #918
SUCCESS --- Jenkins Build #821
FAILED --- Jenkins Build #919
FAILED --- Jenkins Build #920
…rnetes-sigs#3787)

Implements retry logic in Connect() method to handle WCP password rotation scenarios where CSI containers restart during credential updates.

- Add retry loop with exponential backoff (2 attempts, 3min delay)
- Reload vCenter config from secret before each retry attempt
Root cause:

- cleanupVCClient() is called after errors to logout and cleanup
- If we nullify vc.Client, the next connect() call creates a new client and then returns early (line 393) when it sees a valid session
- This prevents dependent clients from being recreated (lines 420-451)
- Dependent clients retain stale session references

Correct fix:

- Just logout, don't nullify anything
- connect() will detect the invalid session (after logout)
- Dependent clients are recreated with fresh sessions

This fixes NotAuthenticated errors on VsanClient, CnsClient, etc. after vCenter service disruptions.
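The difference between the two cleanup strategies can be modelled with a small, self-contained sketch. Everything here is a simplified stand-in: `session`, `client`, `virtualCenter`, `cleanupBuggy`, and `cleanupFixed` are hypothetical names, and the real `connect()` and `cleanupVCClient()` do considerably more. The model only reproduces the control flow described in the commit message: an early return on a valid session, and dependent clients that are rebuilt only on the re-login path.

```go
package main

import "fmt"

// Simplified stand-ins for illustration; not the driver's real types.
type session struct{ valid bool }

type client struct{ sess *session }

type virtualCenter struct {
	Client    *client // primary client; owns the session
	CnsClient *client // dependent client; shares the primary session
}

// connect models the flow described above: create the primary client if
// it is missing, return early if its session is valid, otherwise log in
// again and rebuild the dependent client on the fresh session.
func (vc *virtualCenter) connect() {
	if vc.Client == nil {
		vc.Client = &client{sess: &session{valid: true}}
	}
	if vc.Client.sess.valid {
		return // early return: dependent clients are left untouched
	}
	vc.Client.sess = &session{valid: true}
	vc.CnsClient = &client{sess: vc.Client.sess}
}

// cleanupBuggy logs out and also nullifies the primary client.
func cleanupBuggy(vc *virtualCenter) {
	vc.Client.sess.valid = false
	vc.Client = nil
}

// cleanupFixed only logs out, leaving the client in place.
func cleanupFixed(vc *virtualCenter) {
	vc.Client.sess.valid = false
}

// newVC returns a connected model with both clients sharing one session.
func newVC() *virtualCenter {
	s := &session{valid: true}
	return &virtualCenter{Client: &client{sess: s}, CnsClient: &client{sess: s}}
}

func main() {
	buggy := newVC()
	cleanupBuggy(buggy)
	buggy.connect()
	fmt.Println("buggy: dependent session valid =", buggy.CnsClient.sess.valid)

	fixed := newVC()
	cleanupFixed(fixed)
	fixed.connect()
	fmt.Println("fixed: dependent session valid =", fixed.CnsClient.sess.valid)
}
```

In this model the buggy path leaves `CnsClient` pointing at the logged-out session, while the fixed path falls through to the rebuild branch and gives it a fresh one.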
Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
FAILED --- Jenkins Build #950
Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
SUCCESS --- Jenkins Build #951
FAILED --- Jenkins Build #838
Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Summary
#3787 introduced a retry mechanism to handle container restarts during WCP password rotation. However, the cleanup routine it introduced violates an undocumented assumption: the Client of the virtual center should not be nullified. #3864 reverted this broken password rotation change.
Root Cause
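The cleanupVCClient() function was nullifying vc.Client during error recovery. This broke the intended reconnection flow:
1. cleanupVCClient() nullified vc.Client.
2. The next connect() call created a new client with a valid session.
3. connect() then returned early on seeing the valid session, so dependent clients (VsanClient, CnsClient, etc.) were never recreated and kept stale session references.

This PR cherry-picks the password rotation back to main with the correct cleanup routine on top. Fixes #3787.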
Testing Done
Test Summary
Validated fix across service disruption and password rotation scenarios:
Test Scenarios
Bug Reproduction (Buggy Code)
Setup: Deployed buggy code, enabled DEBUG logging, restarted vpxd
Timeline:
Result: Successfully reproduced exact customer scenario
Fix Validation (Fixed Code)
Setup: Deployed fix, enabled DEBUG logging, restarted vpxd
Timeline:
Result: No NotAuthenticated errors after service disruption
Password Rotation Test 1: WCP Service Down
Scenario: Simulate password rotation with WCP service maintenance
Steps:
Timeline:
Result: Automatic recovery after backoff
Password Rotation Test 2: vpxd Restart
Scenario: Simulate password rotation with service restart (no pod restart)
Steps:
Timeline:
Result: No pod restart needed, automatic recovery
Precheckin runs
WCP
VKS
Vanilla
Special Notes for Reviewer
The vc.connect code had an implicit assumption that vc.Client should persist across reconnection attempts, but this was never documented.
Release Note