Skip to content

Fix - NotAuthenticated errors after service disruptions#3869

Open
kolluria wants to merge 2 commits intokubernetes-sigs:masterfrom
kolluria:fix-pwd-rotation
Open

Fix - NotAuthenticated errors after service disruptions#3869
kolluria wants to merge 2 commits intokubernetes-sigs:masterfrom
kolluria:fix-pwd-rotation

Conversation

@kolluria
Copy link
Contributor

@kolluria kolluria commented Jan 21, 2026

Summary

#3787 introduced retry mechanism to handle container restarts during WCP password rotation. But, the cleanup routine that this PR introduced violates an undocumented assumption that Client of the virtual center should not be nullified.
#3864 reverted this broken password rotation change.

Root Cause

The cleanupVCClient() function was nullifying vc.Client during error recovery. This broke the intended reconnection flow:

  1. After service disruption, cleanupVCClient() nullified vc.Client
  2. Next connect() call created a new client with a valid session
  3. Session validity check passed → early return at line 393
  4. Dependent client recreation code (lines 420-451) never reached
  5. Dependent clients (VsanClient, CnsClient, etc.) retained stale session references
  6. Operations using dependent clients failed with NotAuthenticated

This PR cherry-picks the password rotation back to main with the correct cleanup routine on top. Fixes #3787.

Testing Done

Test Summary

Validated fix across service disruption and password rotation scenarios:

Test Scenario Expected Behavior Result
Bug Reproduction vpxd restart with buggy code NotAuthenticated on VsanClient PASS
Fix Validation vpxd restart with fix All operations succeed(new clients created) PASS
Password Rotation 1 WCP down + secret change + restart 3-min backoff → success PASS
Password Rotation 2 Secret change + vpxd restart(service disruption) 3-min backoff → success PASS

Test Scenarios

Bug Reproduction (Buggy Code)

Setup: Deployed buggy code, enabled DEBUG logging, restarted vpxd

Timeline:

08:02:10 - New vCenter session created successfully
08:03:40 - Auth manager refresh triggered
08:03:40 - HasUserPrivilegeOnEntities: SUCCESS (main SDK works)
08:03:40 - VsanClusterGetConfig: FAILED NotAuthenticated (VsanClient fails)

Result: Successfully reproduced exact customer scenario

Fix Validation (Fixed Code)

Setup: Deployed fix, enabled DEBUG logging, restarted vpxd

Timeline:

08:49:06 - New vCenter session created successfully  
08:50:39 - Auth manager refresh triggered
08:50:39 - HasUserPrivilegeOnEntities: SUCCESS (main SDK works)
08:50:39 - vsan cluster config: SUCCESS (no NotAuthenticated errors)

Result: No NotAuthenticated errors after service disruption

Password Rotation Test 1: WCP Service Down

Scenario: Simulate password rotation with WCP service maintenance

Steps:

  1. Stopped WCP service
  2. Updated secret with wrong password
  3. Restarted CSI deployment
  4. Started WCP service (auto-updated secret)
  5. Observed backoff and retry

Timeline:

10:45:35 - InvalidLogin detected, 3-minute backoff started
10:47:58 - WCP auto-updated secret with correct password
10:48:35 - Retry after backoff → SUCCESS

Result: Automatic recovery after backoff

Password Rotation Test 2: vpxd Restart

Scenario: Simulate password rotation with service restart (no pod restart)

Steps:

  1. Updated secret with wrong password (pod still running)
  2. Restarted vpxd (forced re-authentication)
  3. Observed InvalidLogin and backoff
  4. Restored correct password
  5. Observed successful connection after backoff

Timeline:

10:52:04 - InvalidLogin detected, 3-minute backoff started
10:53:34 - Secret restored with correct password
10:55:07 - Retry after backoff → SUCCESS

Result: No pod restart needed, automatic recovery

Precheckin runs

WCP

Ran 16 of 2324 Specs
SUCCESS! -- 16 Passed | 0 Failed | 0 Pending | 2308 Skipped | 0 Flaked

VKS

Ran 5 of 2324 Specs
SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 2319 Skipped | 0 Flaked

Vanilla

Special Notes for Reviewer

  • The existing vc.connect code had an implicit assumption that vc.Client should persist across reconnection attempts, but this was never documented
  • This fix makes the implicit assumption explicit through code comments
  • This pattern is critical for proper session management but was not obvious from the code

Release Note

Fix NotAuthenticated errors on dependent client operations after vCenter service disruptions by correcting error recovery logic

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2026
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kolluria

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 21, 2026
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 22, 2026
@kolluria kolluria changed the title Fix pwd rotation Fix - NotAuthenticated errors after service disruptions Jan 27, 2026
@kolluria kolluria marked this pull request as ready for review January 27, 2026 09:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2026
@kolluria
Copy link
Contributor Author

/cc @divyenpatel @deepakkinni @xing-yang
/assign @kolluria

@deepakkinni
Copy link
Collaborator

FAILED --- Jenkins Build #918

@deepakkinni
Copy link
Collaborator

SUCCESS --- Jenkins Build #821

@deepakkinni
Copy link
Collaborator

FAILED --- Jenkins Build #919

@deepakkinni
Copy link
Collaborator

FAILED --- Jenkins Build #920

…rnetes-sigs#3787)

Implements retry logic in Connect() method to handle WCP password rotation
scenarios where CSI containers restart during credential updates.

- Add retry loop with exponential backoff (2 attempts, 3min delay)
- Reload vCenter config from secret before each retry attempt
Root cause:
- cleanupVCClient() is called after errors to logout and cleanup
- If we nullify vc.Client, the next connect() call creates a new client
  and then returns early (line 393) when it sees a valid session
- This prevents dependent clients from being recreated (lines 420-451)
- Dependent clients retain stale session references

Correct fix:
- Just logout, don't nullify anything
- connect() will detect the invalid session (after logout)
- Dependent clients are recreated with fresh sessions

This fixes NotAuthenticated errors on VsanClient, CnsClient, etc.
after vCenter service disruptions.
@deepakkinni
Copy link
Collaborator

Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #948

@deepakkinni
Copy link
Collaborator

Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #947

@deepakkinni
Copy link
Collaborator

Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #950

@deepakkinni
Copy link
Collaborator

FAILED --- Jenkins Build #950

@deepakkinni
Copy link
Collaborator

Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #951

@deepakkinni
Copy link
Collaborator

SUCCESS --- Jenkins Build #951

@deepakkinni
Copy link
Collaborator

FAILED --- Jenkins Build #838

@deepakkinni
Copy link
Collaborator

Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #839

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants