fix(storage-users): do not expire the ID cache by default #12416

Open
dj4oC wants to merge 1 commit into owncloud:master from dj4oC:fix/storage-users/id-cache-no-expiry

Conversation

@dj4oC

@dj4oC dj4oC commented Jun 13, 2026

Contributor

Problem

The storage-users ID cache — the id↔path index the storage provider relies on to resolve nodes — expired its entries 24 minutes after they were written. Once an entry aged out, the provider could no longer resolve that node:

  • with the POSIX driver, affected files and folders appeared to vanish from listings (and sync clients would be told to delete them locally);
  • with the decomposed drivers it degraded to repeatedly re-resolving entries from disk (cache thrash + extra IO).

Root cause

DefaultConfig() (services/storage-users/pkg/config/defaults/defaultconfig.go) set IDCache.TTL = 24 * 60 * time.Second. That value is forwarded to the reva storage driver as cache_ttl and consumed by the cache layer (vendor/github.com/owncloud/reva/v2/pkg/storage/cache/cache.go), which writes every entry with Record{Expiry: cfg.IDCache.TTL}. For in-memory stores that is a per-write expiry; for the nats-js-kv store it becomes the bucket-wide MaxAge. Either way the index entries expired after 24 minutes.

The ID cache is an authoritative index, not transient data — it is not supposed to expire. (The field's desc even still describes the unrelated OIDC "user info cache", indicating the TTL was never meant for this cache.)

Fix

Remove the default TTL from the IDCache config block so it defaults to 0, which both store backends treat as "no expiry":

  • in-memory store: Record.Expiry == 0 → entry never expires (go-micro.dev/v4/store/memory.go);
  • nats-js-kv: DefaultTTL(0) → bucket created with MaxAge: 0 → no expiry.

The FilemetadataCache TTL is intentionally left unchanged (it is a real cache). The STORAGE_USERS_ID_CACHE_TTL / OCIS_CACHE_TTL knob is kept for operators who explicitly want a TTL; this matches the intent of the upstream opencloud-eu/opencloud change (which dropped the setting entirely — happy to do the same here if preferred).

Testing

  • go test ./services/storage-users/pkg/config/defaults/ -run IDCache — new defaultconfig_test.go asserts IDCache.TTL == 0 and that FilemetadataCache.TTL is unchanged. Fails on master (24m0s), passes with the fix.
  • go build ./services/storage-users/... and go vet clean.

Risk

Low. Config-default-only change; no code paths altered, no public/CS3 API change, no vendored code touched.

Operational note: existing nats-js-kv ID-cache buckets created with the previous 24m MaxAge keep it until the bucket is recreated. The fix prevents new deployments from getting the harmful default.

@dj4oC dj4oC force-pushed the fix/storage-users/id-cache-no-expiry branch from 2aa34bd to fe55e64 Compare June 13, 2026 08:03
@kw-security

kw-security commented Jun 13, 2026

Snyk checks have passed. No issues have been found so far.

| Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- |
| Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| Licenses | 0 | 0 | 0 | 0 | 0 issues |
| Code Security | 0 | 0 | 0 | 0 | 0 issues |

@dj4oC

dj4oC commented Jun 23, 2026

Contributor Author

@owncloud/ocis-maintainers — friendly nudge: this PR has had no review since it was opened on 2026-06-13 and CI is green. Could someone take a look when you have a moment? Happy to rebase or adjust if anything needs changing.

@kobergj kobergj left a comment

Contributor

@dj4oC I don't quite understand the issue. If TTL is breaking the system, we need to either fix the issue properly or remove the TTL functionality completely. If we just use a sane default, everybody who has a TTL set will still run into the issue.

@dj4oC

dj4oC commented Jun 24, 2026

Contributor Author

@kobergj you're right, thanks — changing only the default doesn't actually fix it.

IDCache.TTL is bound to OCIS_CACHE_TTL (the global cache knob) as well as STORAGE_USERS_ID_CACHE_TTL, so any deployment that sets a global cache TTL still writes the id↔path index with an expiry and hits the same failure: nodes vanish with the POSIX driver, re-resolve thrash with the decomposed ones. A "sane default" leaves that fully reachable.

Since expiring this index is a correctness / data-availability problem rather than a tuning preference, it shouldn't be configurable at all. I'll rework the PR to remove the TTL functionality from the storage-users ID cache entirely — drop the TTL field and its OCIS_CACHE_TTL;STORAGE_USERS_ID_CACHE_TTL bindings so the index can never be set to expire, force 0 to the reva cache, and fix the field's stale "user info cache" desc. FilemetadataCache keeps its TTL (it's a real cache).

The deeper fix — making POSIX re-resolve on an ID-cache miss instead of treating it as not-found, which would make a TTL safe again — is a larger change in vendored reva; I'd track that as a follow-up. And if bounding the in-memory store's memory is the real concern, that's better served by a size/LRU cap than a time expiry that drops still-needed entries.

Reworking now.

@dj4oC dj4oC force-pushed the fix/storage-users/id-cache-no-expiry branch from fe55e64 to 34361c8 Compare June 24, 2026 16:09
@dj4oC

dj4oC commented Jun 24, 2026

Contributor Author

Pushed the rework (34361c8e):

  • Removed the IDCache.TTL field and its OCIS_CACHE_TTL;STORAGE_USERS_ID_CACHE_TTL bindings, so the id↔path index can no longer be configured to expire by anyone.
  • Pinned the reva ID-cache cache_ttl to 0 in revaconfig (hard-coded, not operator-configurable).
  • FilemetadataCache keeps its TTL. Removed STORAGE_USERS_ID_CACHE_TTL from docs/helpers/env_vars.yaml; OCIS_CACHE_TTL stays (other caches still use it).
  • Added a revaconfig test asserting cache_ttl == 0 for all ID-cache driver configs (Posix/Ocis/OcisNoEvents/S3NG/S3NGNoEvents); kept a guard that FilemetadataCache.TTL is unchanged.

Went with hard removal rather than a deprecation cycle because the field's dual binding to the global OCIS_CACHE_TTL makes a field-level deprecation annotation incorrect, and a silently-ignored knob would be its own footgun. go build, go test and go vet are clean locally. The deeper fix (POSIX re-resolving on an ID-cache miss, which would make a TTL safe again) is a reva change I'd track as a follow-up.

@dj4oC dj4oC requested a review from kobergj June 24, 2026 16:09

@kobergj kobergj left a comment

Contributor

Nice now 👍 Just please remove unnecessary comment.

Comment thread services/storage-users/pkg/config/config.go Outdated

@DeepDiver1975 DeepDiver1975 left a comment

Member

Reviewed as maintainer. The technical approach here is sound and I agree with the direction — thanks for reworking it from a "sane default" into a real fix.

Correctness — verified:

  • cache_ttl: 0 genuinely means "never expire": in the reva cache layer (vendor/.../pkg/storage/cache/cache.go) every write uses Record{Expiry: cache.ttl} + WriteTTL(cache.ttl), and 0 is treated as no-expiry by both the go-micro memory store and nats-js-kv (MaxAge: 0). Pinning it in revaconfig for all five driver builders (Posix/Ocis/OcisNoEvents/S3NG/S3NGNoEvents) is the correct enforcement point.
  • Stale-ID concern is covered: the id↔path index isn't left to drift because reva actively invalidates it — tree.go calls idCache.Delete(...) on move (oldNode path) and on delete (line 251 / 449). So removing the time expiry does not introduce stale entries; entries are removed when the underlying path changes. That's exactly why a TTL was the wrong tool here.
  • Hard-removing the field rather than picking a smaller default is the right call and correctly answers @kobergj's first point: the field was double-bound to the global OCIS_CACHE_TTL, so any deployment setting a global cache TTL would still have hit the bug. A field-level deprecation would also have been semantically wrong given that dual binding.
  • FilemetadataCache.TTL (a real cache) is correctly left untouched — confirmed by the guard test.
  • No dangling IDCache.TTL references remain anywhere outside the changelog; build/vet clean.
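
The invalidation argument in the second bullet can be illustrated with a toy index that is updated on move and delete rather than aged out by time. This is not reva's tree.go; the `tree` type and its methods are hypothetical and heavily simplified.

```go
package main

import "fmt"

// tree is a toy model of a storage tree with an id -> path index.
// It is not reva's tree.go; names and structure are illustrative only.
type tree struct {
	idCache map[string]string
}

// move drops the index entry for the old location and records the new path.
func (t *tree) move(id, newPath string) {
	delete(t.idCache, id)
	t.idCache[id] = newPath
}

// remove deletes the node's index entry together with the node.
func (t *tree) remove(id string) {
	delete(t.idCache, id)
}

func main() {
	tr := &tree{idCache: map[string]string{"id-a": "/users/alice/a.txt"}}

	tr.move("id-a", "/users/alice/docs/a.txt")
	fmt.Println(tr.idCache["id-a"]) // prints /users/alice/docs/a.txt

	tr.remove("id-a")
	_, ok := tr.idCache["id-a"]
	fmt.Println(ok) // prints false
}
```

Because entries change only when the underlying path changes, no time-based expiry is needed to keep the index correct.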

Tests: good. TestIDCacheTTLIsPinnedToZero exercises all five driver configs and TestDefaultConfigFilemetadataCacheKeepsTTL guards against accidentally dropping the metadata-cache TTL. Both pass locally.

Docs: removing STORAGE_USERS_ID_CACHE_TTL from docs/helpers/env_vars.yaml is correct and required by the env-var maintenance process (it's maintained in the same PR that removes the var from code); OCIS_CACHE_TTL is correctly retained for the other caches.

One thing still open (not approving on my side until it's resolved): @kobergj already requested (inline on config.go, and "just please remove unnecessary comment") that the multi-line // No TTL field: ... block be dropped since the changelog already captures the rationale. That comment is still present in services/storage-users/pkg/config/config.go. Please remove it (the defaultconfig.go / drivers.go comments are arguably fine to keep since they sit at the enforcement point, but the config.go struct comment is the one called out). Once that nit is addressed and @kobergj's change-request is cleared, this looks good to merge.

Operational note for the release: the PR body already flags it, but worth repeating — existing nats-js-kv ID-cache buckets created with the old 24m MaxAge keep that MaxAge until the bucket is recreated, so existing affected deployments won't be fixed by an upgrade alone.

@dj4oC dj4oC force-pushed the fix/storage-users/id-cache-no-expiry branch from 34361c8 to 1bd2065 Compare June 27, 2026 06:32
…ver expires

The storage-users ID cache holds the authoritative id<->path index. Its TTL
was settable via OCIS_CACHE_TTL / STORAGE_USERS_ID_CACHE_TTL (24m default), and
the cache layer applies it as a per-write expiry (in-memory) or bucket MaxAge
(nats-js-kv). Once entries aged out the provider lost track of existing nodes:
files vanished with the POSIX driver, re-resolve thrash with the decomposed
drivers.

Changing only the default left the footgun reachable via the global
OCIS_CACHE_TTL knob, so this removes the TTL functionality from the ID cache
entirely: drop the IDCache.TTL field (and its env bindings) and pin the reva
cache TTL for the ID cache to 0 in revaconfig, regardless of operator config.
FilemetadataCache keeps its TTL. STORAGE_USERS_ID_CACHE_TTL is removed from
docs/helpers/env_vars.yaml; OCIS_CACHE_TTL stays (other caches still use it).

The deeper fix - having the POSIX driver re-resolve on an ID-cache miss instead
of treating it as not-found - belongs in reva and is left as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: David Walter <david.walter@kiteworks.com>
@dj4oC dj4oC force-pushed the fix/storage-users/id-cache-no-expiry branch from 1bd2065 to d31cc51 Compare June 27, 2026 06:38
@dj4oC

dj4oC commented Jun 27, 2026

Contributor Author

Removed the // No TTL field: ... comment block from services/storage-users/pkg/config/config.go (the rationale lives in the changelog) in d31cc51.

Per @DeepDiver1975's note I kept the short comments that sit at the enforcement point in revaconfig/drivers.go (and defaultconfig.go), since they explain why the TTL is pinned to 0 right there; happy to drop those too if you'd prefer zero comments. gofmt realigned the struct after the comment came out. Tests still green (TestIDCacheTTLIsPinnedToZero, TestDefaultConfigFilemetadataCacheKeepsTTL), build + vet clean.

@dj4oC dj4oC requested review from DeepDiver1975 and kobergj June 27, 2026 06:38
@DeepDiver1975

Member

Re-checked the latest push (d31cc51): the desc/changelog-referencing comment block @kobergj flagged on services/storage-users/pkg/config/config.go is now removed, and the new test guards (drivers_test.go pinning cache_ttl to 0 across all driver builders, plus the metadata-cache TTL guard) are in place. CI is fully green (only k6-load-test is skipped, which is expected).

From my side the change is sound and the requested nit is addressed. @kobergj, when you have a moment, could you re-review or dismiss your earlier "changes requested" review? That's the only thing still blocking here.

4 participants