fix(storage-users): do not expire the ID cache by default#12416
Conversation
2aa34bd to
fe55e64
Compare
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
|
@owncloud/ocis-maintainers — friendly nudge: this PR has had no review since it was opened on 2026-06-13 and CI is green. Could someone take a look when you have a moment? Happy to rebase or adjust if anything needs changing. |
kobergj
left a comment
There was a problem hiding this comment.
@dj4oC I dont quite understand the issue. If TTL is breaking the system we need to either fix the issue properly or remove TTL functionality completely. If we just use a sane default everybody who has a TTL set will still run into the issue.
|
@kobergj you're right, thanks — changing only the default doesn't actually fix it.
Since expiring this index is a correctness / data-availability problem rather than a tuning preference, it shouldn't be configurable at all. I'll rework the PR to remove the TTL functionality from the storage-users ID cache entirely — drop the The deeper fix — making POSIX re-resolve on an ID-cache miss instead of treating it as not-found, which would make a TTL safe again — is a larger change in vendored reva; I'd track that as a follow-up. And if bounding the in-memory store's memory is the real concern, that's better served by a size/LRU cap than a time expiry that drops still-needed entries. Reworking now. |
fe55e64 to
34361c8
Compare
|
Pushed the rework (
Went with hard removal rather than a deprecation cycle because the field's dual binding to the global |
kobergj
left a comment
There was a problem hiding this comment.
Nice now 👍 Just please remove unnecessary comment.
DeepDiver1975
left a comment
There was a problem hiding this comment.
Reviewed as maintainer. The technical approach here is sound and I agree with the direction — thanks for reworking it from a "sane default" into a real fix.
Correctness — verified:
cache_ttl: 0genuinely means "never expire": in the reva cache layer (vendor/.../pkg/storage/cache/cache.go) every write usesRecord{Expiry: cache.ttl}+WriteTTL(cache.ttl), and0is treated as no-expiry by both the go-micro memory store andnats-js-kv(MaxAge: 0). Pinning it inrevaconfigfor all five driver builders (Posix/Ocis/OcisNoEvents/S3NG/S3NGNoEvents) is the correct enforcement point.- Stale-ID concern is covered: the id↔path index isn't left to drift because reva actively invalidates it —
tree.gocallsidCache.Delete(...)on move (oldNode path) and on delete (line 251 / 449). So removing the time expiry does not introduce stale entries; entries are removed when the underlying path changes. That's exactly why a TTL was the wrong tool here. - Hard-removing the field rather than picking a smaller default is the right call and correctly answers @kobergj's first point: the field was double-bound to the global
OCIS_CACHE_TTL, so any deployment setting a global cache TTL would still have hit the bug. A field-level deprecation would also have been semantically wrong given that dual binding. FilemetadataCache.TTL(a real cache) is correctly left untouched — confirmed by the guard test.- No dangling
IDCache.TTLreferences remain anywhere outside the changelog; build/vet clean.
Tests: good. TestIDCacheTTLIsPinnedToZero exercises all five driver configs and TestDefaultConfigFilemetadataCacheKeepsTTL guards against accidentally dropping the metadata-cache TTL. Both pass locally.
Docs: removing STORAGE_USERS_ID_CACHE_TTL from docs/helpers/env_vars.yaml is correct and required by the env-var maintenance process (it's maintained in the same PR that removes the var from code); OCIS_CACHE_TTL is correctly retained for the other caches.
One thing still open (not approving on my side until it's resolved): @kobergj already requested (inline on config.go, and "just please remove unnecessary comment") that the multi-line // No TTL field: ... block be dropped since the changelog already captures the rationale. That comment is still present in services/storage-users/pkg/config/config.go. Please remove it (the defaultconfig.go / drivers.go comments are arguably fine to keep since they sit at the enforcement point, but the config.go struct comment is the one called out). Once that nit is addressed and @kobergj's change-request is cleared, this looks good to merge.
Operational note for the release: the PR body already flags it, but worth repeating — existing nats-js-kv ID-cache buckets created with the old 24m MaxAge keep that MaxAge until the bucket is recreated, so existing affected deployments won't be fixed by an upgrade alone.
34361c8 to
1bd2065
Compare
…ver expires The storage-users ID cache holds the authoritative id<->path index. Its TTL was settable via OCIS_CACHE_TTL / STORAGE_USERS_ID_CACHE_TTL (24m default), and the cache layer applies it as a per-write expiry (in-memory) or bucket MaxAge (nats-js-kv). Once entries aged out the provider lost track of existing nodes: files vanished with the POSIX driver, re-resolve thrash with the decomposed drivers. Changing only the default left the footgun reachable via the global OCIS_CACHE_TTL knob, so this removes the TTL functionality from the ID cache entirely: drop the IDCache.TTL field (and its env bindings) and pin the reva cache TTL for the ID cache to 0 in revaconfig, regardless of operator config. FilemetadataCache keeps its TTL. STORAGE_USERS_ID_CACHE_TTL is removed from docs/helpers/env_vars.yaml; OCIS_CACHE_TTL stays (other caches still use it). The deeper fix - having the POSIX driver re-resolve on an ID-cache miss instead of treating it as not-found - belongs in reva and is left as a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: David Walter <david.walter@kiteworks.com>
1bd2065 to
d31cc51
Compare
|
Removed the Per @DeepDiver1975's note I kept the short comments that sit at the enforcement point in |
|
Re-checked the latest push ( From my side the change is sound and the requested nit is addressed. @kobergj — when you have a moment, could you re-review / dismiss your earlier |
Problem
The storage-users ID cache — the id↔path index the storage provider relies on to resolve nodes — expired its entries 24 minutes after they were written. Once an entry aged out, the provider could no longer resolve that node:
Root cause
DefaultConfig()(services/storage-users/pkg/config/defaults/defaultconfig.go) setIDCache.TTL = 24 * 60 * time.Second. That value is forwarded to the reva storage driver ascache_ttland consumed by the cache layer (vendor/github.com/owncloud/reva/v2/pkg/storage/cache/cache.go), which writes every entry withRecord{Expiry: cfg.IDCache.TTL}. For in-memory stores that is a per-write expiry; for thenats-js-kvstore it becomes the bucket-wideMaxAge. Either way the index entries expired after 24 minutes.The ID cache is an authoritative index, not transient data — it is not supposed to expire. (The field's
desceven still describes the unrelated OIDC "user info cache", indicating the TTL was never meant for this cache.)Fix
Remove the default TTL from the
IDCacheconfig block so it defaults to0, which both store backends treat as "no expiry":Record.Expiry == 0→ entry never expires (go-micro.dev/v4/store/memory.go);nats-js-kv:DefaultTTL(0)→ bucket created withMaxAge: 0→ no expiry.The
FilemetadataCacheTTL is intentionally left unchanged (it is a real cache). TheSTORAGE_USERS_ID_CACHE_TTL/OCIS_CACHE_TTLknob is kept for operators who explicitly want a TTL; this matches the intent of the upstream opencloud-eu/opencloud change (which dropped the setting entirely — happy to do the same here if preferred).Testing
go test ./services/storage-users/pkg/config/defaults/ -run IDCache— newdefaultconfig_test.goassertsIDCache.TTL == 0and thatFilemetadataCache.TTLis unchanged. Fails on master (24m0s), passes with the fix.go build ./services/storage-users/...andgo vetclean.Risk
Low. Config-default-only change; no code paths altered, no public/CS3 API change, no vendored code touched.
Operational note: existing
nats-js-kvID-cache buckets created with the previous 24mMaxAgekeep it until the bucket is recreated. The fix prevents new deployments from getting the harmful default.