Skip to content

feat: add manual failover management#444

Open
SantiagoDePolonia wants to merge 5 commits into
mainfrom
feature/failover-management
Open

feat: add manual failover management#444
SantiagoDePolonia wants to merge 5 commits into
mainfrom
feature/failover-management

Conversation

@SantiagoDePolonia

@SantiagoDePolonia SantiagoDePolonia commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Summary

  • add manual-only failover runtime rules with env/config/dashboard support
  • keep deprecated default_mode accepted but ignored at runtime
  • store failover rules separately from virtual models and expose admin APIs/dashboard UI
  • add failover reset and automatic suggestion generation for dashboard workflows

Verification

  • go test ./cmd/... gomodel/internal/admin
    gomodel/internal/admin/dashboard
    gomodel/internal/anthropicapi
    gomodel/internal/app
    gomodel/internal/auditlog
    gomodel/internal/authkeys
    gomodel/internal/batch
    gomodel/internal/batchrewrite
    gomodel/internal/budget
    gomodel/internal/cache
    gomodel/internal/cache/modelcache
    gomodel/internal/conversationstore
    gomodel/internal/core
    gomodel/internal/embedding
    gomodel/internal/failover
    gomodel/internal/fallback
    gomodel/internal/filestore
    gomodel/internal/gateway
    gomodel/internal/guardrails
    gomodel/internal/httpclient
    gomodel/internal/live
    gomodel/internal/llmclient
    gomodel/internal/modeldata
    gomodel/internal/modelselectors
    gomodel/internal/observability
    gomodel/internal/pricingoverrides
    gomodel/internal/providers
    gomodel/internal/providers/anthropic
    gomodel/internal/providers/azure
    gomodel/internal/providers/bailian
    gomodel/internal/providers/bedrock
    gomodel/internal/providers/deepseek
    gomodel/internal/providers/gemini
    gomodel/internal/providers/googlecommon
    gomodel/internal/providers/groq
    gomodel/internal/providers/minimax
    gomodel/internal/providers/ollama
    gomodel/internal/providers/openai
    gomodel/internal/providers/opencodego
    gomodel/internal/providers/openrouter
    gomodel/internal/providers/oracle
    gomodel/internal/providers/vertex
    gomodel/internal/providers/vllm
    gomodel/internal/providers/xai
    gomodel/internal/providers/xiaomi
    gomodel/internal/providers/zai
    gomodel/internal/realtime
    gomodel/internal/responsecache
    gomodel/internal/responsestore
    gomodel/internal/server
    gomodel/internal/storage
    gomodel/internal/storage/sqlutil
    gomodel/internal/streaming
    gomodel/internal/usage
    gomodel/internal/validation
    gomodel/internal/version
    gomodel/internal/virtualmodels
    gomodel/internal/workflows ./config/...
  • node --test internal/admin/dashboard/static/js/modules/*.test.cjs
  • pre-commit hooks: make test-race, go mod tidy, dashboard JS tests, performance guard, make lint, mint validate

Summary by CodeRabbit

  • New Features

    • Added a new Failover admin page with rule management, reset, and auto-generate actions.
    • Introduced a failover editor in the dashboard and a new failover icon/action in model tables.
    • Expanded dashboard and runtime config to support a Failover enabled toggle.
  • Bug Fixes

    • Updated failover behavior to use manual rules and disable lists consistently.
    • Improved handling of legacy fallback settings so older values remain accepted as compatibility inputs.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a complete failover rules management system: new internal/failover package with types, Store interface, SQLite/PostgreSQL/MongoDB backends, an in-memory Service with background refresh, and a factory. Extends FallbackConfig with Enabled/Rules/Disabled inputs, reworks FallbackResolver to use manual-only runtime rules with a dynamic RuleProvider, wires everything through the app, registers admin REST endpoints, and ships a new admin dashboard page.

Changes

Failover Rules Management

Layer / File(s) Summary
Config schema and fallback resolution
config/fallback.go, config/config.go, internal/fallback/resolver.go, config/config_test.go, internal/fallback/resolver_test.go
FallbackConfig gains Enabled, inline Rules/RulesJSON, and DisabledModels/DisabledModelsJSON inputs plus a computed Disabled map. Deprecated default_mode validation is relaxed. decodeFallbackRuleJSON strictly parses rule JSON. FallbackResolver is reworked to use Enabled/Manual/Disabled fields plus an optional RuleProvider for dynamic merging; SuggestFallbacks is added for auto-suggestion. Tests are updated accordingly.
Failover types and storage backends
internal/failover/types.go, internal/failover/store.go, internal/failover/store_sqlite.go, internal/failover/store_postgresql.go, internal/failover/store_mongodb.go, internal/failover/store_sqlite_test.go
Exports Rule/View structs and ManagedSourceDashboard/ManagedSourceConfig constants. Store interface and shared helpers (encodeTargets, decodeTargets, stampUpsert, collectRules) are defined. SQLite, PostgreSQL, and MongoDB backends implement CRUD with table/index creation, schema migration, and row encoding.
Failover service and factory
internal/failover/service.go, internal/failover/factory.go, internal/failover/service_test.go
Service maintains an atomic snapshot merging config-managed and dashboard-persisted rules, with Refresh, read/write APIs, and StartBackgroundRefresh. factory.go provides New/NewWithSharedStorage constructors, backend resolution via createStore, and idempotent Result.Close.
Admin failover wiring and endpoints
internal/admin/handler.go, internal/admin/handler_failover.go, internal/admin/routes.go, internal/admin/routes_test.go, internal/admin/handler_test.go, internal/app/app.go, internal/app/app_test.go, internal/server/http_test.go, docs/openapi.json
Handler gains failoverRules dependency and WithFailover option. DashboardConfigResponse replaces FEATURE_FALLBACK_MODE with FAILOVER_ENABLED. Five failover endpoints are registered (list, upsert, delete, reset, generate). App initializes the failover subsystem and wires it into the server and admin handler. OpenAPI adds failover and throughput schemas.
Dashboard failover UI
internal/admin/dashboard/static/js/modules/failover.js, internal/admin/dashboard/static/js/dashboard.js, internal/admin/dashboard/templates/..., internal/admin/dashboard/static/css/dashboard.css, internal/admin/dashboard/static/js/modules/failover.test.cjs
New dashboardFailoverModule manages rule state, async fetch/save/delete/reset/generate flows, and form lifecycle. Dashboard routing and data-fetch prefetch are extended for the failover page. Templates add a sidebar link, settings section, page template, editor modal, model-table icon button, and failover-icon SVG. CSS adds active-mapping button styles.
Runtime flag migration and docs
internal/admin/dashboard/static/js/modules/workflows.js, internal/admin/dashboard/static/js/modules/workflows.test.cjs, internal/admin/dashboard/static/js/modules/workflows-layout.test.cjs, .env.template, config/config.example.yaml, docs/features/failover.mdx, internal/fallback/resolver_test.go, internal/app/app_test.go
FEATURE_FALLBACK_MODE is replaced with FAILOVER_ENABLED in workflow runtime-config keys, feature-cap derivation, and all related tests. Env template, example YAML, and failover docs are updated to document the new FAILOVER_ENABLED/FAILOVER_RULES_JSON/FAILOVER_DISABLED_MODELS_JSON inputs and deprecate legacy mode flags.

Sequence Diagram(s)

sequenceDiagram
  participant AdminUI as Dashboard UI
  participant Handler as admin.Handler
  participant Service as failover.Service
  participant Store as failover.Store
  participant Resolver as fallback.Resolver

  AdminUI->>Handler: GET /admin/failover
  Handler->>Service: ListViews()
  Service-->>Handler: []View
  Handler-->>AdminUI: JSON rule list

  AdminUI->>Handler: PUT /admin/failover
  Handler->>Service: Upsert(Rule)
  Service->>Store: Upsert(Rule)
  Store-->>Service: ok
  Service->>Service: Refresh()
  Handler-->>AdminUI: View / 204

  AdminUI->>Handler: POST /admin/failover/generate
  Handler->>Resolver: SuggestFallbacks(resolution, op)
  Resolver-->>Handler: []ModelSelector
  Handler-->>AdminUI: []View suggestions
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • ENTERPILOT/GoModel#186: Touches the same admin dashboard runtime-flag plumbing and workflow failover UI gating that this PR replaces with FAILOVER_ENABLED.
  • ENTERPILOT/GoModel#247: Introduced overlayDialogOpen() and the dashboard template split that this PR directly extends to add failoverFormOpen.
  • ENTERPILOT/GoModel#236: Created docs/features/failover.mdx which this PR substantially revises to reflect the new manual-only failover configuration model.

Suggested labels

release:feature

Poem

🐇 Hop hop, a new failover page appears,
With rules stored in SQLite, Postgres, and Mongo tiers.
FEATURE_FALLBACK_MODE hops away for good,
FAILOVER_ENABLED takes its place as it should.
The dashboard glows with a brand new icon bright—
Manual rules guide the traffic through the night! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.10% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding manual failover management.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/failover-management

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.


func (s *MongoDBStore) Get(ctx context.Context, source string) (*Rule, error) {
var rule Rule
err := s.collection.FindOne(ctx, bson.M{"_id": strings.TrimSpace(source)}).Decode(&rule)
"created_at": rule.CreatedAt,
},
}
_, err := s.collection.UpdateOne(ctx, bson.M{"_id": strings.TrimSpace(rule.Source)}, update, options.UpdateOne().SetUpsert(true))
}

func (s *MongoDBStore) Delete(ctx context.Context, source string) error {
result, err := s.collection.DeleteOne(ctx, bson.M{"_id": strings.TrimSpace(source)})
@codecov-commenter

codecov-commenter commented Jun 29, 2026

Copy link
Copy Markdown

@mintlify

mintlify Bot commented Jun 29, 2026

Copy link
Copy Markdown

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
gomodel 🟢 Ready View Preview Jun 29, 2026, 9:48 PM

💡 Tip: Enable Workflows to automatically generate PRs for you.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/fallback/resolver_test.go (1)

55-83: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Assert that manual targets never reappear in SuggestFallbacks().

This test now only checks the first two suggestions. If SuggestFallbacks() starts appending the already-configured manual target ("azure/gpt-4o") later in the slice, the test still passes even though internal/fallback/resolver.go:171-189 is explicitly deduping manual selectors before auto suggestions are generated. Add a negative assertion over the full result set. As per coding guidelines, **/*_test.go: Add or update tests for behavior changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/fallback/resolver_test.go` around lines 55 - 83, The fallback
ranking test for Resolver currently only validates the first two suggestions, so
it can miss regressions where the manual target is appended later in the slice.
Update TestResolverSuggestFallbacksReturnsRankingCandidates to assert across the
full result from SuggestFallbacks that the manual selector from
Resolver.NewResolver config (the "azure/gpt-4o" entry) never appears, while
still preserving the existing ordering checks for the top ranked auto
suggestions.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/fallback.go`:
- Around line 252-263: The disabled-model JSON parsing currently accepts the
literal null because json.Unmarshal into the slice in the cfg.DisabledModelsJSON
branch succeeds with a nil result; update this parsing path to explicitly reject
null before or after unmarshaling. In the fallback config logic that builds
disabled, keep allowing only a JSON array or boolean object shape, and if the
expanded value is null return an error instead of treating it as valid input.

In `@docs/features/failover.mdx`:
- Around line 13-15: The overview text in the failover documentation has an
awkward, broken sentence that makes the runtime migration guidance unclear.
Update the affected prose in the failover overview to rephrase the runtime
failover guidance into a complete, unambiguous sentence, keeping the meaning
that runtime failover relies on manual rules only. Locate the text near the
introductory failover description and adjust the surrounding sentence flow for
clarity.

In `@docs/openapi.json`:
- Around line 6775-6782: The failover request schemas currently define source as
optional even though the delete and upsert handlers require it; update both
admin.deleteFailoverRuleRequest and admin.upsertFailoverRuleRequest in the
OpenAPI document to mark source as required. Make sure the required field list
matches the server contract so generated clients treat source as mandatory for
both request shapes.
- Around line 952-1003: The failover upsert response contract is missing the
successful 204 No Content path. Update the responses for the PUT /admin/failover
operation in docs/openapi.json to include a 204 entry alongside the existing
200/400/401/502/503 responses, matching the behavior in
internal/admin/handler_failover.go and its refreshed view handling.

In `@internal/admin/dashboard/static/js/modules/failover.js`:
- Around line 83-89: The failover edit flow in failover.js captures
failoverFormOriginalSource but never uses it when saving, so changing source
creates a new persisted rule instead of updating the existing one. Fix the save
path in the form handling logic by either making source immutable in edit mode
or by using failoverFormOriginalSource to delete/replace the old rule before
upserting the edited one; update the relevant edit/save methods around
failoverFormOriginalSource and failoverForm so the original key is handled
correctly.
- Around line 82-88: The failover UI is still using rule.managed to determine
whether a rule is config-managed, but this PR’s contract uses managed_source
instead. Update the failover state setup in the failover module to derive the
managed flag from managed_source, and make sure the template logic for the
edit/delete affordances also checks managed_source consistently so managed rules
are treated as non-editable dashboard entries.

In `@internal/admin/dashboard/templates/layout.html`:
- Around line 162-170: The failover target picker is using a datalist with a
textarea, but the `list` attribute is not supported on `<textarea>`, so the
autocomplete will never work. Update the `failover-target-options` /
`failover-targets` UI in `layout.html` to use a supported control such as an
`<input>` with `list`, or replace it with a custom autocomplete component while
keeping the `failoverForm.targets` binding and `failoverFormManaged` disabled
state intact.

In `@internal/admin/dashboard/templates/page-failover.html`:
- Around line 21-24: The failover filter input is bound to shared state but the
tables still render the unfiltered arrays, so typing has no effect and leaks
into the models page filter. Update the failover template to use a
failover-specific filtered collection in the table loops, and avoid reusing the
shared modelFilter binding by introducing a dedicated failover filter state and
derived list in the surrounding Alpine data/template logic. Ensure both table
sections that currently iterate the raw collections are switched to the new
filtered source.

In `@internal/admin/dashboard/templates/sidebar.html`:
- Around line 43-46: The sidebar Failover nav link is still shown and clickable
even when the feature is unavailable. Update the sidebar template’s Failover
entry to use the same availability check as the other failover entry points by
gating the anchor on failoverAvailable, so it is hidden when the API reports
failover is unavailable.

In `@internal/admin/handler_failover.go`:
- Around line 169-172: The `/admin/failover/generate` endpoint is documented as
returning `failover.View`, but it only fabricates draft data and should not use
the persisted-rule shape. Update the handler/Swagger annotations in
`admin/handler_failover.go` to return a dedicated draft DTO for this endpoint,
and adjust the generation path in the failover handler so it serializes only the
draft fields instead of `created_at`/`updated_at` or other stored-view-only
values.

In `@internal/failover/factory.go`:
- Around line 90-98: The failover factory currently reuses the model-cache
refresh ticker for propagation, which can leave peer nodes stale for too long.
Update the Result construction in the factory logic to use a dedicated, short
failover refresh interval or an explicit invalidation mechanism instead of
cfg.Cache.Model.RefreshInterval and the time.Hour fallback. Keep the change
localized around the service.StartBackgroundRefresh call so failover updates
propagate independently from the model cache.

In `@internal/failover/service_test.go`:
- Around line 60-89: The current test only covers the happy-path merge in
TestServiceConfigRulesOverrideDashboardRules; add regression tests for the new
write/error behavior too. Extend coverage around
Service/Refresh/UpsertRule/ResetDashboardRules to verify config-managed rules
cannot be overwritten by dashboard writes, config rules survive
ResetDashboardRules, and storage failures from refresh/upsert are surfaced as
errors instead of being swallowed. Use the existing symbols NewService, Refresh,
Rules, Get, and the managed-source constants to keep the tests aligned with the
failover contract.

In `@internal/failover/service.go`:
- Around line 183-185: The pre-read in `service.go` is swallowing unexpected
`Store.Get` errors and still allowing the write to continue, which can hide
storage failures and overwrite metadata. Update the `existing, err :=
s.store.Get(ctx, normalized.Source)` check in the rule write path to distinguish
`ErrNotFound` from other errors: keep the current `CreatedAt` preservation only
when a record exists, but immediately return any non-`ErrNotFound` error before
proceeding. Ensure the fix is applied in the same update flow that uses
`normalized.CreatedAt` and `s.store.Get`.

In `@internal/failover/store_mongodb.go`:
- Around line 19-32: NewMongoDBStore creates indexes with context.Background(),
so index creation ignores the caller’s cancellation and shutdown state. Update
the MongoDB initialization path to accept and use the caller context in
NewMongoDBStore, and pass that ctx through to coll.Indexes().CreateMany instead
of creating a new timeout context. Keep the existing error handling and index
definitions intact while aligning the MongoDB flow with the PostgreSQL init
path.

---

Outside diff comments:
In `@internal/fallback/resolver_test.go`:
- Around line 55-83: The fallback ranking test for Resolver currently only
validates the first two suggestions, so it can miss regressions where the manual
target is appended later in the slice. Update
TestResolverSuggestFallbacksReturnsRankingCandidates to assert across the full
result from SuggestFallbacks that the manual selector from Resolver.NewResolver
config (the "azure/gpt-4o" entry) never appears, while still preserving the
existing ordering checks for the top ranked auto suggestions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 849c231b-e921-4bda-ac96-a07c5211f4eb

📥 Commits

Reviewing files that changed from the base of the PR and between 92d4e88 and 73f3f20.

📒 Files selected for processing (36)
  • .env.template
  • config/config.example.yaml
  • config/config.go
  • config/config_test.go
  • config/fallback.go
  • docs/features/failover.mdx
  • docs/openapi.json
  • internal/admin/dashboard/static/js/dashboard.js
  • internal/admin/dashboard/static/js/modules/failover.js
  • internal/admin/dashboard/static/js/modules/workflows-layout.test.cjs
  • internal/admin/dashboard/static/js/modules/workflows.js
  • internal/admin/dashboard/static/js/modules/workflows.test.cjs
  • internal/admin/dashboard/templates/index.html
  • internal/admin/dashboard/templates/layout.html
  • internal/admin/dashboard/templates/model-table-body.html
  • internal/admin/dashboard/templates/page-failover.html
  • internal/admin/dashboard/templates/page-settings.html
  • internal/admin/dashboard/templates/sidebar.html
  • internal/admin/handler.go
  • internal/admin/handler_failover.go
  • internal/admin/handler_test.go
  • internal/admin/routes.go
  • internal/admin/routes_test.go
  • internal/app/app.go
  • internal/app/app_test.go
  • internal/failover/factory.go
  • internal/failover/service.go
  • internal/failover/service_test.go
  • internal/failover/store.go
  • internal/failover/store_mongodb.go
  • internal/failover/store_postgresql.go
  • internal/failover/store_sqlite.go
  • internal/failover/types.go
  • internal/fallback/resolver.go
  • internal/fallback/resolver_test.go
  • internal/server/http_test.go

Comment thread config/fallback.go
Comment on lines +252 to +263
if raw := strings.TrimSpace(cfg.DisabledModelsJSON); raw != "" {
expanded := expandString(raw)
var list []string
if err := json.Unmarshal([]byte(expanded), &list); err == nil {
for _, model := range list {
model = strings.TrimSpace(model)
if model != "" {
disabled[model] = true
}
}
return nilIfEmpty(disabled), nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject null for disabled-model JSON.

json.Unmarshal("null", &list) succeeds with a nil slice, so FAILOVER_DISABLED_MODELS_JSON=null is silently accepted even though this field only allows an array or boolean object.

Proposed fix
 	if raw := strings.TrimSpace(cfg.DisabledModelsJSON); raw != "" {
 		expanded := expandString(raw)
+		if bytes.Equal(bytes.TrimSpace([]byte(expanded)), []byte("null")) {
+			return nil, fmt.Errorf("fallback.disabled_models_json: null is not allowed; must be a JSON array or boolean object")
+		}
 		var list []string
 		if err := json.Unmarshal([]byte(expanded), &list); err == nil {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if raw := strings.TrimSpace(cfg.DisabledModelsJSON); raw != "" {
expanded := expandString(raw)
var list []string
if err := json.Unmarshal([]byte(expanded), &list); err == nil {
for _, model := range list {
model = strings.TrimSpace(model)
if model != "" {
disabled[model] = true
}
}
return nilIfEmpty(disabled), nil
}
if raw := strings.TrimSpace(cfg.DisabledModelsJSON); raw != "" {
expanded := expandString(raw)
if bytes.Equal(bytes.TrimSpace([]byte(expanded)), []byte("null")) {
return nil, fmt.Errorf("fallback.disabled_models_json: null is not allowed; must be a JSON array or boolean object")
}
var list []string
if err := json.Unmarshal([]byte(expanded), &list); err == nil {
for _, model := range list {
model = strings.TrimSpace(model)
if model != "" {
disabled[model] = true
}
}
return nilIfEmpty(disabled), nil
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/fallback.go` around lines 252 - 263, The disabled-model JSON parsing
currently accepts the literal null because json.Unmarshal into the slice in the
cfg.DisabledModelsJSON branch succeeds with a nil result; update this parsing
path to explicitly reject null before or after unmarshaling. In the fallback
config logic that builds disabled, keep allowing only a JSON array or boolean
object shape, and if the expanded value is null return an error instead of
treating it as valid input.

Comment on lines 13 to 15
When a request fails, GoModel can retry it against alternate models. For
predictable behavior, use manual mode.
runtime failover uses manual rules only.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the broken overview sentence.

Line 14 reads awkwardly as “For runtime failover uses manual rules only.” Please rephrase it so the migration guidance is unambiguous to readers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/features/failover.mdx` around lines 13 - 15, The overview text in the
failover documentation has an awkward, broken sentence that makes the runtime
migration guidance unclear. Update the affected prose in the failover overview
to rephrase the runtime failover guidance into a complete, unambiguous sentence,
keeping the meaning that runtime failover relies on manual rules only. Locate
the text near the introductory failover description and adjust the surrounding
sentence flow for clarity.

Comment thread docs/openapi.json
Comment on lines +952 to +1003
"responses": {
"200": {
"description": "OK",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/failover.View"
}
}
}
},
"400": {
"description": "Bad Request",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"401": {
"description": "Unauthorized",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"502": {
"description": "Bad Gateway",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"503": {
"description": "Service Unavailable",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
}
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Document the 204 No Content upsert path.

internal/admin/handler_failover.go can return 204 after a successful PUT /admin/failover when the refreshed view is unavailable, but this operation only advertises 200. That makes the published API contract narrower than the handler.

Suggested OpenAPI fix
         "responses": {
           "200": {
             "description": "OK",
             "content": {
               "application/json": {
                 "schema": {
                   "$ref": "`#/components/schemas/failover.View`"
                 }
               }
             }
           },
+          "204": {
+            "description": "No Content"
+          },
           "400": {
             "description": "Bad Request",
             "content": {
               "application/json": {
                 "schema": {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"responses": {
"200": {
"description": "OK",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/failover.View"
}
}
}
},
"400": {
"description": "Bad Request",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"401": {
"description": "Unauthorized",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"502": {
"description": "Bad Gateway",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
},
"503": {
"description": "Service Unavailable",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/core.GatewayError"
}
}
}
}
},
"responses": {
"200": {
"description": "OK",
"content": {
"application/json": {
"schema": {
"$ref": "`#/components/schemas/failover.View`"
}
}
}
},
"204": {
"description": "No Content"
},
"400": {
"description": "Bad Request",
"content": {
"application/json": {
"schema": {
"$ref": "`#/components/schemas/core.GatewayError`"
}
}
}
},
"401": {
"description": "Unauthorized",
"content": {
"application/json": {
"schema": {
"$ref": "`#/components/schemas/core.GatewayError`"
}
}
}
},
"502": {
"description": "Bad Gateway",
"content": {
"application/json": {
"schema": {
"$ref": "`#/components/schemas/core.GatewayError`"
}
}
}
},
"503": {
"description": "Service Unavailable",
"content": {
"application/json": {
"schema": {
"$ref": "`#/components/schemas/core.GatewayError`"
}
}
}
}
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/openapi.json` around lines 952 - 1003, The failover upsert response
contract is missing the successful 204 No Content path. Update the responses for
the PUT /admin/failover operation in docs/openapi.json to include a 204 entry
alongside the existing 200/400/401/502/503 responses, matching the behavior in
internal/admin/handler_failover.go and its refreshed view handling.

Comment thread docs/openapi.json
Comment on lines +6775 to +6782
"admin.deleteFailoverRuleRequest": {
"type": "object",
"properties": {
"source": {
"type": "string"
}
}
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Mark source as required in both failover request schemas.

Both the delete and upsert handlers reject missing or blank source, but these schemas leave it optional. That weakens generated clients and misstates the server contract.

Suggested OpenAPI fix
       "admin.deleteFailoverRuleRequest": {
         "type": "object",
+        "required": [
+          "source"
+        ],
         "properties": {
           "source": {
             "type": "string"
           }
         }
       },
@@
       "admin.upsertFailoverRuleRequest": {
         "type": "object",
+        "required": [
+          "source"
+        ],
         "properties": {
           "description": {
             "type": "string"
           },

Also applies to: 6957-6976

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/openapi.json` around lines 6775 - 6782, The failover request schemas
currently define source as optional even though the delete and upsert handlers
require it; update both admin.deleteFailoverRuleRequest and
admin.upsertFailoverRuleRequest in the OpenAPI document to mark source as
required. Make sure the required field list matches the server contract so
generated clients treat source as mandatory for both request shapes.

Comment on lines +82 to +88
this.failoverFormManaged = Boolean(rule.managed);
this.failoverFormOriginalSource = rule.source || '';
this.failoverForm = {
source: rule.source || '',
targets: (Array.isArray(rule.targets) ? rule.targets : []).join('\n'),
description: rule.description || '',
enabled: rule.enabled !== false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use managed_source consistently for managed-rule state.

This module still keys off rule.managed, but the view contract in this PR uses managed_source. As written, config-managed rules will be labeled as normal dashboard rules here, and the same mismatch leaks into the template’s edit/delete affordances. Gate the UI off managed_source instead.

Also applies to: 297-300

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/failover.js` around lines 82 - 88,
The failover UI is still using rule.managed to determine whether a rule is
config-managed, but this PR’s contract uses managed_source instead. Update the
failover state setup in the failover module to derive the managed flag from
managed_source, and make sure the template logic for the edit/delete affordances
also checks managed_source consistently so managed rules are treated as
non-editable dashboard entries.

Comment on lines +169 to +172
// @Success 200 {array} failover.View
// @Failure 401 {object} core.GatewayError
// @Failure 503 {object} core.GatewayError
// @Router /admin/failover/generate [post]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Don't return draft suggestions as failover.View.

failover.View is the persisted-rule schema, but this endpoint fabricates only a subset of those fields. The generated payload will serialize zero created_at/updated_at values and a stored-view shape for objects that have not been saved yet. Please switch /admin/failover/generate to a dedicated typed draft DTO instead of reusing failover.View.

Suggested direction
+type generateFailoverRuleResponse struct {
+	Source      string   `json:"source"`
+	Targets     []string `json:"targets"`
+	Description string   `json:"description,omitempty"`
+	Enabled     bool     `json:"enabled"`
+}
+
 func (h *Handler) GenerateFailoverRules(c *echo.Context) error {
 	...
-	suggestions := make([]failover.View, 0)
+	suggestions := make([]generateFailoverRuleResponse, 0)
 	...
-		suggestions = append(suggestions, failover.View{
-			Source:        source,
-			Targets:       targets,
-			Enabled:       true,
-			ManagedSource: failover.ManagedSourceDashboard,
-		})
+		suggestions = append(suggestions, generateFailoverRuleResponse{
+			Source:  source,
+			Targets: targets,
+			Enabled: true,
+		})
 	}
 	return c.JSON(http.StatusOK, suggestions)
 }

Also applies to: 207-212

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/handler_failover.go` around lines 169 - 172, The
`/admin/failover/generate` endpoint is documented as returning `failover.View`,
but it only fabricates draft data and should not use the persisted-rule shape.
Update the handler/Swagger annotations in `admin/handler_failover.go` to return
a dedicated draft DTO for this endpoint, and adjust the generation path in the
failover handler so it serializes only the draft fields instead of
`created_at`/`updated_at` or other stored-view-only values.

Comment on lines +90 to +98
refreshInterval := time.Duration(cfg.Cache.Model.RefreshInterval) * time.Second
if refreshInterval <= 0 {
refreshInterval = time.Hour
}
return &Result{
Service: service,
Store: store,
stopRefresh: service.StartBackgroundRefresh(refreshInterval),
}, nil

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Don't bind failover propagation to the model-cache refresh interval.

Dashboard writes refresh only the local process. Every other instance has to wait for this ticker, and the current fallback to time.Hour means a manual failover change can stay stale on peer nodes for up to an hour. For an incident-response control, that lag is too high; give failover its own short refresh interval or an explicit invalidation path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/failover/factory.go` around lines 90 - 98, The failover factory
currently reuses the model-cache refresh ticker for propagation, which can leave
peer nodes stale for too long. Update the Result construction in the factory
logic to use a dedicated, short failover refresh interval or an explicit
invalidation mechanism instead of cfg.Cache.Model.RefreshInterval and the
time.Hour fallback. Keep the change localized around the
service.StartBackgroundRefresh call so failover updates propagate independently
from the model cache.

Comment on lines +60 to +89
func TestServiceConfigRulesOverrideDashboardRules(t *testing.T) {
store := newMemoryStore(Rule{
Source: "gpt-4o",
Targets: []string{"openrouter/gpt-4o"},
Enabled: true,
ManagedSource: ManagedSourceDashboard,
})
service, err := NewService(store, config.FallbackConfig{
Enabled: true,
Manual: map[string][]string{
"gpt-4o": {"azure/gpt-4o"},
},
})
if err != nil {
t.Fatalf("NewService() error = %v", err)
}
if err := service.Refresh(context.Background()); err != nil {
t.Fatalf("Refresh() error = %v", err)
}

got := service.Rules()["gpt-4o"]
want := []string{"azure/gpt-4o"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("Rules()[gpt-4o] = %v, want %v", got, want)
}
view, ok := service.Get("gpt-4o")
if !ok || view.ManagedSource != ManagedSourceConfig {
t.Fatalf("Get(gpt-4o) = %+v, %v; want config-managed rule", view, ok)
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Add regression coverage for the new write and error paths.

This only exercises the happy-path merge. The new contract also depends on rejecting config-managed writes, preserving config rules across ResetDashboardRules, and surfacing refresh/upsert storage failures correctly.

As per coding guidelines, **/*_test.go: “Add or update tests for behavior changes, and ensure tests cover ... error handling, default configuration ...”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/failover/service_test.go` around lines 60 - 89, The current test
only covers the happy-path merge in
TestServiceConfigRulesOverrideDashboardRules; add regression tests for the new
write/error behavior too. Extend coverage around
Service/Refresh/UpsertRule/ResetDashboardRules to verify config-managed rules
cannot be overwritten by dashboard writes, config rules survive
ResetDashboardRules, and storage failures from refresh/upsert are surfaced as
errors instead of being swallowed. Use the existing symbols NewService, Refresh,
Rules, Get, and the managed-source constants to keep the tests aligned with the
failover contract.

Source: Coding guidelines

Comment on lines +183 to +185
if existing, err := s.store.Get(ctx, normalized.Source); err == nil && existing != nil {
normalized.CreatedAt = existing.CreatedAt
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Propagate unexpected Store.Get failures before writing.

A non-ErrNotFound read error is silently ignored here, so an update can still succeed after a failed pre-read and overwrite metadata as if the rule were new. That both hides a storage fault from the admin API and can reset CreatedAt on existing rows.

Suggested fix
 import (
 	"context"
+	"errors"
 	"fmt"
 	"log/slog"
 	"sort"
@@
-	if existing, err := s.store.Get(ctx, normalized.Source); err == nil && existing != nil {
+	existing, err := s.store.Get(ctx, normalized.Source)
+	if err != nil && !errors.Is(err, ErrNotFound) {
+		return fmt.Errorf("get failover rule: %w", err)
+	}
+	if err == nil && existing != nil {
 		normalized.CreatedAt = existing.CreatedAt
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if existing, err := s.store.Get(ctx, normalized.Source); err == nil && existing != nil {
normalized.CreatedAt = existing.CreatedAt
}
import (
"context"
"errors"
"fmt"
"log/slog"
"sort"
)
existing, err := s.store.Get(ctx, normalized.Source)
if err != nil && !errors.Is(err, ErrNotFound) {
return fmt.Errorf("get failover rule: %w", err)
}
if err == nil && existing != nil {
normalized.CreatedAt = existing.CreatedAt
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/failover/service.go` around lines 183 - 185, The pre-read in
`service.go` is swallowing unexpected `Store.Get` errors and still allowing the
write to continue, which can hide storage failures and overwrite metadata.
Update the `existing, err := s.store.Get(ctx, normalized.Source)` check in the
rule write path to distinguish `ErrNotFound` from other errors: keep the current
`CreatedAt` preservation only when a record exists, but immediately return any
non-`ErrNotFound` error before proceeding. Ensure the fix is applied in the same
update flow that uses `normalized.CreatedAt` and `s.store.Get`.

Comment thread internal/failover/store_mongodb.go
@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Copy link
Copy Markdown

Confidence Score: 3/5

Two behavioral issues exist in the changed runtime paths: one affects traffic routing for existing deployments that previously disabled fallback via deprecated default_mode: off, and one creates a window where stale dynamic snapshots can overwrite authoritative static config rules in the resolver.

The backward-compatibility gap — FEATURE_FALLBACK_MODE=off no longer disables fallback with no automatic migration to FAILOVER_ENABLED=false — could silently enable fallback for operators who deliberately turned it off while keeping manual rules configured. Separately, effectiveManualRules() applies the dynamic snapshot last, so a briefly stale in-memory view can overwrite static config entries for the same key. Both issues touch the hot request routing path.

config/fallback.go (backward-compat migration), internal/fallback/resolver.go (effectiveManualRules merge order), and internal/failover/service.go (Upsert error handling).

Reviews (1): Last reviewed commit: "feat: add manual failover management" | Re-trigger Greptile

Comment thread config/fallback.go
Comment on lines 90 to 94
}

cfg.DefaultMode = ResolveFallbackDefaultMode(cfg.DefaultMode)
if !cfg.DefaultMode.Valid() {
return fmt.Errorf("fallback.default_mode must be one of: auto, manual, off")
}

if len(cfg.Overrides) > 0 {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 default_mode: off no longer disables failover — silent behavior change on upgrade

FEATURE_FALLBACK_MODE=off / default_mode: off was the documented way to globally disable fallback. The old resolver gate was if mode == FallbackModeOff && len(cfg.Manual) == 0 && len(cfg.Overrides) == 0 { return nil }, so with manual rules present, the resolver existed but modeFor() returned FallbackModeOff for every model that lacked an explicit mode: manual override — those models got no fallbacks.

After this PR, Enabled defaults to true and DefaultMode is silently ignored at runtime. Any operator who had default_mode: off (or FEATURE_FALLBACK_MODE=off) together with a populated manual_rules_path (or FALLBACK_MANUAL_RULES_PATH) will silently receive fallbacks for all those models on the next deploy.

loadFallbackConfig processes the deprecated DefaultMode field but never maps FallbackModeOff → Enabled=false. Either add that migration or update the deprecation notice and migration docs to call out this specific case explicitly.

Comment on lines +175 to +195
requiredCategory := requiredCategoryForOperation(op)
if requiredCategory == core.CategoryEmbedding {
return nil
}
source := r.sourceModelInfo(resolution)
if r.disabledFor(resolution, source) {
return nil
}
sourceKey := r.sourceKey(resolution, source)
seen := make(map[string]struct{})
for _, selector := range r.manualSelectorsFor(resolution, source, sourceKey, seen) {
seen[selector.QualifiedModel()] = struct{}{}
}
return r.autoSelectorsFor(source, sourceKey, requiredCategory, seen)
}

func (r *Resolver) effectiveManualRules() map[string][]string {
if r.ruleProvider == nil {
return r.manual
}
dynamic := r.ruleProvider.Rules()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Dynamic rules silently override static config rules in effectiveManualRules()

In effectiveManualRules(), r.manual is written first then dynamic overwrites every shared key. If the service's in-memory state is stale during a background refresh, the dynamic map may not yet carry the correct config values and would overwrite the authoritative static snapshot in r.manual. Consider reversing the merge order so r.manual wins for any key it already owns.

Comment on lines +162 to +170
source = strings.TrimSpace(source)
for _, row := range s.List() {
if row.Source == source {
return &row, true
}
}
return nil, false
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent error swallowing in Upsert when preserving CreatedAt

When s.store.Get returns an error other than ErrNotFound, the condition is false and CreatedAt is silently left at zero. stampUpsert then sets it to now, effectively resetting created_at on every transient DB failure. Distinguish ErrNotFound from other errors and propagate the latter.

Comment thread internal/admin/handler_failover.go Outdated
Comment on lines +206 to +240
}
suggestions = append(suggestions, failover.View{
Source: source,
Targets: targets,
Enabled: true,
ManagedSource: failover.ManagedSourceDashboard,
})
}
return c.JSON(http.StatusOK, suggestions)
}

func modelSupportsCategory(meta *core.ModelMetadata, category core.ModelCategory) bool {
if meta == nil || len(meta.Categories) == 0 {
return true
}
for _, candidate := range meta.Categories {
if candidate == category {
return true
}
}
return false
}

func (h *Handler) findFailoverView(source string) (failover.View, bool) {
for _, view := range h.failoverRules.ListViews() {
if view.Source == source {
return view, true
}
}
return failover.View{}, false
}

func failoverWriteError(err error) error {
if errors.Is(err, failover.ErrManaged) {
return core.NewInvalidRequestError("failover rule is managed by configuration and cannot be changed in the dashboard", err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 GenerateFailoverRules is O(N²) with no rate-limit protection

The handler calls resolver.SuggestFallbacks() per model, and SuggestFallbacks internally iterates the full registry for each call. For a large registry this can be a sustained CPU spike with no timeout or rate-limit guard on the endpoint.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
internal/admin/dashboard/static/js/modules/failover.js (1)

34-37: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Clear generated drafts when the failover feature becomes unavailable.

On the 503 and fetch-error paths, this only clears failoverRules. page-failover.html still renders failoverGeneratedRules independently, so stale generated drafts remain visible and reopenable after the backend reports the feature as unavailable. Reset failoverGeneratedRules in these early-return paths too.

Also applies to: 49-53

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/failover.js` around lines 34 - 37,
The failover unavailable paths in failover.js only clear failoverRules, leaving
stale drafts in failoverGeneratedRules visible in page-failover.html. Update the
early-return branches in the fetch response handling (including the 503 and
fetch-error paths) to also reset failoverGeneratedRules alongside
failoverAvailable and failoverRules. Use the failoverGeneratedRules state in the
failover.js module as the target to clear so the UI no longer shows reopenable
generated drafts after the feature becomes unavailable.
internal/admin/handler_failover.go (1)

69-83: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Normalize fallback_models before storing them.

primary_model is trimmed and validated, but each fallback selector is persisted verbatim. A payload like [" azure/gpt-4o ", ""] will survive this PUT and store selectors that will never match later. Trim each entry and reject blank targets before building the rule.

Suggested fix
 	source := strings.TrimSpace(req.PrimaryModel)
 	if source == "" {
 		return handleError(c, core.NewInvalidRequestError("primary_model is required", nil))
 	}
+	targets := make([]string, 0, len(req.FallbackModels))
+	for _, raw := range req.FallbackModels {
+		target := strings.TrimSpace(raw)
+		if target == "" {
+			return handleError(c, core.NewInvalidRequestError("fallback_models must not contain blank entries", nil))
+		}
+		targets = append(targets, target)
+	}
 	enabled := true
 	if existing, ok := h.failoverRules.Get(source); ok && existing != nil {
 		enabled = existing.Enabled
 	}
@@
 	rule := failover.Rule{
 		Source:  source,
-		Targets: req.FallbackModels,
+		Targets: targets,
 		Enabled: enabled,
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/handler_failover.go` around lines 69 - 83, The failover rule
creation path in handler_failover.go trims and validates primary_model, but
stores req.FallbackModels unchanged in failover.Rule.Targets. Normalize each
fallback model before constructing the rule by trimming whitespace in the same
handler logic, and reject any blank entries with an invalid request error so
only valid selectors are persisted.
docs/features/failover.mdx (1)

42-52: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Document the disable-list controls too.

This page explains manual rule sources, but it never mentions fallback.disabled_models / FAILOVER_DISABLED_MODELS_JSON, even though both are now part of the exposed config surface in config/config.example.yaml and .env.template. Without that, operators have no docs path for opting specific selectors out of failover.

Suggested addition
 For env-only deployments, use inline JSON:
@@
 `fallback.default_mode` and `FEATURE_FALLBACK_MODE` are deprecated
 compatibility inputs. They are accepted but ignored by runtime failover.
+
+To disable failover for specific primary model selectors, use
+`fallback.disabled_models` in YAML or `FAILOVER_DISABLED_MODELS_JSON` in env-only
+deployments. Example:
+
+```bash
+FAILOVER_DISABLED_MODELS_JSON='["claude-sonnet-4"]'
+```
 
 The Dashboard can generate suggested mappings from model metadata, but
 suggestions must be saved as manual mappings before they affect traffic.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/features/failover.mdx` around lines 42 - 52, The failover docs in the
failover page are missing the disable-list configuration, so add coverage for
fallback.disabled_models and FAILOVER_DISABLED_MODELS_JSON alongside the
existing rule source examples. Update the documentation near the inline JSON /
environment variable examples to show how operators can opt specific selectors
out of failover, and mention that this is part of the supported config surface
just like the manual mappings described in the page.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/failover/store_postgresql.go`:
- Around line 26-29: The PostgreSQL failover store DDL only covers fresh
installs, so upgraded databases with the old source/targets columns will still
break when FailoverStore reads or upserts failover_rules. Update the store
initialization/migration path in FailoverStore or its schema setup to detect the
legacy layout and run a real ALTER TABLE migration that renames source to
primary_model and targets to fallback_models before any queries use the new
names.

In `@internal/failover/store_sqlite.go`:
- Around line 21-24: The failover_rules initializer in store_sqlite.go only
creates the new table shape for fresh databases, so existing SQLite installs
will keep the old source/targets schema and break later queries. Update the
initialization/migration path in the failover store setup to detect and migrate
the existing failover_rules table before any SELECT or upsert logic runs,
backfilling/renaming source and targets into primary_model and fallback_models
so the new column names are always available.

In `@internal/failover/types.go`:
- Line 17: The BSON tag change on Targets in the failover types breaks decoding
of legacy failover_rules documents that still persist targets, so
MongoDBStore.List/Get will return empty Targets and upserts may rewrite them
under fallback_models. Add backward-compatible read handling in the failover
decoding path (or a migration in MongoDBStore) so both targets and
fallback_models are accepted, and only switch the persisted tag after existing
documents can be read reliably.

---

Outside diff comments:
In `@docs/features/failover.mdx`:
- Around line 42-52: The failover docs in the failover page are missing the
disable-list configuration, so add coverage for fallback.disabled_models and
FAILOVER_DISABLED_MODELS_JSON alongside the existing rule source examples.
Update the documentation near the inline JSON / environment variable examples to
show how operators can opt specific selectors out of failover, and mention that
this is part of the supported config surface just like the manual mappings
described in the page.

In `@internal/admin/dashboard/static/js/modules/failover.js`:
- Around line 34-37: The failover unavailable paths in failover.js only clear
failoverRules, leaving stale drafts in failoverGeneratedRules visible in
page-failover.html. Update the early-return branches in the fetch response
handling (including the 503 and fetch-error paths) to also reset
failoverGeneratedRules alongside failoverAvailable and failoverRules. Use the
failoverGeneratedRules state in the failover.js module as the target to clear so
the UI no longer shows reopenable generated drafts after the feature becomes
unavailable.

In `@internal/admin/handler_failover.go`:
- Around line 69-83: The failover rule creation path in handler_failover.go
trims and validates primary_model, but stores req.FallbackModels unchanged in
failover.Rule.Targets. Normalize each fallback model before constructing the
rule by trimming whitespace in the same handler logic, and reject any blank
entries with an invalid request error so only valid selectors are persisted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 111eae64-2273-418b-ba43-1ec136d2ce31

📥 Commits

Reviewing files that changed from the base of the PR and between 73f3f20 and b1bf8d4.

📒 Files selected for processing (14)
  • .env.template
  • config/config.example.yaml
  • docs/features/failover.mdx
  • docs/openapi.json
  • internal/admin/dashboard/static/js/modules/failover.js
  • internal/admin/dashboard/templates/layout.html
  • internal/admin/dashboard/templates/page-failover.html
  • internal/admin/handler_failover.go
  • internal/failover/service.go
  • internal/failover/store.go
  • internal/failover/store_mongodb.go
  • internal/failover/store_postgresql.go
  • internal/failover/store_sqlite.go
  • internal/failover/types.go

Comment thread internal/failover/store_postgresql.go
Comment thread internal/failover/store_sqlite.go
// Rule is one manual failover mapping for a primary model selector.
type Rule struct {
Source string `json:"primary_model" bson:"_id"`
Targets []string `json:"fallback_models" bson:"fallback_models"`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Keep backward-compatible BSON decoding for existing failover documents.

Changing Targets to bson:"fallback_models" means legacy failover_rules documents that still store targets will decode with empty Targets in MongoDBStore.List/Get, and the next upsert will rewrite them under the new field name. Add a migration or dual-read compatibility path before switching the persisted tag.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/failover/types.go` at line 17, The BSON tag change on Targets in the
failover types breaks decoding of legacy failover_rules documents that still
persist targets, so MongoDBStore.List/Get will return empty Targets and upserts
may rewrite them under fallback_models. Add backward-compatible read handling in
the failover decoding path (or a migration in MongoDBStore) so both targets and
fallback_models are accepted, and only switch the persisted tag after existing
documents can be read reliably.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/failover/store_postgresql.go`:
- Around line 40-63: Scope the migration checks in the failover PostgreSQL
migration so they only inspect the active schema. In the migration logic that
renames failover_rules columns in store_postgresql.go, update each
information_schema.columns existence check to include table_schema =
current_schema() alongside table_name and column_name, so the ALTER TABLE
failover_rules statements only depend on columns in the schema being migrated.

In `@internal/failover/store_sqlite_test.go`:
- Around line 51-64: The test in store_sqlite_test.go currently verifies only
Source and Targets, so add assertions for the migrated metadata fields that the
rule maps as well. Update the existing List() result checks in the relevant test
case to validate enabled, managed_source, created_at, and updated_at on the
returned row from store.List, using the same row variable already referenced in
the test so regressions in metadata preservation fail here too.

In `@internal/failover/store_sqlite.go`:
- Around line 104-113: The SQLite migration in the failover store is selecting
the legacy primary key value without normalizing it, so rows can be inserted
with leading/trailing spaces even though lookups trim inputs. Update the
migration query in the failover store logic (the code building the INSERT/SELECT
for failover_rules in store_sqlite.go) so the primary key expression is TRIMmed
before being stored, matching the normalization used by Get and Delete and the
TRIM filter already applied in the source query.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0dd8c12f-4b04-4ba5-a223-ce262e5afac1

📥 Commits

Reviewing files that changed from the base of the PR and between b1bf8d4 and a9e37d2.

📒 Files selected for processing (6)
  • internal/admin/dashboard/templates/model-table-body.html
  • internal/admin/dashboard/templates/shuffle-icon.html
  • internal/failover/store_mongodb.go
  • internal/failover/store_postgresql.go
  • internal/failover/store_sqlite.go
  • internal/failover/store_sqlite_test.go

Comment on lines +40 to +63
IF EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_name = 'failover_rules' AND column_name = 'source'
) AND NOT EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_name = 'failover_rules' AND column_name = 'primary_model'
) THEN
ALTER TABLE failover_rules RENAME COLUMN source TO primary_model;
END IF;
IF EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_name = 'failover_rules' AND column_name = 'targets'
) AND NOT EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_name = 'failover_rules' AND column_name = 'fallback_models'
) THEN
ALTER TABLE failover_rules RENAME COLUMN targets TO fallback_models;
END IF;
IF EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_name = 'failover_rules' AND column_name = 'description'
) THEN
ALTER TABLE failover_rules DROP COLUMN description;
END IF;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

git ls-files internal/failover/store_postgresql.go && wc -l internal/failover/store_postgresql.go && sed -n '1,140p' internal/failover/store_postgresql.go

Repository: ENTERPILOT/GoModel

Length of output: 4263


🏁 Script executed:

rg -n "failover_rules|search_path|current_schema|table_schema" -S .

Repository: ENTERPILOT/GoModel

Length of output: 6112


Scope the migration checks to the active schema. The information_schema.columns probes should filter on table_schema = current_schema() so another schema’s failover_rules table can’t make this migration skip renaming source/targets on the table ALTER TABLE failover_rules will actually touch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/failover/store_postgresql.go` around lines 40 - 63, Scope the
migration checks in the failover PostgreSQL migration so they only inspect the
active schema. In the migration logic that renames failover_rules columns in
store_postgresql.go, update each information_schema.columns existence check to
include table_schema = current_schema() alongside table_name and column_name, so
the ALTER TABLE failover_rules statements only depend on columns in the schema
being migrated.

Comment on lines +51 to +64
rows, err := store.List(context.Background())
if err != nil {
t.Fatalf("List() error = %v", err)
}
if len(rows) != 1 {
t.Fatalf("List() returned %d rows, want 1", len(rows))
}
if rows[0].Source != "gpt-4o" {
t.Fatalf("row source = %q, want gpt-4o", rows[0].Source)
}
wantTargets := []string{"azure/gpt-4o", "gemini/gemini-2.5-pro"}
if !reflect.DeepEqual(rows[0].Targets, wantTargets) {
t.Fatalf("row targets = %v, want %v", rows[0].Targets, wantTargets)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the metadata fields migrated with the rule.

This test catches source/targets regressions, but not enabled, managed_source, created_at, or updated_at, even though the migration explicitly maps those columns. Add assertions so metadata-preservation regressions fail here. As per coding guidelines, **/*_test.go: Add or update tests for behavior changes.

Proposed test assertions
 	if !reflect.DeepEqual(rows[0].Targets, wantTargets) {
 		t.Fatalf("row targets = %v, want %v", rows[0].Targets, wantTargets)
 	}
+	if !rows[0].Enabled {
+		t.Fatalf("row enabled = false, want true")
+	}
+	if rows[0].ManagedSource != ManagedSourceDashboard {
+		t.Fatalf("row managed source = %q, want %q", rows[0].ManagedSource, ManagedSourceDashboard)
+	}
+	if got := rows[0].CreatedAt.Unix(); got != 100 {
+		t.Fatalf("row created_at = %d, want 100", got)
+	}
+	if got := rows[0].UpdatedAt.Unix(); got != 200 {
+		t.Fatalf("row updated_at = %d, want 200", got)
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rows, err := store.List(context.Background())
if err != nil {
t.Fatalf("List() error = %v", err)
}
if len(rows) != 1 {
t.Fatalf("List() returned %d rows, want 1", len(rows))
}
if rows[0].Source != "gpt-4o" {
t.Fatalf("row source = %q, want gpt-4o", rows[0].Source)
}
wantTargets := []string{"azure/gpt-4o", "gemini/gemini-2.5-pro"}
if !reflect.DeepEqual(rows[0].Targets, wantTargets) {
t.Fatalf("row targets = %v, want %v", rows[0].Targets, wantTargets)
}
rows, err := store.List(context.Background())
if err != nil {
t.Fatalf("List() error = %v", err)
}
if len(rows) != 1 {
t.Fatalf("List() returned %d rows, want 1", len(rows))
}
if rows[0].Source != "gpt-4o" {
t.Fatalf("row source = %q, want gpt-4o", rows[0].Source)
}
wantTargets := []string{"azure/gpt-4o", "gemini/gemini-2.5-pro"}
if !reflect.DeepEqual(rows[0].Targets, wantTargets) {
t.Fatalf("row targets = %v, want %v", rows[0].Targets, wantTargets)
}
if !rows[0].Enabled {
t.Fatalf("row enabled = false, want true")
}
if rows[0].ManagedSource != ManagedSourceDashboard {
t.Fatalf("row managed source = %q, want %q", rows[0].ManagedSource, ManagedSourceDashboard)
}
if got := rows[0].CreatedAt.Unix(); got != 100 {
t.Fatalf("row created_at = %d, want 100", got)
}
if got := rows[0].UpdatedAt.Unix(); got != 200 {
t.Fatalf("row updated_at = %d, want 200", got)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/failover/store_sqlite_test.go` around lines 51 - 64, The test in
store_sqlite_test.go currently verifies only Source and Targets, so add
assertions for the migrated metadata fields that the rule maps as well. Update
the existing List() result checks in the relevant test case to validate enabled,
managed_source, created_at, and updated_at on the returned row from store.List,
using the same row variable already referenced in the test so regressions in
metadata preservation fail here too.

Source: Coding guidelines

Comment on lines +104 to +113
SELECT
%s,
%s,
%s,
%s,
%s,
%s
FROM failover_rules
WHERE TRIM(COALESCE(%s, '')) <> ''
`, primaryExpr, targetsExpr, enabledExpr, managedSourceExpr, createdAtExpr, updatedAtExpr, primaryExpr)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Trim the migrated primary key value before storing it.

Line 112 filters with TRIM(...), but Line 105 inserts the raw legacy value. A legacy source like " gpt-4o " will be migrated with spaces, while Get/Delete trim input and then miss the row.

Proposed fix
 		SELECT
-			%s,
+			TRIM(%s),
 			%s,
 			%s,
 			%s,
 			%s,
 			%s
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SELECT
%s,
%s,
%s,
%s,
%s,
%s
FROM failover_rules
WHERE TRIM(COALESCE(%s, '')) <> ''
`, primaryExpr, targetsExpr, enabledExpr, managedSourceExpr, createdAtExpr, updatedAtExpr, primaryExpr)
SELECT
TRIM(%s),
%s,
%s,
%s,
%s,
%s
FROM failover_rules
WHERE TRIM(COALESCE(%s, '')) <> ''
`, primaryExpr, targetsExpr, enabledExpr, managedSourceExpr, createdAtExpr, updatedAtExpr, primaryExpr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/failover/store_sqlite.go` around lines 104 - 113, The SQLite
migration in the failover store is selecting the legacy primary key value
without normalizing it, so rows can be inserted with leading/trailing spaces
even though lookups trim inputs. Update the migration query in the failover
store logic (the code building the INSERT/SELECT for failover_rules in
store_sqlite.go) so the primary key expression is TRIMmed before being stored,
matching the normalization used by Get and Delete and the TRIM filter already
applied in the source query.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
internal/admin/dashboard/static/js/modules/failover.js (2)

77-82: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use managed_source for managed-rule UI state.

The API/view contract here exposes managed_source, not managed, so config-managed mappings are still treated as editable dashboard rules and shown as On instead of Config. Normalize the managed flag from managed_source before openFailoverEdit() and failoverRuleStatus() consume it.

Also applies to: 363-375

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/failover.js` around lines 77 - 82,
The failover rule UI is reading the wrong field for managed state, so
config-managed mappings are treated like editable dashboard rules. Update the
failover flow to normalize and use managed_source instead of managed before
openFailoverEdit() sets form state and before failoverRuleStatus() renders the
status. Make sure the rule object consistently derives failoverFormManaged and
the displayed label from managed_source so managed rules show as Config rather
than On.

77-85: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Renaming the primary model in edit mode creates a second rule.

failoverFormOriginalSource is captured but never used on save. The admin API upserts by primary_model, so editing source leaves the old mapping active and persists a new one. Make source immutable while editing, or delete/replace the original key before the PUT.

Also applies to: 154-192

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/failover.js` around lines 77 - 85,
The edit flow in openFailoverEdit and the save path in failover.js are allowing
primary model renames to create a second rule because failoverFormOriginalSource
is captured but never applied. Update the edit/save logic so the original
primary_model key is preserved or explicitly deleted/replaced before the PUT,
and make the source field immutable while failoverFormMode is edit. Ensure the
code paths that build the payload and submit the form use
failoverFormOriginalSource consistently when editing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@internal/admin/dashboard/static/js/modules/failover.js`:
- Around line 77-82: The failover rule UI is reading the wrong field for managed
state, so config-managed mappings are treated like editable dashboard rules.
Update the failover flow to normalize and use managed_source instead of managed
before openFailoverEdit() sets form state and before failoverRuleStatus()
renders the status. Make sure the rule object consistently derives
failoverFormManaged and the displayed label from managed_source so managed rules
show as Config rather than On.
- Around line 77-85: The edit flow in openFailoverEdit and the save path in
failover.js are allowing primary model renames to create a second rule because
failoverFormOriginalSource is captured but never applied. Update the edit/save
logic so the original primary_model key is preserved or explicitly
deleted/replaced before the PUT, and make the source field immutable while
failoverFormMode is edit. Ensure the code paths that build the payload and
submit the form use failoverFormOriginalSource consistently when editing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 998194ca-8d28-4db5-8420-b3e20e6a23fe

📥 Commits

Reviewing files that changed from the base of the PR and between a9e37d2 and 17dfa09.

📒 Files selected for processing (5)
  • internal/admin/dashboard/static/css/dashboard.css
  • internal/admin/dashboard/static/js/modules/failover.js
  • internal/admin/dashboard/static/js/modules/failover.test.cjs
  • internal/admin/dashboard/templates/failover-icon.html
  • internal/admin/dashboard/templates/model-table-body.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants