Skip to content

Add crash-safe migration for legacy system secret keys#4346

Open
amirejaz wants to merge 12 commits intomainfrom
phase4-migration
Open

Add crash-safe migration for legacy system secret keys#4346
amirejaz wants to merge 12 commits intomainfrom
phase4-migration

Conversation

@amirejaz
Copy link
Copy Markdown
Contributor

Summary

  • Existing users may have registry tokens and workload auth secrets stored under bare keys (BEARER_TOKEN_, OAUTH_CLIENT_SECRET_, REGISTRY_OAUTH_, etc.) that pre-date the scoped provider wrappers. Without migration, those keys would be unreachable after Phase 3 wires callers to use ScopeWorkloads/ScopeRegistry providers.
  • Adds MigrateSystemKeys and DiscoverMigrations in pkg/secrets/migration.go: discovers bare system keys by known prefix and renames them into the __thv_<scope>_ namespace using write-before-delete ordering for crash safety.
  • Adds CheckAndPerformSecretScopeMigration in pkg/migration/secret_scope.go: guards migration behind a SecretScopeMigration config flag (same pattern as existing migrations) and is triggered from cmd/thv/main.go alongside the other startup migrations.

This is Phase 4 of the scoped secret store (part of #4192), tracking issue #4226. It is stacked on Phase 3 (#4343) and must be released at the same time.

Release note: This PR and the caller-wiring PR (#4343) must be merged before releasing. Neither is a breaking change on its own when no release occurs between the two merges.

Type of change

  • New feature (non-breaking change which adds functionality)

Test plan

  • go build ./... — clean build
  • go test ./pkg/secrets/... ./pkg/migration/... — all pass, including TestMigrateSystemKeys and TestDiscoverMigrations
  • golangci-lint run — 0 issues

Special notes for reviewers

Key design properties of the migration:

  • Write-before-delete: new key is written before the old is deleted, so a crash mid-migration leaves the secret reachable under the new name on retry.
  • Idempotent: if the old key is absent (already migrated or never existed), GetSecret returns a not-found error and the entry is silently skipped.
  • Discovery-based: DiscoverMigrations lists all secrets and matches against SystemKeyPrefixMappings; no static workload registry is required.
  • One-shot: the SecretScopeMigration config flag prevents re-running on every startup after migration completes.
  • Raw provider: migration uses CreateSecretProvider (bare, no wrapper) so it can enumerate and rename both old bare keys and already-scoped __thv_* keys without restriction.

Generated with Claude Code

@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 24, 2026
@amirejaz amirejaz marked this pull request as draft March 24, 2026 14:55
@amirejaz amirejaz force-pushed the phase3-wire-callers branch from 586f764 to 516c58c Compare March 26, 2026 10:47
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 26, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 47.54098% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.58%. Comparing base (03ae494) to head (8d2942e).

Files with missing lines Patch % Lines
pkg/migration/secret_scope.go 0.00% 32 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4346      +/-   ##
==========================================
- Coverage   69.61%   69.58%   -0.04%     
==========================================
  Files         494      496       +2     
  Lines       50457    50518      +61     
==========================================
+ Hits        35126    35151      +25     
- Misses      12635    12669      +34     
- Partials     2696     2698       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@amirejaz amirejaz force-pushed the phase3-wire-callers branch from 516c58c to a943793 Compare March 26, 2026 12:41
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 26, 2026
@amirejaz amirejaz marked this pull request as ready for review March 26, 2026 13:54
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 26, 2026
DefaultGroupMigration bool `yaml:"default_group_migration,omitempty"`
TelemetryConfigMigration bool `yaml:"telemetry_config_migration,omitempty"`
MiddlewareTelemetryMigration bool `yaml:"middleware_telemetry_migration,omitempty"`
SecretScopeMigration bool `yaml:"secret_scope_migration,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe consider in the future something like Migrations map[string]bool?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea — a Migrations map[string]bool would let us add future one-shot migrations without touching the struct shape each time. Flagging as a follow-up refactor once this stack lands; don't want to mix a config-schema change into this PR.

// leaves the secret reachable under the new key. Keys that do not exist in
// provider are silently skipped, making the function safe to retry.
func MigrateSystemKeys(ctx context.Context, provider Provider, migrations []KeyMigration) error {
for _, m := range migrations {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: keyMigrations instead of migrations?

@amirejaz amirejaz force-pushed the phase3-wire-callers branch from a4e622b to f55ffdc Compare March 27, 2026 09:49
@github-actions github-actions bot removed the size/M Medium PR: 300-599 lines changed label Mar 27, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 27, 2026
@amirejaz amirejaz force-pushed the phase3-wire-callers branch from 107da0e to 4a005ee Compare March 27, 2026 12:13
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 27, 2026
@amirejaz amirejaz requested a review from blkt as a code owner March 31, 2026 12:24
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 31, 2026
amirejaz and others added 10 commits March 31, 2026 19:16
When a user upgrades ToolHive, system secrets may still exist under
bare keys (e.g. BEARER_TOKEN_foo) until the secret scope migration
completes. If migration fails or hasn't run yet, ScopedProvider
callers would be unable to find their secrets under the new scoped
key (__thv_workloads_BEARER_TOKEN_foo), breaking workload auth.

Add a transparent fallback in ScopedProvider.GetSecret: on a
not-found response for the scoped key, also try the bare (pre-
migration) key. Once migration completes and bare keys are deleted,
the fallback finds nothing and becomes a natural no-op — no config
check or injection needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add t.Parallel() to TestScopedProvider_GetSecret_MigrationFallback and
its subtests to satisfy the paralleltest linter requirement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update every call site that creates a secrets provider to use the
appropriate wrapper introduced in Phase 1:

- System callers (workload auth tokens, registry credentials, build auth
  files) now use CreateScopedSecretProvider, placing secrets under the
  __thv_<scope>_ prefix and hiding them from user-facing commands.
- User-facing callers (thv secret commands, REST API, MCP tool server,
  header secrets, build-env-from-secrets) now use CreateUserSecretProvider,
  blocking access to __thv_* reserved keys.
- RunConfig.WithSecrets and ValidateSecrets now accept separate system and
  user providers so auth-token resolution and --secret flag resolution use
  the correct scope independently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verify that the UserProvider wiring in CLI secret commands correctly
blocks access to __thv_* keys:

- set with __thv_ prefix is rejected
- get with __thv_ prefix is rejected with "reserved for system use"
- delete with __thv_ prefix is rejected
- two distinct __thv_ key names are blocked, confirming the check is
  not tied to a single hardcoded name

Each test uses an isolated XDG_CONFIG_HOME/HOME temp directory so the
user's real secrets store is never touched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add GetUserSecretsProvider() to pkg/auth/secrets alongside the
  existing GetSecretsManager(), centralising the config-read +
  CreateUserSecretProvider boilerplate in one place

- Remove getScopedWorkloadsSecretsManager() from config_buildauthfile.go;
  it was identical to authsecrets.GetSecretsManager() — use that directly

- Replace inline boilerplate in config_buildenv.go/validateSecretExists
  and secret.go/getSecretsManager with calls to GetUserSecretsProvider(),
  eliminating duplicated setup-check + provider-create blocks

- Rename local variable manager → provider in GetSecretsManager() to
  match the naming convention used throughout the new provider code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Existing users may have registry tokens and workload auth secrets stored
under bare keys (BEARER_TOKEN_, OAUTH_CLIENT_SECRET_, REGISTRY_OAUTH_,
etc.) that pre-date the scoped provider wrappers. This migration renames
them into the __thv_<scope>_ namespace on first startup so they are
accessible via the new scoped providers and hidden from user-facing secret
commands.

Key design properties:
- Write-before-delete ordering: the new key is written before the old is
  deleted, so a crash mid-migration leaves the secret reachable.
- Idempotent: a missing old key is silently skipped, making retries safe.
- One-shot: guarded by the SecretScopeMigration config flag; once set,
  the migration is a cheap config read and returns immediately.
- Discovery-based: DiscoverMigrations lists all secrets and matches known
  system prefixes, so no static registry of workload names is required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up from the SecretScope type change: update SystemKeyPrefixMappings
Scope field from string to SecretScope, and replace the removed package-level
scopedKey() function with inline key construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rename local variable migrations → keyMigrations in DiscoverMigrations
  for clarity (distinguishes the slice from the function parameter name
  used in MigrateSystemKeys)
- Replace inline strings.HasPrefix(key, SystemKeyPrefix) with isSystemKey(key)
  to reuse the existing helper and keep the intent explicit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace CreateUserSecretProvider and CreateScopedSecretProvider with
the new CreateProvider(managerType, WithUserFacing()) and
CreateProvider(managerType, WithScope(scope)) API introduced in the
phase2-factory-helpers PR.

Affected callers:
- cmd/thv/app/registry_login.go
- cmd/thv/app/header_flags.go
- pkg/auth/secrets/secrets.go
- pkg/runner/env.go
- pkg/runner/runner.go
- pkg/runner/protocol.go
- pkg/mcp/server/set_secret.go
- pkg/mcp/server/list_secrets.go
- pkg/workloads/manager.go
- pkg/api/v1/secrets.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@amirejaz amirejaz requested a review from jerm-dro as a code owner March 31, 2026 14:18
@amirejaz amirejaz force-pushed the phase3-wire-callers branch from 4a005ee to d9168dc Compare March 31, 2026 15:39
@amirejaz amirejaz force-pushed the phase3-wire-callers branch from d9168dc to 8dabc6a Compare March 31, 2026 16:16
Base automatically changed from phase3-wire-callers to main April 1, 2026 09:28
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 1, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

2 participants