Add OTEL distributed tracing and optional Sentry integration to the API server#4365
Add OTEL distributed tracing and optional Sentry integration to the API server#4365
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4365 +/- ##
==========================================
+ Coverage 69.60% 69.65% +0.04%
==========================================
Files 490 497 +7
Lines 50269 50591 +322
==========================================
+ Hits 34988 35237 +249
- Misses 12595 12660 +65
- Partials 2686 2694 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| sentry.ConfigureScope(func(scope *sentry.Scope) { | ||
| scope.SetTag("custom.user_id", id) | ||
| }) | ||
| slog.Debug("sentry anonymous instance ID tagged", "id", id) |
There was a problem hiding this comment.
[Nit] The tag name custom.user_id is misleading — this is an anonymous instance ID, not a user ID. Anyone with Sentry project access could misinterpret it as PII. Combined with SendDefaultPII: true, this tag alongside IP addresses and request headers could de-anonymize the installation.
Also, Sentry's built-in PII detection heuristics may flag fields named user_id.
Suggestion: Rename to custom.instance_id.
There was a problem hiding this comment.
Fixed, renamed to custom.instance_id. Added a code comment noting that ToolHive Studio currently uses custom.user_id for the same value and should be aligned in a follow-up.
It’s not a problem right now, but you’re right, it should represent that consistently. The only downside is that I’d need to refactor all the dashboards. I’ll take care of that in the future; for now, it’s okay to have this drift.
pkg/api/server.go
Outdated
| if b.otelEnabled { | ||
| r.Use(otelhttp.NewMiddleware("thv-api", | ||
| otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string { | ||
| if routeCtx := chi.RouteContext(r.Context()); routeCtx != nil && routeCtx.RoutePattern() != "" { |
There was a problem hiding this comment.
[Bug - High] The span name formatter is called at span-start time by otelhttp, which is before chi's routing has populated RouteContext.RoutePattern(). At this point in the middleware chain, chi has created a RouteContext (so the nil check passes), but RoutePattern() returns "" because route matching hasn't happened yet.
otelhttp v0.65.0+ has a post-handler rename that checks r.Pattern, but that's a Go 1.22+ net/http.ServeMux feature — chi does not set r.Pattern.
Result: Every span will be named with the raw URL path (e.g., GET /api/v1beta/workloads/my-server) instead of the parameterized route (GET /api/v1beta/workloads/{name}). This causes unbounded cardinality in OTEL/Sentry backends, defeating route-based grouping.
Fix: Add an inner middleware (registered after chi routing) that renames the span:
func chiRouteTagMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
next.ServeHTTP(w, r)
if rctx := chi.RouteContext(r.Context()); rctx != nil && rctx.RoutePattern() != "" {
trace.SpanFromContext(r.Context()).SetName(r.Method + " " + rctx.RoutePattern())
}
})
}There was a problem hiding this comment.
Fixed, removed WithSpanNameFormatter entirely. Added chiRouteSpanNamer, a post-routing middleware that calls next.ServeHTTP first and then renames the span once RouteContext.RoutePattern() is populated. Spans now correctly show as GET /api/v1beta/workloads/{name}.
There was a problem hiding this comment.
Thinking more about this, I also added URL path parameters as span attributes in the same middleware. This way, the concrete MCP name (and any other route params) are still visible at the trace level without increasing cardinality:
- Span name → GET /api/v1beta/workloads/{name} (low cardinality, used for grouping)
- Span attribute → url.path_param.name = "my-server" (high cardinality, for per-trace drill-down)
This is implemented generically in chiRouteSpanNamer by iterating over rctx.URLParams.Keys/Values, so all parameterized routes automatically record their path params, no per-route code needed.
Addressed here
pkg/api/server.go
Outdated
| // The span name uses chi's matched route pattern (e.g. "GET /api/v1beta/workloads/{name}") | ||
| // for clean grouping in Sentry and OTEL backends. | ||
| if b.otelEnabled { | ||
| r.Use(otelhttp.NewMiddleware("thv-api", |
There was a problem hiding this comment.
[Security - Low] With otelhttp as the outermost middleware, it reads W3C traceparent and tracestate headers from incoming requests before authentication. An untrusted client can:
- Inject arbitrary trace IDs to correlate their requests across your telemetry backend
- Set
sampled=1in the traceparent to force 100% sampling of their requests, increasing telemetry costs - Pollute trace context with large
tracestateheaders
This is partially mitigated if you switch to a ParentBased sampler that ignores remote parent sampling decisions (see the comment on otlp/tracing.go).
There was a problem hiding this comment.
Acknowledged. Moving OTEL after auth would break the span lifetime guarantee for the recovery middleware (the span must be active when a panic is caught). Partially mitigated by switching to the ParentBased sampler, see the sampler comment below. Added a code comment documenting the trade-off.
| enableDocs bool, | ||
| oidcConfig *auth.TokenValidatorConfig, | ||
| otelEnabled bool, | ||
| middlewares ...func(http.Handler) http.Handler, |
There was a problem hiding this comment.
[Design - Must Fix] Adding otelEnabled bool as a positional parameter before the variadic middlewares breaks the public API of this exported function. The ServerBuilder already has a fluent With* pattern (WithDebugMode, WithDocs, WithOIDCConfig, etc.).
This approach also makes the call site progressively harder to read with 7+ positional booleans/strings.
Fix: The otelEnabled config should only be added as a field on ServerBuilder via the existing WithOtelEnabled(bool) method, and the Serve convenience function signature should remain unchanged. The caller can pass the OTEL middleware through the existing middlewares variadic, or the Serve function can accept a config struct.
There was a problem hiding this comment.
Fixed — removed otelEnabled from Serve(). The cmd layer now uses ServerBuilder directly and calls WithOtelEnabled(otelEnabled), keeping Serve()'s public signature clean.
| opts := []sdktrace.TracerProviderOption{ | ||
| sdktrace.WithResource(res), | ||
| sdktrace.WithSampler(sdktrace.TraceIDRatioBased(config.SamplingRate)), | ||
| ) |
There was a problem hiding this comment.
[Bug - Medium] Bare TraceIDRatioBased is a root-only sampler. When a remote parent exists with sampled=true (via the W3C traceparent header from ToolHive Studio), this sampler may still drop the span if the trace ID doesn't fall within the ratio. This breaks distributed tracing correlation.
The correct pattern is:
sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(config.SamplingRate)))This is pre-existing, but now materially impactful since this PR adds otelhttp traceparent extraction. Without ParentBased, the advertised Studio end-to-end tracing won't work reliably at sampling rates below 1.0.
There was a problem hiding this comment.
Fixed, wrapped with sdktrace.ParentBased(sdktrace.TraceIDRatioBased(...)). Incoming traceparent headers with sampled=true from ToolHive Studio are now always honoured regardless of the local sampling ratio.
| // (e.g. a Sentry bridge, Datadog exporter) to self-register during their own | ||
| // Init without coupling to the caller that creates the OTEL provider. | ||
| func RegisterSpanProcessor(p sdktrace.SpanProcessor) { | ||
| if p == nil { |
There was a problem hiding this comment.
[Design - Low] Two concerns with the global registry:
-
No dedup guard: If
sentry.Init()is called twice (e.g., config reload, tests),RegisterSpanProcessorappends the samesentryotel.NewSentrySpanProcessor()singleton pointer twice. This causesOnStart/OnEndto fire twice per span. -
Init ordering dependency:
sentry.Init()must be called beforetelemetry.NewProvider()for the processor to be picked up. This ordering is correct in the current code but is implicit and fragile. Consider documenting the ordering requirement, or adding a dedup check:
func RegisterSpanProcessor(p sdktrace.SpanProcessor) {
if p == nil { return }
globalProcessorsMu.Lock()
defer globalProcessorsMu.Unlock()
for _, existing := range globalProcessors {
if existing == p { return } // already registered
}
globalProcessors = append(globalProcessors, p)
}There was a problem hiding this comment.
Fixed,added a pointer-equality check in RegisterSpanProcessor so duplicate registrations are silently ignored. Added a doc comment documenting the ordering requirement (processors must register before NewProvider is called).
| func NewProvider(ctx context.Context, config Config) (*Provider, error) { | ||
| // Optional extra span processors (e.g. a Sentry bridge) can be registered via extraProcessors. | ||
| func NewProvider(ctx context.Context, config Config, extraProcessors ...sdktrace.SpanProcessor) (*Provider, error) { | ||
| // Validate configuration |
There was a problem hiding this comment.
[Testing] The new extraProcessors ...sdktrace.SpanProcessor parameter and the registeredSpanProcessors() merge on the lines below are never tested end-to-end. No test calls RegisterSpanProcessor(p) followed by NewProvider(ctx, config) and then verifies the processor ends up in the SDK tracer.
The strategy tests set ExtraSpanProcessors directly on the Config struct, and TestInit_RegistersSpanProcessor only checks HasRegisteredSpanProcessors() — the complete pipeline is untested.
There was a problem hiding this comment.
Added TestNewProvider_PicksUpRegisteredProcessor in pkg/telemetry/registry_test.go, registers a tracetest.SpanRecorder, calls NewProvider, creates a span, and asserts the recorder received the OnEnd callback. Also added TestRegisterSpanProcessor_Dedup to cover the new dedup guard.
| }() | ||
|
|
||
| assert.True(t, telemetry.HasRegisteredSpanProcessors()) | ||
| }) |
There was a problem hiding this comment.
[Testing - Medium] Only the guard clauses are tested here (not-initialized, nil-error). The actual hub.CaptureException(err) call — the primary error-reporting path and the whole point of the Sentry integration — is never reached in any test.
Suggestion: Add a happy-path test using sentry-go's test transport:
t.Run("captures exception when initialized", func(t *testing.T) {
initialized.Store(false)
transport := &sentry.TransportMock{}
err := sentry.Init(sentry.ClientOptions{
Dsn: "https://key@o0.ingest.sentry.io/0",
Transport: transport,
})
require.NoError(t, err)
initialized.Store(true)
defer initialized.Store(false)
req := httptest.NewRequest(http.MethodGet, "/", nil)
CaptureException(req, errors.New("test error"))
require.Equal(t, 1, len(transport.Events()))
})There was a problem hiding this comment.
Added captures exception when initialized sub-test using gosentry.MockTransport, verifies a call to CaptureException results in exactly one event on the transport after Flush.
| initialized.Store(true) | ||
| defer initialized.Store(false) | ||
| req := httptest.NewRequest(http.MethodGet, "/", nil) | ||
| CaptureException(req, nil) |
There was a problem hiding this comment.
[Testing - Medium] Same gap as CaptureException above — RecoverPanic happy path (initialized=true, non-nil recovered value) is never tested. The hub.RecoverWithContext() and hub.Flush() calls are completely unreached.
Also note: the hub fallback path (GetHubFromContext returns nil, clone CurrentHub()) — which is the only production path since the PR doesn't use sentryhttp middleware — is never exercised.
There was a problem hiding this comment.
Added recovers panic and creates Sentry event sub-test using gosentry.MockTransport, verifies the hub.RecoverWithContext + hub.Flush path creates an event, including the hub fallback path (clone CurrentHub()) which is the only production path without sentryhttp
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Drop NewMiddleware() backed by sentryhttp in favour of SpanProcessor() which returns a sentryotel.NewSentrySpanProcessor. This lets the OTEL SDK drive tracing while Sentry receives spans as a processor, aligning the API server with W3C traceparent-based distributed tracing. Also tags every Sentry event with custom.user_id from the anonymous instance ID so API server events correlate with toolhive-studio.
Add ExtraSpanProcessors to the provider config and thread them through NewProvider → StrategySelector → OTLPTracerStrategy → tracing.go. When extra processors are present but no OTLP endpoint is configured, a real SDK tracer provider is created (without an exporter) so processors still receive spans. This enables Sentry-only tracing without requiring an external collector.
Add initServeOTEL() to cmd/thv/app/server.go that reads the global OTEL
config and registers the Sentry span processor when available. In
Sentry-only mode (no OTLP endpoint) tracing is force-enabled at 100%
OTEL sampling so every span reaches the Sentry processor.
Replace sentryhttp middleware in pkg/api/server.go with otelhttp, using
WithSpanNameFormatter to produce route-aware span names like
'GET /api/v1beta/workloads/{name}' instead of a fixed 'thv-api'.
Replace direct Sentry calls (RecoverPanic, CaptureException) with standard OTEL span.RecordError + span.SetStatus(codes.Error). The Sentry span processor picks these up automatically, making both files vendor-agnostic — errors flow to any configured OTEL backend. Reorder middleware so otelhttp is outermost and recovery is inner. This ensures the span is still active when recovery catches a panic, so RecordError is not a no-op on an already-ended span.
Add telemetry.RegisterSpanProcessor so integrations self-register during their own Init instead of being wired explicitly by the caller. sentrypkg.Init now registers the Sentry bridge automatically, and initServeOTEL uses telemetry.HasRegisteredSpanProcessors with no Sentry-specific knowledge. Any future provider follows the same pattern.
Set SendDefaultPII: false to prevent OIDC bearer tokens and cookies from being shipped to Sentry SaaS. Rename the scope tag from custom.user_id to custom.instance_id to avoid PII detection heuristics and match its true meaning as an anonymous instance identifier.
Add a pointer-equality check in RegisterSpanProcessor so that calling sentry.Init more than once (e.g. in tests or on config reload) does not register the same processor twice and cause OnStart/OnEnd to fire twice per span. Document the ordering requirement: processors must be registered before NewProvider is called.
Wrap TraceIDRatioBased with ParentBased so that when ToolHive Studio sends a W3C traceparent header with sampled=true, the API server always samples the child span regardless of the local ratio. Without this, distributed traces from Studio could be silently dropped at sampling rates below 1.0.
Replace raw panic values and full error chains in span.RecordError with generic messages to avoid sending potentially sensitive data to external telemetry backends. Full details remain in the local slog output. Also re-add explicit sentrypkg.RecoverPanic and sentrypkg.CaptureException calls alongside span.RecordError: the Sentry span processor only creates transactions, not Issues. Both calls are needed for errors to appear in the Sentry Issues tab.
Replace WithSpanNameFormatter (which runs before chi populates
RouteContext.RoutePattern) with chiRouteSpanNamer, an inner middleware
that renames the span after routing has resolved. This produces names like
GET /api/v1beta/workloads/{name} instead of raw URL paths, preventing
unbounded cardinality in OTEL and Sentry backends.
Also remove the otelEnabled positional parameter from Serve() to keep its
public signature stable. Callers that need OTEL control use ServerBuilder
directly via WithOtelEnabled.
Move the 50-line initServeOTEL function from cmd/thv/app/server.go to pkg/telemetry/NewServeProvider following the cli-commands convention that cmd/ must be thin wrappers. The new function encapsulates config reading, telemetry provider construction, and the Sentry-only sampling override.
Replace the goroutine-based OTEL provider shutdown with deterministic LIFO defer ordering: the OTEL provider (which flushes the Sentry span processor) now shuts down before sentrypkg.Close(), eliminating the race where sentrypkg.Close could run first and leave the span processor without a client to flush to. Also read SENTRY_DSN and SENTRY_ENVIRONMENT from environment variables as fallback when the corresponding CLI flags are not set, which avoids exposing the DSN in ps output.
Add TestNewProvider_PicksUpRegisteredProcessor to verify the complete RegisterSpanProcessor -> NewProvider pipeline end-to-end using a real SpanRecorder. Add TestRegisterSpanProcessor_Dedup to cover the new dedup guard. Add happy-path tests for CaptureException and RecoverPanic using sentry.MockTransport to confirm events reach the Sentry transport.
…base OpenTelemetryConfig.TracingEnabled and MetricsEnabled changed to *bool in main. Dereference them safely in NewServeProvider and use the already resolved bool values in the slog debug statement. Made-with: Cursor
Correct indentation inside Middleware's panic recovery block (lines were at 3 tabs instead of the required 4 inside the if rec != nil branch). Rename unused method receivers in registry_test.go to _ to satisfy revive's unused-receiver rule. Made-with: Cursor
gofmt requires brace alignment on adjacent single-line functions. staticcheck ST1006 requires omitting unused receiver names entirely rather than using _. Made-with: Cursor
The parameterised span name (e.g. GET /api/v1beta/workloads/{name})
keeps cardinality low for grouping in OTEL/Sentry. The resolved path
parameters (url.path_param.name = "my-server") are now also recorded
as span attributes, so the concrete MCP name remains visible at the
individual trace level without inflating backend cardinality.
Made-with: Cursor
Summary
Adds optional distributed tracing and error reporting to `thv serve` using an OTEL-first approach with Sentry as an optional backend.
When no DSN and no OTEL endpoint are configured, all operations are no-ops with zero overhead.
Type of change
Does this introduce a user-facing change?
Yes. `thv serve` now accepts optional `--sentry-dsn`, `--sentry-environment` and `--sentry-traces-sample-rate` flags (also readable from `SENTRY_DSN` / `SENTRY_ENVIRONMENT` env vars). Distributed tracing is configured via the existing `thv config otel set-endpoint` command.
Special notes for reviewers
The `telemetry.RegisterSpanProcessor` registry makes the OTEL provider setup fully provider-agnostic. Sentry self-registers during `sentrypkg.Init` — adding a future provider (e.g. Datadog) requires no changes to `server.go`.
Shutdown ordering is deterministic via LIFO defer: the OTEL provider (which flushes the Sentry span processor) shuts down before `sentrypkg.Close()`.
Large PR Justification
This PR introduces a new cross-cutting observability capability that spans multiple layers of the stack and cannot be meaningfully split without leaving the codebase in a broken or misleading intermediate state:
pkg/telemetry), the span processor registry, the Sentry bridge (pkg/sentry), and the API server wiring (pkg/api,cmd/thv/app) must all land together. Merging any subset would either leave dead code, broken imports, or a non-functional tracing pipelineotelhttpmust be outermost,chiRouteSpanNamermust be inner, andrecoverymust be innermost. Any partial merge that changes this order would silently break panic capture on active spans