Skip to content

feat: Exemplars for metric & PromQL charts#2536

Open
jordan-simonovski wants to merge 9 commits into
mainfrom
feat/exemplars
Open

feat: Exemplars for metric & PromQL charts#2536
jordan-simonovski wants to merge 9 commits into
mainfrom
feat/exemplars

Conversation

@jordan-simonovski

@jordan-simonovski jordan-simonovski commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Why

Engineers staring at a latency spike on a chart have no way to jump to a trace that caused it.
This change adds exemplars, clickable markers overlaid on time charts, each linking to a representative trace.
Works for metric and PromQL sources.

Also added some test-telemetry infrastructure to build and validate the feature against more complex data sets.

CleanShot.2026-06-29.at.13.50.52.mp4

What's includedExemplar overlay (app)

  • Diamond markers on HDXMultiSeriesTimeChart, plotted at the trace's own value, thinned to a configurable target across the visible range (the slowest/most-notable trace per window).
  • Hover card showing trace metadata (service, span, duration, status) fetched from a configurable exemplar trace source, with an Inspect trace button that deep-links straight to that trace. The card flips/clamps to stay on-screen near chart edges.
  • Opt-in "Exemplars" toggle in the chart editor (next to "As Ratio" for builder/metric charts; in the PromQL editor for PromQL charts) — persisted on the chart config, not a runtime toggle.

Two data backends

  • Metric sources (ClickHouse): renderMetricExemplarsChartConfig reads the OTel metric tables' Exemplars.* columns, honoring the chart's time range, metric name, and filters.
  • PromQL sources (native Prometheus): new /v1/prometheus/query_exemplars proxy + prometheussponses normalized to a shared Exemplar shape (defensive about trace_id/traceID labelspellings).
  • Fetched in parallel via a new useExemplars hook, gated so it's a no-op unless enabled and.

Fully-OTLP coherent metrics (collector)

  • Added the spanmetrics connector to the collector build and wired it into the OpAMP-generaPAN_METRICS, off by default). It derives traces.span.metrics.* (calls + duration histogram)
    with exemplars from spans, so the duration histogram lands in ClickHouse with Exemplars.* ps — no synthetic/seeded data.
  • Optionally remote-writes those metrics (with exemplars) to a Prometheus endpoint (ENABLE_native PromQL exemplar path is testable against the same real data.

Telemetry generator (telemetry-generator/)

  • A Node service emitting realistic OTLP traces (6 services, weighted attribute pools, nest several failure scenarios) with backfill + live emission. Replaces ad-hoc ClickHouseseeding; wired into docker-compose.dev.yml. The spanmetrics connector turns its traces into coherent metric exemplars.

Team setting

  • maxExemplars (Team Settings → Chart Settings; 0 = unlimited) controls overlay density.

Scoping

  • Exemplars are restricted to single, non-ratio, histogram (latency) metric series — the onalue shares the chart's y-axis unit. Toggle hidden and renderer returns null otherwise.

Out of scope (separate tickets)

  • Heatmap exemplars — needs trace-source exemplar generation + a uPlot overlay.
  • Ratio + Group By — the ratio engine isn't group-aware; tracked as a separate bugfix PR of

Testing

  • common-utils: SQL renderer tests (metric exemplar query shape; null for ratio/multi-series/non-histogram/non-metric).
  • app: normalizePrometheusExemplars label-variant tests; DBTimeChart updated to mock the ne
  • api: query_exemplars route integration tests (native proxy + ClickHouse-backed empty result).
  • Verified end-to-end in an isolated stack: the spanmetrics connector emits a traces.span.metrics.duration histogram with real exemplars (trace id + actual latency), 100% resolving back to seeded traces.
  • make ci-lint / per-package tsc + unit suites green.

Changesets

  • exemplar-mode-metrics.md — @hyperdx/common-utils, @hyperdx/api, @hyperdx/app (minor)
  • span-metrics-connector.md — @hyperdx/api, @hyperdx/otel-collector (minor)

Notes / caveats

  • ENABLE_SPAN_METRICS is off by default — no production behavior change; it's enabled in local dev.
  • HyperDX collectors enforce ingest auth with scheme: '' (raw token, no Bearer prefix) — thon: <INGESTION_API_KEY> accordingly.

@changeset-bot

changeset-bot Bot commented Jun 29, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: d008634

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@hyperdx/common-utils Minor
@hyperdx/api Minor
@hyperdx/app Minor
@hyperdx/otel-collector Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel

vercel Bot commented Jun 29, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment Jul 1, 2026 11:57pm
hyperdx-storybook Ready Ready Preview, Comment Jul 1, 2026 11:57pm

Request Review

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Jun 29, 2026
@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (3):
    • packages/api/src/config.ts
    • packages/api/src/models/team.ts
    • packages/otel-collector/builder-config.yaml
  • Cross-layer change: touches frontend (packages/app) + backend (packages/api) + shared utils (packages/common-utils)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 24
  • Production lines changed: 1610 (+ 274 in test files, excluded from tier calculation)
  • Branch: feat/exemplars
  • Author: jordan-simonovski

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@jordan-simonovski jordan-simonovski marked this pull request as draft June 29, 2026 05:57
@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds exemplar markers for metric and PromQL time charts. The main changes are:

  • Exemplar overlay rendering and hover trace cards in the app.
  • PromQL query_exemplars proxy support in the API.
  • ClickHouse exemplar SQL for single histogram metric charts.
  • Optional spanmetrics collector wiring for coherent trace-linked metrics.
  • Local telemetry generation and dev Prometheus setup for testing.

Confidence Score: 4/5

This is close, but one query-scoping bug should be fixed before merging.

  • The remote-write endpoint fix now avoids generating collector config with an unresolved endpoint.
  • Metric exemplar queries can still include markers from other metrics when chart filters use OR logic.
  • That can show trace markers that do not match the plotted histogram series.

packages/common-utils/src/core/renderChartConfig.ts

Important Files Changed

Filename Overview
packages/common-utils/src/core/renderChartConfig.ts Adds metric exemplar SQL generation, but the metric-name predicate can still be grouped with OR user filters.
packages/api/src/config.ts Adds spanmetrics feature flags and gates remote write on a configured endpoint.
packages/api/src/opamp/controllers/opampController.ts Adds spanmetrics collector config and inlines the remote-write endpoint into generated config.
packages/app/src/hooks/useExemplars.tsx Adds exemplar fetching and Prometheus exemplar normalization for supported chart sources.
packages/app/src/HDXMultiSeriesTimeChart.tsx Adds exemplar marker rendering and y-axis handling for marker values.

Comments Outside Diff (1)

  1. packages/common-utils/src/core/renderChartConfig.ts, line 1773-1779 (link)

    P1 Metric Guard Still Escapes

    This still adds the required metric-name check to the same filters array as the user filters. When a chart uses filtersLogicalOperator: 'OR', a user filter such as ServiceName = 'api' can match rows for other metrics, and those rows can be array-joined into this exemplar overlay. The marker set can then include traces that do not belong to the plotted histogram series. Keep the metric-name predicate outside the user filter group so it is always ANDed with the rest of the exemplar query.

    Fix in Claude Code Fix in Conductor Fix in Cursor Fix in Codex

Fix All in Claude Code Fix All in Conductor Fix All in Cursor Fix All in Codex

Reviews (6): Last reviewed commit: "Merge branch 'main' into feat/exemplars" | Re-trigger Greptile

Comment thread packages/api/src/opamp/controllers/opampController.ts Outdated
@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

E2E Test Results

All tests passed • 224 passed • 3 skipped • 1469s

Status Count
✅ Passed 224
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@jordan-simonovski jordan-simonovski marked this pull request as ready for review June 29, 2026 20:58
Comment thread packages/api/src/config.ts Outdated
…ated collector config, so the collector container no longer needs SPAN_METRICS_PROM_RW_ENDPOINT in its own environment.
Comment on lines +2247 to +2256
const where = await renderWhere(whereConfig, metadata);
const from = renderFrom({ from: whereConfig.from });

return concatChSql(' ', [
chSql`SELECT
toUnixTimestamp64Milli(ex_TimeUnix) AS timestamp,
ex_Value AS value,
ex_TraceId AS traceId,
ex_SpanId AS spanId`,
chSql`FROM ${from}`,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Metric Filter Escapes

The metric-name condition is added to filters, so a chart using filtersLogicalOperator: 'OR' can turn the exemplar WHERE clause into userFilterA OR userFilterB OR MetricName = .... In that case, the exemplar scan can include other metrics whenever a user filter matches, or include this metric outside the intended user filter group. Keep the required metric-name check ANDed separately from the user filter group so exemplar markers stay scoped to the plotted series.

Fix in Claude Code Fix in Conductor Fix in Cursor Fix in Codex

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Deep Review

No critical issues found. No P0/P1 defects: security cleared the interpolated ClickHouse SQL (interpolated values are trusted admin source-config; the request-controlled traceId is parameterized) and the /query_exemplars proxy (team-scoped connection lookup, same SSRF posture as the existing /query). The opampController config mutation is safe — the base config always defines pipelines.traces, connectors, and exporters, and the remote-write endpoint assertion is guarded. The exemplar query is isolated from the chart render, so a failed fetch cannot break the chart. Recommendations below.

🟡 P2 -- recommended

  • packages/app/src/HDXMultiSeriesTimeChart.tsx:839 -- The exemplar-thinning bucket key is ${groupKey}@${bucket}, so for multi-series PromQL (where each series gets a distinct groupKey) the maxExemplars cap applies per series, yielding up to maxExemplars × series-cardinality ReferenceDot nodes and render jank on high-cardinality queries.
    • Fix: Bucket by time window only, or apply a hard total cap after per-series selection, so marker count cannot scale with PromQL series cardinality.
    • adversarial
  • packages/app/src/HDXMultiSeriesTimeChart.tsx:666 -- The PromQL exemplar path has no result cap (unlike the ClickHouse path's EXEMPLAR_QUERY_LIMIT), and Math.max(...exemplarValues) spreads the whole array as call arguments, which can throw RangeError and blank the chart on a large upstream exemplar set.
    • Fix: Cap the proxied /query_exemplars result count and compute the max with a reduce loop instead of an argument spread.
    • adversarial, julik-frontend-races, kieran-typescript
  • packages/app/src/HDXMultiSeriesTimeChart.tsx:452 -- When a hovered exemplar marker is removed by data refetch/thinning, React unmounts the <g> without firing mouseleave, so the close handler never runs and the hover card lingers indefinitely.
    • Fix: Clear hoveredExemplar when the exemplar/point set identity changes, or drive open/close from a single delegated pointer handler on the chart container.
    • julik-frontend-races
  • packages/app/src/hooks/useExemplars.tsx:186 -- useExemplarTraceMeta hand-builds ClickHouse SQL by interpolating source column expressions, bypassing the renderChartConfig/renderFrom renderer used everywhere else and duplicating escaping/qualification logic the renderer already owns.
    • Fix: Build the trace-meta query through the shared chart-config renderer so column-expression handling and default fallbacks live in one place.
    • maintainability
  • packages/app/src/hooks/useExemplars.tsx:40 -- The Prometheus exemplar response shape is re-declared as an inline structural type on normalizePrometheusExemplars, duplicating PrometheusExemplarsResult/PrometheusQueryExemplarsResponse in api.ts; the two copies will drift when an exporter field changes.
    • Fix: Export the exemplar-series type once and reference it from both the client and the normalizer.
    • maintainability, kieran-typescript
  • packages/app/src/HDXMultiSeriesTimeChart.tsx:783 -- The core new client logic (exemplarPoints dedup/per-bucket thinning and the exemplar-driven y-axis upper bound) has no test coverage, so a regression that hides or clips markers would pass CI silently.
    • Fix: Extract the bucketing and y-axis-domain computation into exported pure helpers and unit-test dedup, per-bucket max selection, per-groupKey separation, and the exemplar-max upper bound.
    • testing, correctness
🔵 P3 nitpicks (13)
  • packages/app/src/HDXMultiSeriesTimeChart.tsx:442 -- ExemplarDot(props: any) disables type checking on cx/cy, the exemplar payload, and both hover callbacks.
    • Fix: Declare an explicit props type for the injected coords and the hover callbacks.
    • maintainability, kieran-typescript
  • packages/app/src/HDXMultiSeriesTimeChart.tsx:514 -- The MemoChart maxExemplars = 12 default param duplicates DEFAULT_MAX_EXEMPLARS.
    • Fix: Reference DEFAULT_MAX_EXEMPLARS from @/defaults instead of the literal.
  • packages/app/src/components/DBTimeChart.tsx:510 -- (source as { traceSourceId?: string }) casts away the real TSource type even though traceSourceId is already on the source schema, masking a future rename.
    • Fix: Use source?.traceSourceId directly.
    • kieran-typescript
  • packages/app/src/hooks/useExemplars.tsx:80 -- Number(r.timestamp)/Number(r.value) (and Number(ex.value)) can yield silent NaN on malformed rows, mispositioning a marker.
    • Fix: Filter out non-finite timestamp/value before pushing.
    • kieran-typescript
  • packages/app/src/hooks/useExemplars.tsx:76 -- mapClickhouseExemplars types raw rows as Record<string, any>, letting downstream field misuse pass unchecked.
    • Fix: Use Record<string, unknown> and narrow via the existing String()/Number() calls.
    • kieran-typescript
  • packages/api/src/opamp/controllers/opampController.ts:391 -- config.SPAN_METRICS_PROM_RW_ENDPOINT! relies on a cross-variable correlation the compiler cannot verify.
    • Fix: Destructure the endpoint into a local const and branch on it so the narrowing is real.
    • kieran-typescript
  • packages/app/src/components/DBTimeChart.tsx:517 -- hoveredExemplar stores absolute pixel coords captured at hover time; after a rescale/refetch the marker moves but the card stays anchored and shows stale trace metadata.
    • Fix: Reset hoveredExemplar when exemplars/graphResults change.
    • julik-frontend-races
  • packages/app/src/HDXMultiSeriesTimeChart.tsx:708 -- When a series selection (or fit mode) is active but no visible series has data, the domain guard returns ['auto','auto'], letting recharts clip the exemplar markers off-chart.
    • Fix: Return [0, exemplarMax * 1.05] when only exemplar values are finite.
    • correctness
  • packages/api/src/routers/api/prometheus.ts:557 -- A ClickHouse-backed PromQL chart routes through query_exemplars and silently receives data: [], showing zero markers with no user-visible reason.
    • Fix: Gate the PromQL exemplar toggle on isPrometheusEndpoint, or surface a "not supported for this backend" hint.
    • api-contract
  • packages/api/src/routers/api/prometheus.ts:546 -- The catch block maps all errors (including server-side failures) to HTTP 400 bad_data, mirroring /query.
    • Fix: Distinguish input-validation (400) from internal errors (500) here and in /query together.
    • reliability
  • packages/common-utils/src/core/renderChartConfig.ts:2264 -- A personal ponytail: TODO marker leaked into shared code, invisible to standard TODO tooling.
    • Fix: Replace with a conventional TODO: or remove it.
    • maintainability
  • telemetry-generator/src/index.js:14 -- The header docblock states metric exemplars are written to ClickHouse directly because the collector has no spanmetrics connector, but the implementation emits only traces and relies on the newly added connector; the same wording leaked into the dev-compose service comment.
    • Fix: Update the docblock and compose comment to describe the connector-derived flow.
    • project-standards
  • telemetry-generator/src/index.js:1 -- The dev-only generator is a single 508-line file, exceeding the repo's documented 300-line guideline.
    • Fix: Split into config/attribute-pools/scenarios/emission/main modules (judgment call for a standalone dev service).
    • project-standards

Reviewers (13): correctness, security, adversarial, performance, api-contract, reliability, kieran-typescript, julik-frontend-races, testing, maintainability, project-standards, agent-native, learnings.

Testing gaps:

  • mapClickhouseExemplars and the useExemplars ClickHouse branch are unexported and untested (only the Prometheus normalizer is covered).
  • buildOtelCollectorConfig's new span_metrics branch (connector, traces-pipeline exporter, metrics/spanmetrics pipeline, and the conditional remote-write exporter) has no test.
  • No test pins renderMetricExemplarsChartConfig returning null for non-histogram metric types, nor the EXEMPLAR_QUERY_LIMIT cap.
  • navigateToExemplarTrace's two URL branches and exemplarPoints boundary cases (granularity → 0, dateRange start==end, maxExemplars<=0) are uncovered.
  • No test asserts an upper bound on the PromQL /query_exemplars result count (the uncapped path behind the two P2 findings above).

Agent-native: PASS. New chart-config fields persist via the REST dashboard API, but are absent from the MCP dashboard tile schema (packages/api/src/mcp/tools/dashboards/schemas.ts) — consistent with the existing backgroundChart precedent; optional follow-up if agent-authored dashboards should toggle exemplars.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant