fix(experiments): fix unstable local experiment IDs for local data#1600

Merged
hassiebp merged 2 commits into main from codex/fix-local-experiment-id
Apr 1, 2026

Conversation

Contributor

@hassiebp hassiebp commented Apr 1, 2026

Summary

  • make local run_experiment executions reuse a single fallback experiment_id across every item in the run
  • return experiment_id on ExperimentResult so callers can inspect the stable run identifier directly
  • add regression coverage for multi-item local experiments and assert the exposed experiment_id in propagation tests

Root cause

_process_experiment_item generated a fresh observation ID for every local item when dataset_run_id was absent, so items in the same local experiment shared a run_name but not an experiment_id.
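The root cause can be sketched in plain Python (`uuid4` stands in for the SDK's internal ID generator; the function names are illustrative, not the SDK's API):

```python
import uuid

def process_items_buggy(items):
    # Pre-fix behavior: each item generates its own fallback ID,
    # so spans from the same local run end up with different IDs.
    return [uuid.uuid4().hex for _ in items]

def process_items_fixed(items):
    # Post-fix behavior: one fallback ID is generated up front
    # and reused for every item in the run.
    shared_fallback_experiment_id = uuid.uuid4().hex
    return [shared_fallback_experiment_id for _ in items]

print(len(set(process_items_buggy(["a", "b", "c"]))))  # 3 — unstable
print(len(set(process_items_fixed(["a", "b", "c"]))))  # 1 — stable
```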

Impact

Local experiments now propagate one stable experiment ID across all item traces and child spans, while dataset-backed runs continue to prefer the dataset run ID.

Validation

  • uv run pytest tests/test_propagate_attributes.py -k "experiment_attributes_propagate_without_dataset or experiment_id_is_stable_across_local_items or experiment_attributes_propagate_with_dataset"
  • uv run ruff check langfuse/_client/client.py langfuse/experiment.py tests/test_propagate_attributes.py
  • pre-commit hook on commit: ruff-check, ruff-format, mypy

Notes

  • I also tried the higher-level tests/test_experiments.py experiment coverage in this environment, but those rely on a configured Langfuse backend here and failed due to missing client/base URL configuration rather than this change.

Disclaimer: Experimental PR review

Greptile Summary

This PR fixes a bug where each item in a local (non-dataset) experiment received a freshly generated experiment_id on every call to _process_experiment_item, breaking attribute propagation across items in the same run. The fix generates one shared_fallback_experiment_id once in _run_experiment_async and threads it through as a parameter, so all items in the same run share a stable identifier. Dataset-backed runs continue to prefer the dataset_run_id returned by the API.

Key changes:

  • _run_experiment_async pre-generates shared_fallback_experiment_id and passes it to each _process_experiment_item call; experiment_id = dataset_run_id or fallback_experiment_id is evaluated per-item.
  • ExperimentResult gains a new required experiment_id: str field, always populated (either dataset_run_id or the fallback), giving callers a stable handle to the run regardless of backing.
  • dataset_run_id extraction at the run level is improved from valid_results[0].dataset_run_id (which could return None if the first item's API call failed) to next(... if result.dataset_run_id, None), correctly skipping failed items.
  • The test mock for the dataset path is updated from async to sync, aligning with the SDK's use of asyncio.to_thread(self.api.dataset_run_items.create, ...).
  • Regression coverage is added via test_experiment_id_is_stable_across_local_items and extended assertions in the existing propagation tests.

Confidence Score: 5/5

Safe to merge — targeted, well-scoped fix with no regressions and solid test coverage.

All remaining findings are at P2 or lower. The core logic is correct: the fallback ID is generated exactly once, threaded as a plain value (no shared mutable state), and dataset-backed runs are unaffected. The dataset_run_id extraction improvement is a strict correctness upgrade over the previous first-element access. Tests cover both the local-stability guarantee and the dataset path end-to-end.

No files require special attention.

Important Files Changed

Filename Overview
langfuse/_client/client.py Generates one shared fallback experiment ID before processing items, passes it into _process_experiment_item, and exposes it on ExperimentResult. Also improves dataset_run_id extraction to skip None entries instead of blindly using the first result.
langfuse/experiment.py Adds experiment_id: str field to ExperimentResult.__init__ (required, non-optional) with full docstring coverage; stored in self.experiment_id.
tests/test_propagate_attributes.py Adds test_experiment_id_is_stable_across_local_items regression test; updates dataset mock from async to sync and adds assertions that result.experiment_id matches span-level EXPERIMENT_ID attributes in both local and dataset-backed tests.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant run_experiment
    participant _run_experiment_async
    participant _process_experiment_item

    Caller->>run_experiment: run_experiment(name, data, task)
    run_experiment->>_run_experiment_async: await
    _run_experiment_async->>_run_experiment_async: shared_fallback_experiment_id = _create_observation_id()
    loop for each item (concurrent via asyncio.gather)
        _run_experiment_async->>_process_experiment_item: item, fallback_experiment_id
        alt DatasetItem (has dataset_id + id)
            _process_experiment_item->>API: dataset_run_items.create(...)
            API-->>_process_experiment_item: DatasetRunItem(dataset_run_id=...)
            _process_experiment_item->>_process_experiment_item: experiment_id = dataset_run_id
        else Local item
            _process_experiment_item->>_process_experiment_item: experiment_id = fallback_experiment_id
        end
        _process_experiment_item-->>_run_experiment_async: ExperimentItemResult(dataset_run_id=...)
    end
    _run_experiment_async->>_run_experiment_async: dataset_run_id = next((r.dataset_run_id for r in results if r.dataset_run_id), None)
    _run_experiment_async-->>Caller: ExperimentResult(experiment_id = dataset_run_id or shared_fallback_experiment_id)

Reviews (1): Last reviewed commit: "Merge branch 'main' into codex/fix-local..."

@hassiebp hassiebp marked this pull request as ready for review April 1, 2026 08:50

github-actions bot commented Apr 1, 2026

@claude review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c05e9261ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@hassiebp hassiebp changed the title from "[codex] Fix unstable local experiment IDs" Apr 1, 2026
@hassiebp hassiebp merged commit e32aeb2 into main Apr 1, 2026
5 checks passed
@hassiebp hassiebp deleted the codex/fix-local-experiment-id branch April 1, 2026 08:56

@claude claude bot left a comment

LGTM — the core fix is correct and well-scoped; two nits are flagged inline.

Extended reasoning...

Overview

This PR touches langfuse/_client/client.py, langfuse/experiment.py, and tests/test_propagate_attributes.py. The change is focused: generate one shared_fallback_experiment_id before processing items and pass it down so all local-experiment items share a stable ID, add experiment_id to ExperimentResult, and update/add tests to verify the invariant.

Security risks

None. This code path handles experiment telemetry IDs only — no auth, crypto, or permissions logic is involved.

Level of scrutiny

Low. The change is mechanical and narrow: a single UUID is generated before the concurrent item loop, threaded through one additional parameter, and substituted for a per-call self._create_observation_id(). The logic is easy to reason about and the diff is small.

Other factors

Two nits are flagged inline by the bug-hunting system. The first (partial-failure inconsistency) is a pre-existing limitation that this PR makes marginally better, not worse, and affects only error-recovery paths. The second (missing len(root_spans) == 2 guard in the new test) is a test-quality concern that does not affect production correctness. Neither rises to the level of blocking the PR.

Comment on lines 2766 to 2772
  if isinstance(item_metadata, dict):
      final_observation_metadata.update(item_metadata)

- experiment_id = dataset_run_id or self._create_observation_id()
+ experiment_id = dataset_run_id or fallback_experiment_id
  experiment_item_id = (
      dataset_item_id or get_sha256_hash_hex(_serialize(input_data))[:16]
  )

🟡 When dataset_run_items.create fails for some (but not all) items, those failed items are tagged with fallback_experiment_id on their spans, while result.experiment_id returns the dataset_run_id from the successful items — violating the PR’s documented invariant that experiment_id is “stable across all items”. This only occurs in partial API failure scenarios, so the severity is low, but the new public field introduces an observable inconsistency that didn’t exist as a documented guarantee before.

Extended reasoning...

What the bug is and how it manifests

The PR introduces ExperimentResult.experiment_id with the documented guarantee that it is “propagated across all items.” In _process_experiment_item (line 2769), each item sets experiment_id = dataset_run_id or fallback_experiment_id. When the dataset_run_items.create API call succeeds, dataset_run_id is populated and used; when it raises, dataset_run_id remains None and the span is tagged with the per-run fallback_experiment_id instead. Meanwhile, _run_experiment_async derives result.experiment_id as dataset_run_id or shared_fallback_experiment_id (line 2679), where dataset_run_id comes from the first successful item result (line 2626–2633).

The specific code path that triggers it

In a dataset-backed experiment where some items succeed and some fail the dataset_run_items.create call:

  1. Successful item: dataset_run_id = "real-run-id-abc" → span attribute EXPERIMENT_ID = "real-run-id-abc"
  2. Failed item: dataset_run_id = None → span attribute EXPERIMENT_ID = fallback_experiment_id (e.g. "sdk-uuid-xyz")
  3. result.experiment_id = dataset_run_id or shared_fallback_experiment_id = "real-run-id-abc"

Why existing code doesn’t prevent it

The fallback_experiment_id is passed into _process_experiment_item and used as-is on failure — there is no post-processing step to retroactively update failed spans once the authoritative dataset_run_id becomes known from successful sibling items. The PR only resolves the local experiment case (no dataset involved), where all items correctly share the single shared_fallback_experiment_id.

What the impact is

Clients inspecting result.experiment_id and expecting it to match the EXPERIMENT_ID span attribute on every trace will observe a mismatch for failed items. Downstream tooling that joins spans by experiment_id (e.g., dashboards, evaluations) will miss the failed items. That said, this is an error-recovery path — the API call already failed, so the dataset linkage is already broken for those items regardless.

Addressing the refutation

The refutation correctly notes that this inconsistency existed pre-PR (failed items each got a unique fresh random ID, which was worse). The PR does improve matters: all failed items now share the same fallback_experiment_id instead of diverging IDs. However, the pre-PR code made no documented stability guarantee — the new result.experiment_id field explicitly documents the invariant “propagated across all items,” making the inconsistency observable and testable against the public contract.

Step-by-step proof

  1. Call run_experiment with a Langfuse dataset (2 items).
  2. Monkeypatch dataset_run_items.create to succeed for item 1 (returns dataset_run_id="real-id") and raise for item 2.
  3. After the run: result.experiment_id == "real-id" (from line 2679, successful item).
  4. Inspect spans: item 1 span has EXPERIMENT_ID = "real-id", item 2 span has EXPERIMENT_ID = "sdk-fallback-uuid" (the fallback_experiment_id).
  5. result.experiment_id != span[EXPERIMENT_ID] for item 2 — the documented invariant is broken.

How to fix it

The cleanest fix is to resolve the authoritative dataset_run_id across all items before tagging spans, or to patch failed spans post-hoc once dataset_run_id is known from any successful sibling. Alternatively, the docstring could be scoped: “stable across all successfully registered items.”
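The first suggestion can be sketched as a two-phase resolution step (the helper name and list-based inputs are hypothetical, not the SDK's actual internals):

```python
def resolve_experiment_id(dataset_run_ids, fallback):
    # Collect all registration results first, pick the authoritative
    # run ID if any item succeeded, then tag every span with the same
    # value — failed items included.
    authoritative = next((rid for rid in dataset_run_ids if rid), None)
    return authoritative or fallback

# One item succeeded, one failed: all spans get the real run ID.
print(resolve_experiment_id(["real-id", None], "fallback-uuid"))  # real-id
# All items failed: all spans get the shared fallback.
print(resolve_experiment_id([None, None], "fallback-uuid"))       # fallback-uuid
```

The trade-off is that span tagging must be deferred until every registration call has settled, which changes when spans can be emitted relative to task execution.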

Comment on lines +2507 to 2512
        assert result.experiment_id == next(iter(experiment_ids))

    def test_experiment_attributes_propagate_with_dataset(
        self, langfuse_client, memory_exporter, monkeypatch
    ):
        """Test experiment attribute propagation with Langfuse dataset."""

🟡 The new test_experiment_id_is_stable_across_local_items test asserts len(experiment_ids) == 1 without first verifying that both items actually produced spans, meaning the uniqueness check could trivially pass with only one span. Add assert len(root_spans) == 2 before the experiment_ids set construction to ensure the test genuinely validates cross-item stability.

Extended reasoning...

Missing span count assertion weakens the stability test

The test is designed to verify that a single, stable experiment_id is propagated across all items in a local experiment run. It creates two local items and then asserts len(experiment_ids) == 1. However, it never asserts len(root_spans) == 2 before building the set.

The specific gap:
After collecting the root spans, the code jumps directly to constructing experiment_ids and asserting its length is 1. If only 1 of the 2 items produced a span (due to a timing issue, race condition, or regression), the set would still contain exactly 1 element and the assertion would pass without actually validating cross-item stability.

Why existing code does not prevent this:
The test's span-collection setup reduces but does not eliminate the possibility that one item span is missing. There is no assertion enforcing that both items must appear. A future regression silently dropping one item span would leave this test green.

Step-by-step proof of the false-positive scenario:

  1. run_experiment processes 2 items with the shared fallback experiment ID.
  2. Due to a hypothetical timing issue, only item 1's span reaches the exporter; item 2's span is dropped.
  3. root_spans has length 1.
  4. experiment_ids, built from those spans, also has length 1.
  5. The len(experiment_ids) == 1 assertion passes, despite only one item being observed.
  6. The test name says across_local_items but only one item was checked.

Note: the 0-span edge case is covered, since an empty experiment_ids set fails the assertions, so only the 1-of-2 scenario slips through.

Impact:
This is a test quality issue, not a production code bug. The fallback-ID sharing fix itself is correct. The regression test could however pass silently even if that fix were broken, as long as at least one span is emitted.

Fix:
Add assert len(root_spans) == 2 immediately after collecting the root spans.
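A minimal sketch of that guard (the span dict shape and helper name are illustrative, not the test suite's actual exporter objects):

```python
def check_experiment_spans(root_spans):
    # Require both item spans before checking that their IDs agree;
    # without this guard, a single surviving span trivially passes.
    assert len(root_spans) == 2, f"expected 2 item spans, got {len(root_spans)}"
    experiment_ids = {span["experiment_id"] for span in root_spans}
    assert len(experiment_ids) == 1, f"unstable experiment IDs: {experiment_ids}"
    return next(iter(experiment_ids))

spans = [{"experiment_id": "run-abc"}, {"experiment_id": "run-abc"}]
print(check_experiment_spans(spans))  # run-abc
```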

