fix(experiments): fix unstable local experiment IDs for local data #1600
Conversation
@claude review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c05e9261ac
LGTM — the core fix is correct and well-scoped; two nits are flagged inline.
Extended reasoning...
Overview
This PR touches langfuse/_client/client.py, langfuse/experiment.py, and tests/test_propagate_attributes.py. The change is focused: generate one shared_fallback_experiment_id before processing items and pass it down so all local-experiment items share a stable ID, add experiment_id to ExperimentResult, and update/add tests to verify the invariant.
Security risks
None. This code path handles experiment telemetry IDs only — no auth, crypto, or permissions logic is involved.
Level of scrutiny
Low. The change is mechanical and narrow: a single UUID is generated before the concurrent item loop, threaded through one additional parameter, and substituted for a per-call self._create_observation_id(). The logic is easy to reason about and the diff is small.
Other factors
Two nits are flagged inline by the bug-hunting system. The first (partial-failure inconsistency) is a pre-existing limitation that this PR makes marginally better, not worse, and affects only error-recovery paths. The second (missing len(root_spans) == 2 guard in the new test) is a test-quality concern that does not affect production correctness. Neither rises to the level of blocking the PR.
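The "generate once before the concurrent loop, thread through as a parameter" shape described above can be sketched in isolation. This is a minimal standalone illustration, not the SDK's actual code; names such as `process_item` are stand-ins:

```python
import asyncio
import uuid


async def process_item(item: str, fallback_experiment_id: str) -> str:
    # Each concurrently scheduled task receives the same pre-generated
    # ID as a plain value; no shared mutable state is required.
    return fallback_experiment_id


async def run_experiment(items: list) -> list:
    # Generated exactly once, before the concurrent item loop.
    shared_fallback_experiment_id = str(uuid.uuid4())
    return await asyncio.gather(
        *(process_item(i, shared_fallback_experiment_id) for i in items)
    )


ids = asyncio.run(run_experiment(["a", "b", "c"]))
assert len(set(ids)) == 1  # every item observed the same stable ID
```

Because the ID is passed by value, the concurrent tasks cannot race on it, which is why the reviewers call the change "easy to reason about".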
```diff
         if isinstance(item_metadata, dict):
             final_observation_metadata.update(item_metadata)

-        experiment_id = dataset_run_id or self._create_observation_id()
+        experiment_id = dataset_run_id or fallback_experiment_id
         experiment_item_id = (
             dataset_item_id or get_sha256_hash_hex(_serialize(input_data))[:16]
         )
```
🟡 When dataset_run_items.create fails for some (but not all) items, those failed items are tagged with fallback_experiment_id on their spans, while result.experiment_id returns the dataset_run_id from the successful items — violating the PR’s documented invariant that experiment_id is “stable across all items”. This only occurs in partial API failure scenarios, so the severity is low, but the new public field introduces an observable inconsistency that didn’t exist as a documented guarantee before.
Extended reasoning...
What the bug is and how it manifests
The PR introduces ExperimentResult.experiment_id with the documented guarantee that it is “propagated across all items.” In _process_experiment_item (line 2769), each item sets experiment_id = dataset_run_id or fallback_experiment_id. When the dataset_run_items.create API call succeeds, dataset_run_id is populated and used; when it raises, dataset_run_id remains None and the span is tagged with the per-run fallback_experiment_id instead. Meanwhile, _run_experiment_async derives result.experiment_id as dataset_run_id or shared_fallback_experiment_id (line 2679), where dataset_run_id comes from the first successful item result (line 2626–2633).
The specific code path that triggers it
In a dataset-backed experiment where some items succeed and some fail the dataset_run_items.create call:
- Successful item: `dataset_run_id = "real-run-id-abc"` → span attribute `EXPERIMENT_ID = "real-run-id-abc"`
- Failed item: `dataset_run_id = None` → span attribute `EXPERIMENT_ID = fallback_experiment_id` (e.g. `"sdk-uuid-xyz"`)
- `result.experiment_id = dataset_run_id or shared_fallback_experiment_id = "real-run-id-abc"`
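The divergence can be reproduced with a standalone sketch of the two resolution rules. This is plain Python mirroring the logic described above, not the SDK internals; the ID values are illustrative:

```python
import uuid


def item_experiment_id(dataset_run_id, fallback_experiment_id):
    # Per-item rule: prefer the API-provided run ID, else the shared fallback.
    return dataset_run_id or fallback_experiment_id


fallback = str(uuid.uuid4())              # shared_fallback_experiment_id
item_run_ids = ["real-run-id-abc", None]  # item 2's create() call raised

# What each item's span gets tagged with:
span_ids = [item_experiment_id(rid, fallback) for rid in item_run_ids]

# Run-level rule: first non-None dataset_run_id, else the fallback.
dataset_run_id = next((rid for rid in item_run_ids if rid), None)
result_experiment_id = dataset_run_id or fallback

assert span_ids[0] == result_experiment_id  # successful item matches
assert span_ids[1] != result_experiment_id  # failed item diverges
```

The mismatch on the second item is exactly the observable inconsistency the inline comment flags.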
Why existing code doesn’t prevent it
The fallback_experiment_id is passed into _process_experiment_item and used as-is on failure — there is no post-processing step to retroactively update failed spans once the authoritative dataset_run_id becomes known from successful sibling items. The PR only resolves the local experiment case (no dataset involved), where all items correctly share the single shared_fallback_experiment_id.
What the impact is
Clients inspecting result.experiment_id and expecting it to match the EXPERIMENT_ID span attribute on every trace will observe a mismatch for failed items. Downstream tooling that joins spans by experiment_id (e.g., dashboards, evaluations) will miss the failed items. That said, this is an error-recovery path — the API call already failed, so the dataset linkage is already broken for those items regardless.
Addressing the refutation
The refutation correctly notes that this inconsistency existed pre-PR (failed items each got a unique fresh random ID, which was worse). The PR does improve matters: all failed items now share the same fallback_experiment_id instead of diverging IDs. However, the pre-PR code made no documented stability guarantee — the new result.experiment_id field explicitly documents the invariant “propagated across all items,” making the inconsistency observable and testable against the public contract.
Step-by-step proof
- Call `run_experiment` with a Langfuse dataset (2 items).
- Monkeypatch `dataset_run_items.create` to succeed for item 1 (returns `dataset_run_id="real-id"`) and raise for item 2.
- After the run: `result.experiment_id == "real-id"` (from line 2679, the successful item).
- Inspect spans: item 1's span has `EXPERIMENT_ID = "real-id"`; item 2's span has `EXPERIMENT_ID = "sdk-fallback-uuid"` (the `fallback_experiment_id`).
- `result.experiment_id != span[EXPERIMENT_ID]` for item 2: the documented invariant is broken.
How to fix it
The cleanest fix is to resolve the authoritative dataset_run_id across all items before tagging spans, or to patch failed spans post-hoc once dataset_run_id is known from any successful sibling. Alternatively, the docstring could be scoped: “stable across all successfully registered items.”
```python
        assert result.experiment_id == next(iter(experiment_ids))


    def test_experiment_attributes_propagate_with_dataset(
        self, langfuse_client, memory_exporter, monkeypatch
    ):
        """Test experiment attribute propagation with Langfuse dataset."""
```
🟡 The new `test_experiment_id_is_stable_across_local_items` test asserts `len(experiment_ids) == 1` without first verifying that both items actually produced spans, meaning the uniqueness check could trivially pass with only one span. Add `assert len(root_spans) == 2` before the `experiment_ids` set construction to ensure the test genuinely validates cross-item stability.
Extended reasoning...
Missing span count assertion weakens the stability test
The test is designed to verify that a single, stable `experiment_id` is propagated across all items in a local experiment run. It creates two local items and then asserts `len(experiment_ids) == 1`. However, it never asserts `len(root_spans) == 2` before building the set.
The specific gap:
After collecting the root spans, the code jumps directly to constructing `experiment_ids` and asserting its length is 1. If only 1 of the 2 items produced a span (due to a timing issue, race condition, or regression), the set would still contain exactly 1 element and the assertion would pass without actually validating cross-item stability.
Why existing code does not prevent this:
The flush-then-read-exporter pattern reduces but does not eliminate the possibility that one item's span is missing. There is no assertion enforcing that both items must appear. A future regression silently dropping one item's span would leave this test green.
Step-by-step proof of the false-positive scenario:
- `run_experiment` processes 2 items with the shared fallback experiment ID.
- Due to a hypothetical timing issue, only item 1's span reaches the exporter; item 2's span is dropped.
- `root_spans` has length 1.
- `experiment_ids` is built from that single span, so it has length 1.
- `assert len(experiment_ids) == 1` passes, despite only one item being observed.
- The test name says `across_local_items`, but only one item was checked.
Note: the 0-span edge case is covered (the existing assertions fail on an empty set), so only the 1-of-2 scenario slips through.
Impact:
This is a test-quality issue, not a production code bug. The production fix (sharing a single fallback `experiment_id`) is correct. The regression test could, however, pass silently even if that fix were broken, as long as at least one span is emitted.
Fix:
Add `assert len(root_spans) == 2` immediately after collecting the root spans, before constructing the `experiment_ids` set.
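A minimal sketch of the hardened assertion order, using dict stand-ins for exported spans and a hypothetical attribute key (the real test reads OTel spans from a memory exporter):

```python
# Hypothetical exported spans; the attribute key is illustrative.
root_spans = [
    {"attributes": {"langfuse.experiment.id": "shared-id"}},
    {"attributes": {"langfuse.experiment.id": "shared-id"}},
]

# Guard first: both items must have produced a span. Without this,
# the uniqueness check below passes trivially when one span is dropped.
assert len(root_spans) == 2

experiment_ids = {
    span["attributes"]["langfuse.experiment.id"] for span in root_spans
}
assert len(experiment_ids) == 1  # now genuinely cross-item stability
```

With the count guard in place, a regression that drops one item's span fails loudly instead of shrinking the set to a trivially unique single element.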
Summary
Summary
- `run_experiment` executions reuse a single fallback `experiment_id` across every item in the run
- Exposes `experiment_id` on `ExperimentResult` so callers can inspect the stable run identifier directly
- Asserts `experiment_id` in propagation tests

Root cause
`_process_experiment_item` generated a fresh observation ID for every local item when `dataset_run_id` was absent, so items in the same local experiment shared a `run_name` but not an `experiment_id`.

Impact
Local experiments now propagate one stable experiment ID across all item traces and child spans, while dataset-backed runs continue to prefer the dataset run ID.
Validation
- `uv run pytest tests/test_propagate_attributes.py -k "experiment_attributes_propagate_without_dataset or experiment_id_is_stable_across_local_items or experiment_attributes_propagate_with_dataset"`
- `uv run ruff check langfuse/_client/client.py langfuse/experiment.py tests/test_propagate_attributes.py`
- `pre-commit` hook on commit: `ruff-check`, `ruff-format`, `mypy`

Notes
- Attempted the `tests/test_experiments.py` experiment coverage in this environment, but those tests rely on a configured Langfuse backend and failed due to missing client/base URL configuration rather than this change.

Disclaimer: Experimental PR review
Greptile Summary
This PR fixes a bug where each item in a local (non-dataset) experiment received a freshly generated `experiment_id` on every call to `_process_experiment_item`, breaking attribute propagation across items in the same run. The fix generates one `shared_fallback_experiment_id` once in `_run_experiment_async` and threads it through as a parameter, so all items in the same run share a stable identifier. Dataset-backed runs continue to prefer the `dataset_run_id` returned by the API.

Key changes:
- `_run_experiment_async` pre-generates `shared_fallback_experiment_id` and passes it to each `_process_experiment_item` call; `experiment_id = dataset_run_id or fallback_experiment_id` is evaluated per item.
- `ExperimentResult` gains a new required `experiment_id: str` field, always populated (either `dataset_run_id` or the fallback), giving callers a stable handle to the run regardless of backing.
- `dataset_run_id` extraction at the run level is improved from `valid_results[0].dataset_run_id` (which could return `None` if the first item's API call failed) to `next(... if result.dataset_run_id, None)`, correctly skipping failed items.
- The dataset registration call runs via `asyncio.to_thread(self.api.dataset_run_items.create, ...)`.
- Adds `test_experiment_id_is_stable_across_local_items` and extended assertions in the existing propagation tests.

Confidence Score: 5/5
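The extraction upgrade described above can be illustrated standalone; `ItemResult` is a stand-in for the SDK's item result type, not its actual definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemResult:
    """Stand-in for an experiment item result carrying a dataset run ID."""

    dataset_run_id: Optional[str]


# First item's API call failed, so its run ID is None.
valid_results = [ItemResult(None), ItemResult("run-123")]

# Old extraction: blindly trusts the first element.
old = valid_results[0].dataset_run_id  # None

# New extraction: first non-None value across all items, else None.
new = next(
    (r.dataset_run_id for r in valid_results if r.dataset_run_id), None
)

assert old is None
assert new == "run-123"
```

The generator form degrades gracefully: when every item failed, it returns `None` and the caller falls back to the shared ID.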
Safe to merge — targeted, well-scoped fix with no regressions and solid test coverage.
All remaining findings are at P2 or lower. The core logic is correct: the fallback ID is generated exactly once, threaded as a plain value (no shared mutable state), and dataset-backed runs are unaffected. The `dataset_run_id` extraction improvement is a strict correctness upgrade over the previous first-element access. Tests cover both the local-stability guarantee and the dataset path end-to-end.

No files require special attention.
Important Files Changed
- `langfuse/_client/client.py`: generates the shared fallback experiment ID once, threads it through `_process_experiment_item`, and exposes it on `ExperimentResult`. Also improves `dataset_run_id` extraction to skip `None` entries instead of blindly using the first result.
- `langfuse/experiment.py`: adds the `experiment_id: str` field to `ExperimentResult.__init__` (required, non-optional) with full docstring coverage; stored in `self.experiment_id`.
- `tests/test_propagate_attributes.py`: adds the `test_experiment_id_is_stable_across_local_items` regression test; updates the dataset mock from async to sync and adds assertions that `result.experiment_id` matches span-level EXPERIMENT_ID attributes in both local and dataset-backed tests.
```mermaid
sequenceDiagram
    participant Caller
    participant run_experiment
    participant _run_experiment_async
    participant _process_experiment_item
    Caller->>run_experiment: run_experiment(name, data, task)
    run_experiment->>_run_experiment_async: await
    _run_experiment_async->>_run_experiment_async: shared_fallback_experiment_id = _create_observation_id()
    loop for each item (concurrent via asyncio.gather)
        _run_experiment_async->>_process_experiment_item: item, fallback_experiment_id
        alt DatasetItem (has dataset_id + id)
            _process_experiment_item->>API: dataset_run_items.create(...)
            API-->>_process_experiment_item: DatasetRunItem(dataset_run_id=...)
            _process_experiment_item->>_process_experiment_item: experiment_id = dataset_run_id
        else Local item
            _process_experiment_item->>_process_experiment_item: experiment_id = fallback_experiment_id
        end
        _process_experiment_item-->>_run_experiment_async: ExperimentItemResult(dataset_run_id=...)
    end
    _run_experiment_async->>_run_experiment_async: dataset_run_id = next(result.dataset_run_id for result if result.dataset_run_id, None)
    _run_experiment_async-->>Caller: ExperimentResult(experiment_id = dataset_run_id or shared_fallback_experiment_id)
```

Reviews (1): Last reviewed commit: "Merge branch 'main' into codex/fix-local..."