
fix: optimizes prefix-based caching#425

Open
adavyas wants to merge 16 commits into main from fix/prefix-based-cache-optim

Conversation

@adavyas
Contributor

@adavyas adavyas commented Mar 14, 2026

Summary

Improves prompt prefix caching by separating stable system instructions from higher-variance request content.

This keeps the cacheable prefix stable across repeated calls and adds supporting metrics/tests for cache behavior.

What Changed

  • split deriver prompt construction into stable system prompt + variable user prompt
  • improve summary prompt construction to use stable system instructions plus request-specific user content
  • add cache-control normalization for cacheable system blocks in client handling
  • add Gemini cached-token extraction to telemetry response handling
  • add cache comparison tooling for prefix-cache evaluation
  • add/update tests around:
    • client cache-control behavior
    • summarizer prompt structure
    • cache token extraction

Why

Previously, stable instructions and changing context were tightly coupled in a single prompt, which reduced prefix-cache reuse whenever only part of the prompt changed.

This PR isolates stable prompt sections so provider-side prompt caching can reuse more of the unchanged prefix.
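The split described here can be sketched roughly as follows. The builder names (`build_system_prompt`, `build_user_prompt`, `build_messages`) are illustrative, not the PR's actual function names; the point is that the system message is byte-identical across calls while all per-request content lives in the user message:

```python
def build_system_prompt() -> str:
    # Stable instructions: identical across calls, so the provider can
    # cache this prefix and reuse it on subsequent requests.
    return (
        "You are a fact extractor. Derive explicit facts about the speaker "
        "from the messages provided in the user turn."
    )

def build_user_prompt(peer_id: str, formatted_messages: str) -> str:
    # Variable content: changes per request, kept out of the cached prefix.
    return f"Peer: {peer_id}\n<messages>\n{formatted_messages}\n</messages>"

def build_messages(peer_id: str, formatted_messages: str) -> list[dict[str, str]]:
    return [
        {"role": "system", "content": build_system_prompt()},
        {"role": "user", "content": build_user_prompt(peer_id, formatted_messages)},
    ]
```

Because the system message never varies, repeated calls for different peers and conversations still share the same cacheable prefix.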

Notes

  • no functional feature change is intended beyond prompt structuring and cache-related telemetry
  • branch-specific CI/type-check fixes were included where needed

Testing

Verified locally with:

  • py_compile on changed files
  • prompt/caching structure checks
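The py_compile check above can be run directly from the command line; it byte-compiles a file to catch syntax errors without importing it. The file below is a stand-in; in practice you would pass the changed source files:

```shell
# Byte-compile to catch syntax errors without executing the module.
# /tmp/check_me.py is a stand-in for a changed file such as src/deriver/prompts.py.
printf 'x = 1\n' > /tmp/check_me.py
python3 -m py_compile /tmp/check_me.py && echo "syntax OK"
```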

Summary by CodeRabbit

  • New Features

    • Add a comparison tool to probe prompt-prefix cache behavior across code variants, producing per-call metrics, human-readable results, delta summaries, and optional JSON export.
    • Surface improved cache read/create metrics for additional providers (including Gemini).
  • Chores

    • Move many prompt flows to explicit system/user message pairs for derivation, summarization, and response construction.
    • Stop mutating base system content; append session history as a separate message.
  • Tests

    • Add tests covering cache token extraction, system-message cache metadata, and message-based summary creation.
@coderabbitai
Contributor

coderabbitai bot commented Mar 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new evaluation script to compare prompt-prefix cache behavior across worktrees; refactors multiple prompt flows to separate system/user messages; changes session-history injection to a separate system message; introduces Gemini cache-token extraction and propagation; and updates tests for these behaviors.

Changes

Cohort / File(s) Summary
Benchmarking script
scripts/compare_prefix_cache.py
New eval-only script that spawns child processes to import per-worktree honcho_llm_call_inner, runs probe scenarios (prime/transition/steady), and emits per-call cache/latency/token metrics as JSON.
Deriver
src/deriver/deriver.py, src/deriver/prompts.py
Split deriver prompt into system and user builders; deriver now sends a messages payload (system + user) instead of a single combined prompt; token estimation updated.
Summarizer
src/utils/summarizer.py, tests/utils/test_summarizer.py
Introduce short/long system and user prompt builders and create_short_summary/create_long_summary; switch to message-based LLM calls and update token estimates; tests assert message composition.
LLM client / cache handling
src/utils/clients.py, tests/utils/test_clients.py
Add cacheable text block normalization, Gemini content builder, and extract_gemini_cache_tokens; normalize system content into cacheable blocks and propagate Gemini/OpenAI-style cache token metadata into responses; tests extended for cache tokens and system-block handling.
Session history handling
src/dialectic/core.py
Stop mutating base system prompt; append session history as a separate system message to preserve stable instructions for caching.
Tests & coverage
tests/utils/test_clients.py, tests/utils/test_summarizer.py
Expanded tests for Gemini/Anthropic/OpenAI flows, cache token extraction/propagation, and summarizer message composition; new tests validate create_short_summary/create_long_summary behaviors.
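Based on the test expectations quoted later in this review (None metadata and a missing `cached_content_token_count` attribute both yield zeros), the Gemini cache-token extraction plausibly looks something like this sketch; it is not the PR's actual implementation, and it assumes implicit caching where Gemini reports reads but no separate creation count:

```python
from typing import Any

def extract_gemini_cache_tokens(usage_metadata: Any) -> tuple[int, int]:
    """Return (cache_creation_tokens, cache_read_tokens) from Gemini usage metadata.

    Sketch only: reads `cached_content_token_count` defensively and reports
    zero creation tokens, matching the edge cases exercised in the tests.
    """
    if usage_metadata is None:
        return 0, 0
    cached = getattr(usage_metadata, "cached_content_token_count", None)
    cache_read = cached if isinstance(cached, int) else 0
    return 0, cache_read
```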

Sequence Diagram(s)

sequenceDiagram
    participant Orchestrator as Orchestrator Script
    participant Child as Child Process<br/>(worktree honcho_llm_call_inner)
    participant PromptBuilder as Prompt Builder<br/>(system + user)
    participant Provider as LLM Provider
    participant Results as Metrics Collector

    rect rgba(120, 180, 240, 0.5)
    Orchestrator->>Child: spawn subprocess\n(pass scenario payload & namespace)
    Child->>PromptBuilder: build messages\n(system_msg, user_msg, namespace)
    PromptBuilder-->>Child: messages
    Child->>Provider: honcho_llm_call_inner(messages)
    Provider-->>Child: response + usage_metadata
    Child->>Results: emit per-call metrics JSON\n(duration, tokens, cache_read/create, preview)
    Orchestrator-->>Results: aggregate scenario results
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Suggested reviewers

  • VVoruganti
  • Rajat-Ahuja1997

Poem

🐰
I hopped through prompts both old and new,
Split system from user to make caches true,
Spawned tiny subprocesses to probe and see,
Tokens tallied, latencies set free,
Prefix gardens hum — cached replies for thee!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly and specifically describes the main objective of the PR: optimizing prefix-based caching through prompt restructuring to separate stable system instructions from variable content.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 95.16%, above the required 80.00% threshold.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

@adavyas adavyas marked this pull request as ready for review March 14, 2026 20:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/deriver/deriver.py (1)

127-149: ⚠️ Potential issue | 🟠 Major

prompt="" regresses the Gemini deriver path.

Line 129 now empties prompt, and Lines 142-148 move the entire request into messages. That works for the text-response providers, but the Google structured-output branch in src/utils/clients.py still calls generate_content(... contents=prompt ...) when response_model is set. With settings.DERIVER.PROVIDER == "google", this sends an empty prompt and drops both the instructions and the conversation body. Please either keep passing the combined prompt here until that client path understands messages, or fix the client and add a regression test for response_model + messages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/deriver/deriver.py` around lines 127 - 149, The call to honcho_llm_call
in deriver.py sets prompt="" and moves instructions into messages, which breaks
the Google structured-output path that still calls generate_content(...
contents=prompt ...) when response_model is set; either restore a combined
prompt (e.g., concatenate minimal_deriver_system_prompt(observed) +
minimal_deriver_user_prompt(formatted_messages)) when calling honcho_llm_call,
or (preferred) update the Google client in src/utils/clients.py (the
generate_content/Google structured-output branch) to accept and use the messages
parameter when response_model is present instead of only using prompt; after
fixing, add a regression test that sets settings.DERIVER.PROVIDER == "google"
and calls honcho_llm_call with response_model=PromptRepresentation and messages
to ensure the instructions+conversation are sent and parsed correctly.
🧹 Nitpick comments (3)
src/deriver/prompts.py (1)

51-69: Compose minimal_deriver_prompt() from the split helpers.

This is now the only place still hand-building the <messages> wrapper. Reusing minimal_deriver_user_prompt() keeps the combined helper aligned with estimate_minimal_deriver_prompt_tokens() and avoids prompt drift.

♻️ Proposed cleanup
 def minimal_deriver_prompt(
     peer_id: str,
     messages: str,
 ) -> str:
@@
-    return c(
-        f"""
-{minimal_deriver_system_prompt(peer_id)}
-
-Messages to analyze:
-<messages>
-{messages}
-</messages>
-"""
-    )
+    return "\n\n".join(
+        [
+            minimal_deriver_system_prompt(peer_id),
+            minimal_deriver_user_prompt(messages),
+        ]
+    )

Based on learnings: deriver token estimation should use empty inputs against the same prompt template structure to avoid double-counting wrapper overhead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/deriver/prompts.py` around lines 51 - 69, The combined prompt in
minimal_deriver_prompt is hand-wrapping <messages> instead of reusing the split
helper; change minimal_deriver_prompt to build its return by concatenating
minimal_deriver_system_prompt(peer_id) with minimal_deriver_user_prompt(...)
rather than manually injecting the <messages> wrapper so the template matches
estimate_minimal_deriver_prompt_tokens and avoids prompt drift—locate
minimal_deriver_prompt and replace the manual block with a call to
minimal_deriver_user_prompt using the same arguments (and keep
minimal_deriver_system_prompt(peer_id) as before).
scripts/compare_prefix_cache.py (2)

259-274: Use an answerable prompt in change_base.

Lines 263, 268, and 273 ask for a city that never appears in either fixture block. That makes the only base-change scenario sensitive to hallucination/abstention behavior instead of just cache behavior, which adds noise to the latency and output-token comparison.

Suggested change
-                "user_query": "What city is the user considering for a move?",
+                "user_query": "What exact date did the user move the launch to?",
@@
-                "user_query": "What city is the user considering for a move?",
+                "user_query": "What exact date did the user move the launch to?",
@@
-                "user_query": "What city is the user considering for a move?",
+                "user_query": "What exact date did the user move the launch to?",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/compare_prefix_cache.py` around lines 259 - 274, The change_base
scenario currently uses a user_query asking for a city that never appears in the
fixture blocks, which conflates cache behavior with model
hallucination/abstention; update the "change_base" entries (prime, transition,
steady) to use an answerable prompt that references a city present in the
fixtures (keep BASE_PREFIX, BASE_PREFIX_VARIANT, and ROLLING_HISTORY_A as-is) so
the comparison focuses on cache behavior rather than factual recall—modify the
"user_query" values under the "change_base" dict to a question that can be
answered from the fixture content.

119-129: Add Google-style docstrings to the new helpers and dataclasses.

This file introduces public dataclasses and top-level helpers without docstrings, which makes the probe contract harder to follow when someone updates the scenarios or result shape later.

As per coding guidelines, **/*.py: Use Google style docstrings.

Also applies to: 132-438

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/compare_prefix_cache.py` around lines 119 - 129, Add Google-style
docstrings to each public dataclass and top-level helper in this module
(including ProviderSpec and VariantSpec and the functions introduced between
lines 132-438) so the probe contract is explicit. For each dataclass
(ProviderSpec, VariantSpec) add a short one-line summary, followed by Args
describing each field (label, provider, model for ProviderSpec; label, worktree
for VariantSpec). For each top-level helper function, add a Google-style
docstring with a one-line summary, Args for parameters, Returns for return
values, and Raises if exceptions are expected. Ensure docstrings are placed
directly above the class/function definitions and use consistent formatting
across the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/compare_prefix_cache.py`:
- Around line 282-306: The namespace built in build_scenario_payload is
deterministic and must be salted per process run; generate a single run-level
salt (e.g., invocation_salt) once at script startup and append it to the
namespace construction inside build_scenario_payload (the namespace variable
used in the call to build_messages) so all three calls
('prime','transition','steady') share the same salted namespace for that
invocation; apply the same change to the other namespace-building function
referenced around lines 408-425 so every cache key includes the same run salt,
ensuring uniqueness across runs while preserving stability within a single run.
- Around line 25-33: The child subprocess currently expects a huge payload via
argv (CHILD_CODE using json.loads(sys.argv[1])) which breaks on Windows and has
no timeout; refactor the launcher to send the payload over stdin to the child
(keep CHILD_CODE but read from sys.stdin.buffer or sys.stdin.read instead of
sys.argv), add a bounded timeout to the subprocess invocation so provider stalls
can't hang, and wrap the parent-side json.loads(process.stdout) call in
try/except to catch JSONDecodeError and include the raw stdout/stderr in the
logged/raised error to make stdout contamination debuggable; update any calls
referencing honcho_llm_call_inner to use the stdin-based protocol and timeout
behavior.
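The stdin-based protocol suggested above can be sketched like this. The child code and `run_child` helper are illustrative stand-ins for the script's actual launcher, but they show the three suggested fixes: payload over stdin instead of argv, a bounded timeout, and raw streams surfaced on JSON decode failure:

```python
import json
import subprocess
import sys

# The child reads its payload from stdin instead of argv, avoiding
# platform argv-length limits; the echo protocol here is illustrative.
CHILD_CODE = (
    "import json, sys\n"
    "payload = json.loads(sys.stdin.read())\n"
    "print(json.dumps({'echo': payload}))\n"
)

def run_child(payload: dict, timeout: float = 60.0) -> dict:
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,  # bound provider stalls so the run cannot hang
    )
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        # Include raw streams so stdout contamination is debuggable.
        raise RuntimeError(
            f"child emitted invalid JSON; stdout={proc.stdout!r} stderr={proc.stderr!r}"
        ) from exc
```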


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cb68a9ad-af1d-48ea-9d70-716dde799eb9

📥 Commits

Reviewing files that changed from the base of the PR and between 24f94f3 and eebcbab.

📒 Files selected for processing (8)
  • scripts/compare_prefix_cache.py
  • src/deriver/deriver.py
  • src/deriver/prompts.py
  • src/dialectic/core.py
  • src/utils/clients.py
  • src/utils/summarizer.py
  • tests/utils/test_clients.py
  • tests/utils/test_summarizer.py
@adavyas adavyas changed the title fix/prefix-based-cache-optim Mar 14, 2026
@adavyas adavyas changed the title Fix: Optimizes Prefix-Based Caching Mar 14, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/deriver/prompts.py (1)

20-35: ⚠️ Potential issue | 🟠 Major

Resolve explicit-rule contradiction in examples.

Line 20 says extraction must be direct/explicit, but Lines 33-35 include inference-heavy outputs (e.g., “lives in NYC”, “completed high school”). That can drive non-explicit derivations and reduce output fidelity.

Suggested prompt correction
 [EXPLICIT] DEFINITION: Facts about {peer_id} that can be derived directly from their messages.
+  - Do not infer unstated facts from general/world knowledge.

 EXAMPLES:
 - EXPLICIT: "I just had my 25th birthday last Saturday" → "{peer_id} is 25 years old", "{peer_id}'s birthday is June 21st"
-- EXPLICIT: "I took my dog for a walk in NYC" → "{peer_id} has a dog", "{peer_id} lives in NYC"
-- EXPLICIT: "{peer_id} attended college" + general knowledge → "{peer_id} completed high school or equivalent"
+- EXPLICIT: "I took my dog for a walk in NYC" → "{peer_id} has a dog", "{peer_id} was in NYC"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/deriver/prompts.py` around lines 20 - 35, The EXPLICIT rule conflicts
with the EXAMPLES: update the prompt so examples match the explicit-only
extraction requirement by editing the EXPLICIT/RULES/EXAMPLES block—either (A)
tighten the EXAMPLES to only include directly stated conclusions (e.g., change
"{peer_id} lives in NYC" to "{peer_id} said they were in NYC" and "{peer_id}
completed high school or equivalent" to "{peer_id} said they attended college"),
or (B) if you intend to allow minimal inference, explicitly relax the EXPLICIT
definition to permit only logically trivial inferences and document that in the
EXPLICIT header; make this change where the "EXPLICIT:" heading and the EXAMPLES
lines appear so the examples and rule text are consistent.
🧹 Nitpick comments (3)
tests/utils/test_clients.py (1)

94-102: Consider adding edge case tests for completeness.

The test covers the happy path but doesn't test edge cases like None metadata or non-integer values that the implementation handles. While not blocking, additional test coverage would improve confidence.

🧪 Additional test cases to consider
def test_extract_gemini_cache_tokens_none_metadata(self):
    """None metadata should return (0, 0)."""
    cache_creation, cache_read = extract_gemini_cache_tokens(None)
    assert cache_creation == 0
    assert cache_read == 0

def test_extract_gemini_cache_tokens_missing_attribute(self):
    """Missing cached_content_token_count should return 0 for cache_read."""
    usage_metadata = Mock(spec=[])  # No attributes
    cache_creation, cache_read = extract_gemini_cache_tokens(usage_metadata)
    assert cache_creation == 0
    assert cache_read == 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/test_clients.py` around lines 94 - 102, Add unit tests to cover
edge cases for extract_gemini_cache_tokens: add a test that passes None (e.g.,
test_extract_gemini_cache_tokens_none_metadata) and assert it returns (0, 0),
and add a test that passes a Mock with no cached_content_token_count attribute
(e.g., test_extract_gemini_cache_tokens_missing_attribute) and assert
cache_creation and cache_read are both 0; place these alongside the existing
test in tests/utils/test_clients.py and reference the
extract_gemini_cache_tokens function in the assertions.
scripts/compare_prefix_cache.py (1)

201-219: Consider validating that parsed components are non-empty.

The current check verifies structural format but allows empty strings (e.g., "=:" or "label=:"). Adding a check for non-empty components would provide clearer feedback.

Suggested validation
 def parse_provider_spec(raw: str) -> ProviderSpec:
     if "=" not in raw or ":" not in raw:
         raise ValueError(
             f"Invalid provider spec {raw!r}. Expected label=provider:model"
         )
     label, provider_model = raw.split("=", 1)
     provider, model = provider_model.split(":", 1)
+    if not label or not provider or not model:
+        raise ValueError(
+            f"Invalid provider spec {raw!r}. Label, provider, and model must be non-empty"
+        )
     return ProviderSpec(label=label, provider=provider, model=model)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/compare_prefix_cache.py` around lines 201 - 219, The
parse_provider_spec function currently only checks separators but allows empty
label/provider/model parts; update parse_provider_spec to validate that label,
provider, and model are non-empty after splitting and raise a ValueError with a
clear message if any are empty. Specifically, after the split that produces
label, provider, and model, add checks like if not label: raise ValueError(...),
and similarly for provider and model, ensuring the returned
ProviderSpec(label=label, provider=provider, model=model) only occurs when all
three components are non-empty.
src/deriver/prompts.py (1)

73-82: Use explicit exception types for token-estimate fallback.

Line 81 catches all exceptions, which can mask unexpected failures. Narrow this to expected tokenizer/input errors.

Suggested exception handling update
-    except Exception:
+    except (TypeError, ValueError):
         return 300

As per coding guidelines: "Use explicit error handling with appropriate exception types".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/deriver/prompts.py` around lines 73 - 82, The current blanket except in
the token estimate fallback hides unexpected errors; change the broad except in
the block that builds prompt using minimal_deriver_system_prompt,
minimal_deriver_user_prompt and calls estimate_tokens to catch only expected
tokenization/input errors (e.g., ValueError and the tokenizer-specific exception
your estimate_tokens wrapper raises such as TokenizationError or
TokenizerError), optionally log the caught exception, and still return the 300
fallback for those cases; do not swallow other exception types so unexpected
issues surface.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/clients.py`:
- Around line 137-140: system messages with non-string content are being
dropped: when iterating messages (check role and msg.get("content")) the code
only appends when content is a string, so system messages containing a list
(e.g., outputs from _normalize_cacheable_system_content) are skipped and lost
for downstream Gemini calls. Fix by detecting list content for the "system" role
and merging it into system_instruction_parts (e.g., extend
system_instruction_parts with the list or join its items into a single string)
so system instructions from both strings and lists are preserved; update the
branch that currently checks isinstance(msg.get("content"), str) to also handle
list types and ensure the resulting system_instruction_parts is compatible with
the Gemini call path.
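The merge described in this comment can be sketched as follows. `merge_system_content` is a hypothetical helper name, and the block shape follows the cacheable-text-block convention described in the walkthrough; the point is that list-valued system content contributes its text parts instead of being dropped:

```python
from typing import Any

def merge_system_content(content: Any) -> list[str]:
    # Hypothetical helper: accept either a plain string or a list of
    # cacheable text blocks (e.g. {"type": "text", "text": ...}) and
    # return the text parts so no system instructions are lost.
    if isinstance(content, str):
        return [content]
    if isinstance(content, list):
        parts: list[str] = []
        for block in content:
            if isinstance(block, dict) and block.get("text"):
                parts.append(block["text"])
            elif isinstance(block, str):
                parts.append(block)
        return parts
    return []
```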


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3baef6e7-24dd-42c4-9fc0-b8558f6cbcc3

📥 Commits

Reviewing files that changed from the base of the PR and between 1b32a8a and 4ceec9f.

📒 Files selected for processing (4)
  • scripts/compare_prefix_cache.py
  • src/deriver/prompts.py
  • src/utils/clients.py
  • tests/utils/test_clients.py

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/deriver/prompts.py (1)

14-37: Move peer_id into the user message if you want the system prefix to stay cache-friendly.

The caller already sends separate system and user messages, but this system template still bakes peer_id into the cacheable block. That means the expensive prefix changes for every peer, so cache reuse is mostly limited to repeated calls for the same peer instead of amortizing across peers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/deriver/prompts.py` around lines 14 - 37, The system prompt function
minimal_deriver_system_prompt currently interpolates peer_id making the system
message non-cacheable; remove peer_id interpolation from
minimal_deriver_system_prompt so it returns a generic, cache-friendly system
template (no peer-specific values), and add a separate user-level prompt
function or message (e.g., minimal_deriver_user_prompt(peer_id) or caller-side
user message) that injects the peer_id into the user content; update the caller
to send the generic system prompt plus the peer-specific user prompt (containing
peer_id) so the expensive system prefix is reusable across peers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/clients.py`:
- Around line 161-167: The current Gemini conversion drops any non-"text" blocks
from msg["content"], causing loss of tool context like tool_use/tool_result when
retries switch providers; update the loop that builds text_parts/gemini_contents
so that instead of only appending {"text": ...} for text blocks it preserves
other block types by including them in the parts array (e.g., append the block
payload or a minimal mapping with block["type"] and its data) so gemini_contents
records both plain text parts and non-text parts (tool_use/tool_result) for the
given role without altering conversation_messages.

---

Nitpick comments:
In `@src/deriver/prompts.py`:
- Around line 14-37: The system prompt function minimal_deriver_system_prompt
currently interpolates peer_id making the system message non-cacheable; remove
peer_id interpolation from minimal_deriver_system_prompt so it returns a
generic, cache-friendly system template (no peer-specific values), and add a
separate user-level prompt function or message (e.g.,
minimal_deriver_user_prompt(peer_id) or caller-side user message) that injects
the peer_id into the user content; update the caller to send the generic system
prompt plus the peer-specific user prompt (containing peer_id) so the expensive
system prefix is reusable across peers.
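The split described above can be sketched like this. The function bodies and prompt text are illustrative placeholders, not the repository's actual prompts; only the stable-system/variable-user structure is the point:

```python
def minimal_deriver_system_prompt() -> str:
    """Stable, peer-agnostic instructions: identical bytes on every call,
    so provider-side prefix caching can reuse them across peers."""
    return (
        "You extract explicit facts from conversation messages.\n"
        "Refer to the speaker using the identifier given in the user message."
    )

def minimal_deriver_user_prompt(peer_id: str, message: str) -> str:
    """Per-request content: everything that varies goes here."""
    return f"Speaker identifier: {peer_id}\nMessage:\n{message}"

def build_messages(peer_id: str, message: str) -> list[dict[str, str]]:
    return [
        {"role": "system", "content": minimal_deriver_system_prompt()},
        {"role": "user", "content": minimal_deriver_user_prompt(peer_id, message)},
    ]
```

With this shape, two calls for different peers share an identical system prefix, which is what makes the expensive prefix cacheable.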

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6bf7c450-3654-4ca8-8b13-b6533ebe5687

📥 Commits

Reviewing files that changed from the base of the PR and between 4ceec9f and 56924eb.

📒 Files selected for processing (4)
  • scripts/compare_prefix_cache.py
  • src/deriver/prompts.py
  • src/utils/clients.py
  • tests/utils/test_clients.py

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/clients.py`:
- Around line 182-196: The current code constructs function_response.name using
tool_use_id which is incorrect; update the logic that builds the parts list for
block_type == "tool_result" to look up the corresponding "tool_use" block in the
message history by matching its id (the value currently in
block.get("tool_use_id")), extract the declared function/tool name from that
tool_use block (e.g., a "tool" or "name" field inside the matching tool_use
entry) and set function_response.name to that actual function name; if no
matching tool_use or name field is found, fall back to the existing tool_use_id
string to avoid breaking, and keep the existing tool_result json/string
conversion and is_error mapping.
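The suggested lookup might look like the sketch below. Message and block shapes here are assumed from the comment above, not copied from `src/utils/clients.py`, and the helper name is invented:

```python
from typing import Any

def resolve_function_name(
    messages: list[dict[str, Any]], tool_use_id: str
) -> str:
    """Find the tool_use block whose id matches tool_use_id and return its
    declared function name; fall back to the id itself if no match exists."""
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if block.get("type") == "tool_use" and block.get("id") == tool_use_id:
                # Fall back to the id if the block has no name field.
                return block.get("name", tool_use_id)
    return tool_use_id
```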

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b63882e-3564-42c7-a471-33d88f8695a7

📥 Commits

Reviewing files that changed from the base of the PR and between 56924eb and d2ce887.

📒 Files selected for processing (2)
  • src/utils/clients.py
  • tests/utils/test_clients.py
Comment on lines +32 to +35
EXAMPLES:
- EXPLICIT: "I just had my 25th birthday last Saturday" → "{peer_id} is 25 years old", "{peer_id}'s birthday is June 21st"
- EXPLICIT: "I took my dog for a walk in NYC" → "{peer_id} has a dog", "{peer_id} lives in NYC"
- EXPLICIT: "{peer_id} attended college" + general knowledge → "{peer_id} completed high school or equivalent"
- EXPLICIT: "I just had my 25th birthday last Saturday" → "{peer_id} is 25 years old", "{peer_id} celebrated their birthday last Saturday"
- EXPLICIT: "I took my dog for a walk in NYC" → "{peer_id} has a dog", "{peer_id} said they took their dog for a walk in NYC"
- EXPLICIT: "{peer_id} attended college" → "{peer_id} said they attended college"
Collaborator


i would not make these example changes because the goal of the deriver is to extract meaningful explicit facts + not to repeat back the raw message

Returns:
Formatted prompt string for observation extraction.
"""
def minimal_deriver_system_prompt(peer_id: str) -> str:
Collaborator


keeping the peer_id in the system prompt results in the system prompt only being cached for deriver calls to the same peer. we could move the peer context to minimal_deriver_user_prompt but not sure if the deriver would perform worse in this case. what do you think @VVoruganti

Comment on lines +127 to +128
def _build_gemini_contents_from_messages(
messages: list[dict[str, Any]],
Collaborator


should we use this in the streaming gemini path too?
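If the streaming path is changed to share the helper, it could look roughly like this. Everything except the helper's name is invented for illustration, including the stand-in helper body:

```python
from typing import Any, Iterator

def _build_gemini_contents_from_messages(
    messages: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    # Stand-in for the real helper: one Gemini "content" per message.
    return [
        {"role": m["role"], "parts": [{"text": str(m["content"])}]}
        for m in messages
    ]

def stream_gemini(messages: list[dict[str, Any]]) -> Iterator[str]:
    # Reuse the shared converter instead of rebuilding contents ad hoc,
    # so streaming and non-streaming paths stay consistent.
    contents = _build_gemini_contents_from_messages(messages)
    for c in contents:  # stand-in for iterating streamed chunks
        yield c["parts"][0]["text"]
```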

@Rajat-Ahuja1997
Collaborator

generally looks good—I would revert the prompt examples, look into caching of the deriver prompt to see if we should extract peer_id, and handle the gemini streaming path

Collaborator


in the dialectic, session_history is now its own messages block. that info would be lost here as we only take messages[0].content. this was a pre-existing issue but accidentally worked because we bundled session history into messages[0] but we should fix this now
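A minimal sketch of the fix, assuming plain-string message contents (the function name is invented for the sketch): fold every message into the prompt instead of only `messages[0]`, so a separate session_history message is no longer silently dropped.

```python
from typing import Any

def flatten_messages(messages: list[dict[str, Any]]) -> str:
    """Join all message contents, not just messages[0].content."""
    return "\n\n".join(
        str(m.get("content", "")) for m in messages if m.get("content")
    )
```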
