metrics for deriver / dialectic input + output tokens #274
Conversation
Walkthrough
Token accounting was made explicit.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Deriver as Deriver.process_representation_tasks_batch
    participant Prom as Prometheus
    participant Reasoner as CertaintyReasoner.reason
    participant LLM as critical_analysis_call
    participant Provider as LLM Provider (streaming/non-streaming)
    Client->>Deriver: submit tasks
    Deriver->>Deriver: estimate components (peer, rep, prompt, new_turns)
    Deriver->>Deriver: get session_context_tokens
    Deriver->>Prom: track_input_tokens(task_type, {peer, rep, prompt, new_turns, session_context})
    Deriver->>Reasoner: run reason()
    Reasoner->>LLM: critical_analysis_call(...) -- no estimated_input_tokens param
    LLM->>Provider: call provider
    Provider-->>LLM: stream chunks (final chunk may include output_tokens)
    LLM-->>Reasoner: response (response.output_tokens when available)
    Reasoner->>Prom: increment output metric (token_type="output", component="total")
    Reasoner-->>Deriver: representation result
    Deriver-->>Client: final response
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/test_llm_mock.py (1)
7-13: Fix invalid and ineffective `isinstance(..., str | object)` checks

Using `isinstance(short_result, str | object)` / `isinstance(long_result, str | object)` is problematic:

- At runtime, `str | object` is a PEP 604 union, not a plain type or tuple, so `isinstance` may raise `TypeError` depending on the Python version.
- Even if it were accepted, `object` matches everything, so the assertion is always true and doesn't validate anything useful.

Given the subsequent logic already distinguishes between string and "object with content", a simpler pattern is:
```diff
-    # For functions with return_call_response=True, we should get a string or object with content
-    # The existing mock returns a string, so we check if it's a string
-    assert isinstance(short_result, str | object)
-    assert isinstance(long_result, str | object)
-    # If it's not a string, check for content attribute
-    if not isinstance(short_result, str):
-        assert hasattr(short_result, "content")
-    if not isinstance(long_result, str):
-        assert hasattr(long_result, "content")
+    # For functions with return_call_response=True, we should get either a string
+    # or an object with a `content` attribute.
+    assert isinstance(short_result, str) or hasattr(short_result, "content")
+    assert isinstance(long_result, str) or hasattr(long_result, "content")
```

This avoids runtime issues and directly encodes the behavior you care about.
Also applies to: 23-55, 80-90
src/deriver/deriver.py (1)
166-182: Input token estimation looks good, but the `task_type` label differs between input and output

The input token estimation in `process_representation_tasks_batch`:

- uses `estimate_base_prompt_tokens()` + dynamic components (peer card, working_representation, new turns, session context), and
- passes their sum into `CertaintyReasoner` as `estimated_input_tokens`,

which is consistent with the prior deriver estimation approach and should give a reasonable upper bound without double-counting template overhead. Based on learnings.
In `CertaintyReasoner.reason`, the subsequent metric increment is:

```python
prometheus.DERIVER_TOKENS_PROCESSED.labels(
    task_type="critical_analysis",
    token_type="input",
).inc(self.estimated_input_tokens)
```

while the corresponding output tokens for the same logical call are recorded earlier with:

```python
prometheus.DERIVER_TOKENS_PROCESSED.labels(
    task_type="representation",
    token_type="output",
).inc(response.output_tokens)
```

This means input and output tokens for a single critical-analysis/representation pass are labeled with different `task_type` values, which makes per-task input/output ratios harder to derive.

If the intent is to attribute both input and output tokens to the same logical task, consider aligning the `task_type` label (e.g., both `"representation"` or both `"critical_analysis"`). If they're meant to be distinct series, a brief comment explaining the distinction would help future readers.

Also applies to: 241-247, 345-348
src/utils/clients.py (1)
882-907: Wire Groq streaming `output_tokens` by enabling and extracting usage data

Groq's streaming API can return usage tokens on the final chunk, but it requires passing `stream_options={"include_usage": True}` in the request. The current code at lines 882-907 doesn't include this parameter and doesn't check for `chunk.usage` on the final chunk, so output tokens are never surfaced.

This creates a parity gap: Anthropic, OpenAI, and Gemini streaming implementations all populate `output_tokens` on final chunks, allowing `dialectic_stream` (src/dialectic/chat.py:188) to increment output-token metrics. Groq streaming calls never increment these metrics because `output_tokens` remains `None`.

Add `stream_options` to `groq_params` and extract `completion_tokens` from the final chunk's `usage` field (similar to how OpenAI handles it at line 812), then populate the final chunk's `output_tokens` field.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- src/deriver/deriver.py (2 hunks)
- src/dialectic/chat.py (6 hunks)
- src/dialectic/prompts.py (2 hunks)
- src/prometheus.py (1 hunks)
- src/utils/clients.py (5 hunks)
- tests/test_llm_mock.py (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:169-183
Timestamp: 2025-09-25T19:53:21.360Z
Learning: In the deriver token estimation, the approach uses empty inputs to get the base prompt template token count via `estimate_base_prompt_tokens()`, then manually adds tokens for actual inputs (peer_card, working_representation, new_turns). This avoids double-counting wrapper overhead since the base prompt already includes template structure.
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 144
File: src/crud.py:1250-1250
Timestamp: 2025-06-26T18:39:54.942Z
Learning: Rajat-Ahuja1997 is comfortable exposing token counts in MessageCreate schema as a public property when it improves code maintainability and avoids type checking errors.
📚 Learning: 2025-09-25T19:53:21.360Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:169-183
Timestamp: 2025-09-25T19:53:21.360Z
Learning: In the deriver token estimation, the approach uses empty inputs to get the base prompt template token count via `estimate_base_prompt_tokens()`, then manually adds tokens for actual inputs (peer_card, working_representation, new_turns). This avoids double-counting wrapper overhead since the base prompt already includes template structure.
Applied to files:
- src/dialectic/chat.py
- src/deriver/deriver.py
- src/dialectic/prompts.py
📚 Learning: 2025-09-05T21:35:05.674Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 202
File: src/routers/peers.py:190-193
Timestamp: 2025-09-05T21:35:05.674Z
Learning: In src/routers/peers.py, the streaming chat endpoint yields raw chunk.content from HonchoLLMCallStreamChunk without SSE formatting, and this works correctly with their custom streaming client setup that consumes Anthropic API responses.
Applied to files:
src/utils/clients.py
🧬 Code graph analysis (4)
src/dialectic/chat.py (3)
src/dialectic/prompts.py (2)
- `dialectic_prompt` (7-144)
- `estimate_dialectic_prompt_tokens` (148-167)

src/utils/tokens.py (1)
- `estimate_tokens` (6-15)

src/prometheus.py (1)
- `labels` (30-34)
src/deriver/deriver.py (1)
src/prometheus.py (1)
- `labels` (30-34)
tests/test_llm_mock.py (2)
src/utils/representation.py (1)
- `PromptRepresentation` (29-41)

src/deriver/deriver.py (1)
- `critical_analysis_call` (37-75)
src/dialectic/prompts.py (1)
src/utils/tokens.py (1)
- `estimate_tokens` (6-15)
🪛 Ruff (0.14.5)
src/dialectic/chat.py
104-104: Possible hardcoded password assigned to argument: "token_type"
(S106)
108-108: Possible hardcoded password assigned to argument: "token_type"
(S106)
182-182: Possible hardcoded password assigned to argument: "token_type"
(S106)
186-186: Missing return type annotation for private function log_streaming_response
(ANN202)
190-190: Possible hardcoded password assigned to argument: "token_type"
(S106)
src/deriver/deriver.py
72-72: Possible hardcoded password assigned to argument: "token_type"
(S106)
347-347: Possible hardcoded password assigned to argument: "token_type"
(S106)
src/dialectic/prompts.py
165-165: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (9)
src/utils/clients.py (4)
96-111: Streaming chunk API extension looks consistent

The addition of `output_tokens: int | None` on `HonchoLLMCallStreamChunk` with the "only set on the final chunk" contract matches how the provider-specific streaming branches emit a final metadata-only chunk. This should compose cleanly with downstream code that already only cares about `content` for intermediate chunks.
751-770: Anthropic streaming token extraction is sound

Using `anthropic_stream.get_final_message()` and pulling `usage.output_tokens` into the final `HonchoLLMCallStreamChunk` aligns with Anthropic's API and keeps intermediate chunks content-only. This should give accurate output token counts without affecting existing consumers that only stream `chunk.content`.
772-825: Verify `stream_options={"include_usage": True}` across all AsyncOpenAI backends

The OpenAI streaming branch now unconditionally passes `stream_options={"include_usage": True}` and expects `chunk.usage.completion_tokens` on a dedicated usage chunk. That's correct for first-party OpenAI, but `AsyncOpenAI()` is also used for `"custom"` and `"vllm"` providers here.

If any OpenAI-compatible backend doesn't support `stream_options` or the `usage` field on streamed chunks, this path could raise or silently never set `output_tokens` (only triggering the "stream ended without usage chunk" warning).

I recommend verifying:

- That each configured AsyncOpenAI backend (OpenAI, custom base_url, vLLM) accepts `stream_options.include_usage`.
- That they actually emit a usage chunk with `completion_tokens`.

If not, we may need a provider-aware branch (only set `stream_options` for known-supporting providers) or a fallback that estimates tokens instead of relying on streamed usage.
826-880: Gemini streaming output token handling aligns with non-streaming, but confirm field shape

The streaming Gemini branch pulls `final_chunk.usage_metadata.candidates_token_count` into `gemini_output_tokens`, consistent with the non-streaming path. Assuming `candidates_token_count` is an int (or optional int) as in the client library, this will correctly populate `output_tokens` on the final chunk while leaving earlier chunks untouched.

It's worth double-checking against the exact `google.genai` version that `candidates_token_count` isn't a list or nested structure; if it is, we'd need to sum or index it rather than pass it through directly.

src/dialectic/chat.py (1)
63-74: Dialectic input/output token metrics for non-streaming calls look correct

The pattern of:

- deriving `base_prompt_tokens = estimate_dialectic_prompt_tokens()`, then
- estimating dynamic input tokens from the concatenation of `query`, `working_representation`, history, and peer cards, and
- incrementing `DIALECTIC_TOKENS_PROCESSED` once for `token_type="input"` and once for `token_type="output"`

matches the deriver's established approach of "base template + dynamic inputs" and should avoid double-counting prompt wrapper overhead. The metric wiring and label usage also align with the new `DIALECTIC_TOKENS_PROCESSED` definition. Based on learnings.
Also applies to: 86-110
src/prometheus.py (1)
96-110: New token metrics look consistent; confirm intentional deriver metric rename

The updated metrics definitions are coherent:

- `DERIVER_TOKENS_PROCESSED`: `"deriver_tokens_processed_total"` with labels `(namespace, task_type, token_type)` matches how deriver code now passes both `task_type` and `token_type`.
- `DIALECTIC_TOKENS_PROCESSED`: `"dialectic_tokens_processed_total"` with `(namespace, token_type)` matches the dialectic usage.

One thing to double-check: renaming the deriver metric from its previous name (`*_tokens_processed_total`) is a breaking change for any existing Prometheus rules, dashboards, or alerts. If this rename is intentional, it's worth noting in release notes or migration docs so observability doesn't silently regress.

Also applies to: 112-124
src/deriver/deriver.py (1)
36-75: Deriver output token tracking is wired correctly through `critical_analysis_call`

`critical_analysis_call` now returns a `PromptRepresentation` and increments `DERIVER_TOKENS_PROCESSED` with:

- `task_type="representation"`
- `token_type="output"`
- amount `response.output_tokens`

This matches the new metric signature and keeps the output token accounting close to the actual LLM call site, which is a good place for this responsibility.
src/dialectic/prompts.py (2)
1-1: LGTM: Imports support token estimation functionality.

The new imports correctly support the base prompt token estimation feature, with `cache` enabling efficient caching and `estimate_tokens` providing the token counting utility.

Also applies to: 4-5

147-168: LGTM: Token estimation follows established pattern with appropriate error handling.

The implementation correctly follows the deriver's approach of estimating base prompt tokens with empty inputs to avoid double-counting. The `@cache` decorator is well-suited since the base template only changes on redeploys. The broad exception handling flagged by static analysis is intentionally designed for robustness: it ensures the metrics pipeline doesn't fail on estimation errors and provides a conservative fallback for capacity planning. Based on learnings.
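The "base template with empty inputs, plus dynamic content" pattern reads roughly like this sketch (the helper names and the 4-characters-per-token heuristic are illustrative, not the project's actual implementations):

```python
from functools import cache

def estimate_tokens(text: str) -> int:
    # Stand-in tokenizer: rough 4-characters-per-token heuristic.
    return max(1, len(text) // 4)

def build_prompt(query: str, history: str) -> str:
    # Stand-in for the real prompt template.
    return (
        "<system>Answer carefully.</system>\n"
        f"<query>{query}</query>\n<history>{history}</history>"
    )

@cache
def base_prompt_tokens() -> int:
    # Empty inputs isolate the template-only cost; cached because the
    # template only changes on redeploy.
    return estimate_tokens(build_prompt("", ""))

def estimated_input_tokens(query: str, history: str) -> int:
    # Add dynamic content on top of the cached base, so the template
    # wrapper is never double-counted.
    return base_prompt_tokens() + estimate_tokens(query) + estimate_tokens(history)

print(estimated_input_tokens("what does the user like?", "user: I like tea"))
```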
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/prometheus.py (1)
84-112: Update `DERIVER_TOKENS_PROCESSED` doc comment to reflect new usage.

The metric is now incremented from both the deriver (`task_type="representation"`) and the summarizer (`task_type="summary"`), not just "after the critical analysis call is made". Consider documenting the broader usage so future readers don't assume it's deriver-only.
tests/utils/test_clients.py (3)
218-265: Consider asserting on output tokens in Anthropic streaming test.

You now mock `mock_final_message.usage.output_tokens` for the Anthropic stream, but the test only checks content and finish reasons. To fully exercise the new token-parsing behavior, consider asserting that the final chunk (or aggregate result) exposes the expected output token count derived from `usage`.
627-679: Add coverage for Google/Gemini streaming output tokens.

The final Google streaming chunk now carries `usage_metadata.candidates_token_count=35`, but the test doesn't assert any token value coming out of `handle_streaming_response`. Adding an assertion that the surfaced token count matches `candidates_token_count` would better validate the new output-token parsing for this provider.
950-989: Extend main streaming call test to verify output token propagation.

`test_streaming_call` now wires `mock_final_message.usage.output_tokens=28`, but doesn't assert that this value is available to callers (e.g., on the final chunk or in any aggregation). Given the PR's focus on token metrics, consider adding an assertion that whatever surface you've chosen for streaming output tokens reflects this value.

src/utils/summarizer.py (1)
362-377: Remove debug print in `_create_and_save_summary`.

The `print("latest_summary", latest_summary)` and related token-count prints should be replaced with `logger.debug` calls with an appropriate message key.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- src/deriver/deriver.py (4 hunks)
- src/prometheus.py (1 hunks)
- src/utils/summarizer.py (7 hunks)
- tests/utils/test_clients.py (3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:169-183
Timestamp: 2025-09-25T19:53:21.360Z
Learning: In the deriver token estimation, the approach uses empty inputs to get the base prompt template token count via `estimate_base_prompt_tokens()`, then manually adds tokens for actual inputs (peer_card, working_representation, new_turns). This avoids double-counting wrapper overhead since the base prompt already includes template structure.
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 144
File: src/crud.py:1250-1250
Timestamp: 2025-06-26T18:39:54.942Z
Learning: Rajat-Ahuja1997 is comfortable exposing token counts in MessageCreate schema as a public property when it improves code maintainability and avoids type checking errors.
📚 Learning: 2025-09-25T19:53:21.360Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:169-183
Timestamp: 2025-09-25T19:53:21.360Z
Learning: In the deriver token estimation, the approach uses empty inputs to get the base prompt template token count via `estimate_base_prompt_tokens()`, then manually adds tokens for actual inputs (peer_card, working_representation, new_turns). This avoids double-counting wrapper overhead since the base prompt already includes template structure.
Applied to files:
- src/prometheus.py
- src/deriver/deriver.py
- src/utils/summarizer.py
📚 Learning: 2025-10-09T20:37:29.932Z
Learnt from: dr-frmr
Repo: plastic-labs/honcho PR: 230
File: src/deriver/deriver.py:439-444
Timestamp: 2025-10-09T20:37:29.932Z
Learning: In src/deriver/deriver.py, the team prefers to use `accumulate_metric` with unit "blob" to log the full peer card content for debugging purposes. The content size is acceptable, and the logging infrastructure is designed to handle blob metrics separately from numeric metrics.
Applied to files:
src/deriver/deriver.py
🧬 Code graph analysis (2)
src/deriver/deriver.py (4)

src/utils/summarizer.py (1)
- `track_input_tokens` (787-809)

src/prometheus.py (1)
- `labels` (30-34)

src/utils/tokens.py (1)
- `estimate_tokens` (6-15)

src/utils/representation.py (1)
- `is_empty` (129-133)

src/utils/summarizer.py (6)

src/utils/clients.py (5)
- `HonchoLLMCallResponse` (80-93)
- `honcho_llm_call` (114-130)
- `honcho_llm_call` (134-149)
- `honcho_llm_call` (153-168)
- `honcho_llm_call` (172-324)

src/utils/logging.py (4)
- `accumulate_metric` (152-165)
- `conditional_observe` (36-38)
- `conditional_observe` (42-45)
- `conditional_observe` (48-80)

src/utils/tokens.py (1)
- `estimate_tokens` (6-15)

src/models.py (1)
- `Message` (204-277)

src/deriver/deriver.py (1)
- `track_input_tokens` (36-72)

src/prometheus.py (1)
- `labels` (30-34)
🪛 Ruff (0.14.5)
src/deriver/deriver.py
46-46: Possible hardcoded password assigned to argument: "token_type"
(S106)
52-52: Possible hardcoded password assigned to argument: "token_type"
(S106)
58-58: Possible hardcoded password assigned to argument: "token_type"
(S106)
64-64: Possible hardcoded password assigned to argument: "token_type"
(S106)
70-70: Possible hardcoded password assigned to argument: "token_type"
(S106)
111-111: Possible hardcoded password assigned to argument: "token_type"
(S106)
249-250: Explicitly concatenated string should be implicitly concatenated
Remove redundant '+' operator to implicitly concatenate
(ISC003)
249-250: Logging statement uses +
(G003)
src/utils/summarizer.py
161-161: Do not catch blind exception: Exception
(BLE001)
177-177: Do not catch blind exception: Exception
(BLE001)
394-394: Possible hardcoded password assigned to argument: "token_type"
(S106)
795-795: Possible hardcoded password assigned to argument: "token_type"
(S106)
801-801: Possible hardcoded password assigned to argument: "token_type"
(S106)
807-807: Possible hardcoded password assigned to argument: "token_type"
(S106)
🔇 Additional comments (4)
src/utils/summarizer.py (1)
787-809: Summary input-token tracking helper looks consistent with deriver metrics.

The `track_input_tokens` helper neatly decomposes summary inputs into `base_prompt`, `messages`, and `previous_summary` components and uses `task_type="summary"` to distinguish from representation. This aligns with the expanded label set on `DERIVER_TOKENS_PROCESSED` and should make downstream PromQL reasonably straightforward.

src/deriver/deriver.py (3)
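The per-component fan-out described above can be pictured like this (a recording callback stands in for the real Prometheus counter; the component names mirror the review, and the token amounts are made up):

```python
# Records each (labels, amount) pair instead of incrementing Prometheus.
recorded: list[tuple[dict[str, str], int]] = []

def inc(labels: dict[str, str], amount: int) -> None:
    recorded.append((labels, amount))

def track_input_tokens(components: dict[str, int]) -> None:
    # One increment per input component, all under task_type="summary",
    # so each component becomes its own labeled series.
    for component, tokens in components.items():
        inc(
            {"task_type": "summary", "token_type": "input", "component": component},
            tokens,
        )

track_input_tokens({"base_prompt": 180, "messages": 950, "previous_summary": 120})
print(len(recorded))  # 3
```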
36-73: Input token tracking helper for representation aligns with prior design.

Breaking representation inputs into `peer_card`, `working_representation`, `base_prompt`, `new_turns`, and `session_context` components under `task_type="representation"` gives you the dimensionality requested in the earlier review while keeping label cardinality bounded. Looks good.
109-113: Output-token metric on `critical_analysis_call` is well-placed.

Incrementing `DERIVER_TOKENS_PROCESSED` for `task_type="representation"`, `token_type="output"`, `component="total"` using `response.output_tokens` cleanly ties LLM output usage to the deriver call. This should be easy to consume alongside the new input-component metrics.
222-247: Confirm intended use of `estimated_input_tokens` now that it's only for logging.

After removing it from `critical_analysis_call`, `estimated_input_tokens` is only used in the debug log (and to compute `available_context_tokens`). That's fine, but worth confirming there's no remaining consumer expecting this value elsewhere (e.g., observability dashboards). If nothing external depends on it, the current usage is coherent.
```python
@cache
def estimate_short_summary_base_prompt_tokens() -> int:
    """Estimate tokens for the base short summary prompt (without messages/previous_summary)."""
    try:
        return estimate_tokens(
            short_summary_prompt(
                messages=[],
                output_words=0,
                previous_summary_text="",
            )
        )
    except Exception:
        # Return a rough estimate if estimation fails
        return 200


@cache
def estimate_long_summary_base_prompt_tokens() -> int:
    """Estimate tokens for the base long summary prompt (without messages/previous_summary)."""
    try:
        return estimate_tokens(
            long_summary_prompt(
                messages=[],
                output_words=0,
                previous_summary_text="",
            )
        )
    except Exception:
        # Return a rough estimate if estimation fails
        return 200
```
Narrow or log the broad exception in base‑prompt token estimators.
Both `estimate_short_summary_base_prompt_tokens` and `estimate_long_summary_base_prompt_tokens` catch a bare `Exception` and silently fall back to 200. That's fine as a last resort, but you might want to either (a) catch only the expected estimation failures or (b) at least log the exception so systematic tokenizer issues don't go unnoticed.
🧰 Tools
🪛 Ruff (0.14.5)
161-161: Do not catch blind exception: Exception
(BLE001)
177-177: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In src/utils/summarizer.py around lines 149 to 180, the two base-prompt token
estimators swallow all exceptions and silently return 200; update the except
blocks to at least log the caught exception (or narrow to a specific
tokenizer/estimation error if available). Add a module-level logger (import
logging; logger = logging.getLogger(__name__)), then change each except to
except Exception as e: logger.exception("Failed to estimate base prompt tokens,
returning fallback"); return 200 (or replace Exception with the concrete
estimation error type if your tokenizer library exposes one).
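The prompt above amounts to something like this sketch (the helper names and the use of `len` as a stand-in tokenizer are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def estimate_base_prompt_tokens(build_prompt, estimate_tokens, fallback: int = 200) -> int:
    """Estimate template-only tokens, logging (not swallowing) any failure."""
    try:
        return estimate_tokens(build_prompt())
    except Exception:
        # Keep the conservative fallback, but make the failure visible so
        # systematic tokenizer issues don't go unnoticed.
        logger.exception("Failed to estimate base prompt tokens; returning fallback")
        return fallback

def broken_prompt() -> str:
    raise RuntimeError("tokenizer unavailable")

print(estimate_base_prompt_tokens(broken_prompt, len))       # 200 (logged)
print(estimate_base_prompt_tokens(lambda: "x" * 40, len))    # 40
```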
```python
@conditional_observe(name="Create Short Summary")
async def create_short_summary(
    messages: list[models.Message],
    input_tokens: int,
    previous_summary: str | None = None,
) -> HonchoLLMCallResponse[str]:
    # input_tokens indicates how many tokens the message list + previous summary take up
    # we want to optimize short summaries to be smaller than the actual content being summarized
    # so we ask the agent to produce a word count roughly equal to either the input, or the max
    # size if the input is larger. the word/token ratio is roughly 4:3 so we multiply by 0.75.
    # LLMs *seem* to respond better to getting asked for a word count but should workshop this.
    output_words = int(min(input_tokens, settings.SUMMARY.MAX_TOKENS_SHORT) * 0.75)

    if previous_summary:
        previous_summary_text = previous_summary
    else:
        previous_summary_text = "There is no previous summary -- the messages are the beginning of the conversation."

    prompt = short_summary_prompt(messages, output_words, previous_summary_text)

    return await honcho_llm_call(
        llm_settings=settings.SUMMARY,
        prompt=prompt,
        max_tokens=settings.SUMMARY.MAX_TOKENS_SHORT,
    )
```
🧹 Nitpick | 🔵 Trivial
Double‑check how default “no previous summary” text is accounted in token metrics.
When `previous_summary` is `None`, you inject a fixed sentence into `<previous_summary>…</previous_summary>`, but `previous_summary_tokens` is 0 in that case and `estimate_*_base_prompt_tokens()` was computed with `previous_summary_text=""`. That default text's tokens therefore aren't counted in any of the summary input components. If you care about precise accounting, consider either:
- Treating that default sentence as part of the base prompt (update the estimator to include it), or
- Treating it as "previous_summary" and adding its estimated tokens into `previous_summary_tokens`.
Would you like a small helper to estimate and include those tokens when latest_summary is absent?
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/utils/tokens.py (1)
3-3: Consistent tokenizer change - same verification needed.

The tokenizer has been updated to `o200k_base`, consistent with the change in `src/schemas.py`. This module-level change affects all token estimation throughout the codebase, including deriver and dialectic components mentioned in the PR objectives.

This is the same encoding change as in `src/schemas.py` line 139, so the same verification questions apply here regarding model compatibility and alignment with actual token consumption.

Based on learnings.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- src/schemas.py (1 hunks)
- src/utils/tokens.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:169-183
Timestamp: 2025-09-25T19:53:21.360Z
Learning: In the deriver token estimation, the approach uses empty inputs to get the base prompt template token count via `estimate_base_prompt_tokens()`, then manually adds tokens for actual inputs (peer_card, working_representation, new_turns). This avoids double-counting wrapper overhead since the base prompt already includes template structure.
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 144
File: src/crud.py:1250-1250
Timestamp: 2025-06-26T18:39:54.942Z
Learning: Rajat-Ahuja1997 is comfortable exposing token counts in MessageCreate schema as a public property when it improves code maintainability and avoids type checking errors.
📚 Learning: 2025-09-25T19:52:22.185Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:14-14
Timestamp: 2025-09-25T19:52:22.185Z
Learning: The Honcho project uses a fine-tuned model with a distinct tokenizer that differs from standard tiktoken encodings, so they use cl100k encoding as a consistent approximation for token counting rather than model-specific encodings.
Applied to files:
src/utils/tokens.py
📚 Learning: 2025-09-25T19:50:53.145Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/config.py:223-224
Timestamp: 2025-09-25T19:50:53.145Z
Learning: In src/config.py, the MAX_INPUT_TOKENS field has le=23000 constraint which is intentional due to model limitations, not a configuration restriction. The underlying AI model has a hard cap of 23000 tokens for input.
Applied to files:
src/utils/tokens.py
🧬 Code graph analysis (1)
src/schemas.py (1)
src/embedding_client.py (1)
- `encoding` (389-391)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test-python
VVoruganti
left a comment
Logic all looks good but can we also instrument the llm call that makes a new peer card. Can label it as its own task type in prometheus even though it runs as a part of the representation task.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/utils/tokens.py (1)
3-33: Input-token helper and encoder change look consistent; consider minor hardening
- Using `o200k_base` as the shared tokenizer and reusing `estimate_tokens` here matches the project's decision to use o200k for LLM/message accounting and keep embedding counting separate, so no change needed on that front. Based on learnings.
- `track_input_tokens` correctly fans out per-component counts into `DERIVER_TOKENS_PROCESSED` with labels `task_type`, `token_type`, and `component`, which matches the metric's label schema and NamespacedCounter's implicit `namespace`.
- If there's any chance a caller might pass a negative `token_count` (e.g., from subtractive adjustments), `Counter.inc` will raise; if that's a realistic future risk, you could defensively skip or clamp negatives here, otherwise a brief comment that counts are guaranteed non-negative would document the assumption.
- The Ruff S106 warning on `"token_type"` is just linter noise for this metric label; if it's blocking CI, you'll need a `# noqa: S106` rather than `# nosec B106`, which only affects Bandit.
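The defensive clamp could be as small as the following (an illustrative helper, not the project's code; the component names are made up):

```python
def safe_component_counts(components: dict[str, int]) -> dict[str, int]:
    # Prometheus counters reject negative increments, so clamp them to zero
    # before incrementing; zero increments are harmless no-ops.
    return {name: max(0, count) for name, count in components.items()}

print(safe_component_counts({"peer_card": 120, "adjustment": -15}))
# {'peer_card': 120, 'adjustment': 0}
```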
♻️ Duplicate comments (2)
src/utils/summarizer.py (1)
150-163: Narrow or log the broad exception in base-prompt token estimators.

Both `estimate_short_summary_prompt_tokens` and `estimate_long_summary_prompt_tokens` catch a bare `Exception` and silently fall back to conservative estimates (200). Consider either catching only expected estimation failures or logging the exception so systematic tokenizer issues don't go unnoticed.

src/deriver/deriver.py (1)
227-236: Minor: Use implicit string concatenation in debug log.

The debug log uses explicit `+` for string concatenation, which triggers lint warnings (ISC003/G003). Use implicit concatenation instead by placing the string literals adjacent.

Apply this diff:

```diff
     logger.debug(
-        "Token estimation - Peer card: %d, Working rep: %d, Base prompt: %d, "
-        + "New turns: %d, Session context: %d, Total estimated: %d",
+        "Token estimation - Peer card: %d, Working rep: %d, Base prompt: %d, "
+        "New turns: %d, Session context: %d, Total estimated: %d",
         peer_card_tokens,
         working_rep_tokens,
         prompt_tokens,
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- `src/deriver/deriver.py` (6 hunks)
- `src/deriver/prompts.py` (1 hunks)
- `src/dialectic/chat.py` (6 hunks)
- `src/dialectic/prompts.py` (2 hunks)
- `src/prometheus.py` (2 hunks)
- `src/utils/summarizer.py` (8 hunks)
- `src/utils/tokens.py` (2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:169-183
Timestamp: 2025-09-25T19:53:21.360Z
Learning: In the deriver token estimation, the approach uses empty inputs to get the base prompt template token count via `estimate_base_prompt_tokens()`, then manually adds tokens for actual inputs (peer_card, working_representation, new_turns). This avoids double-counting wrapper overhead since the base prompt already includes template structure.
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 274
File: src/schemas.py:139-139
Timestamp: 2025-11-19T18:39:52.512Z
Learning: In the Honcho project, message token counting uses o200k_base (for gpt-4o-mini LLM) while embedding token counting uses cl100k_base (for text-embedding-3-small/gemini-embedding-001). These do not need to align because they serve different purposes: LLM consumption tracking vs embedding API input validation.
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 144
File: src/crud.py:1250-1250
Timestamp: 2025-06-26T18:39:54.942Z
Learning: Rajat-Ahuja1997 is comfortable exposing token counts in MessageCreate schema as a public property when it improves code maintainability and avoids type checking errors.
📚 Learning: 2025-09-25T19:53:21.360Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:169-183
Timestamp: 2025-09-25T19:53:21.360Z
Learning: In the deriver token estimation, the approach uses empty inputs to get the base prompt template token count via `estimate_base_prompt_tokens()`, then manually adds tokens for actual inputs (peer_card, working_representation, new_turns). This avoids double-counting wrapper overhead since the base prompt already includes template structure.
Applied to files:
- `src/dialectic/chat.py`
- `src/deriver/prompts.py`
- `src/dialectic/prompts.py`
- `src/deriver/deriver.py`
- `src/utils/summarizer.py`
- `src/utils/tokens.py`
📚 Learning: 2025-10-09T20:37:29.932Z
Learnt from: dr-frmr
Repo: plastic-labs/honcho PR: 230
File: src/deriver/deriver.py:439-444
Timestamp: 2025-10-09T20:37:29.932Z
Learning: In src/deriver/deriver.py, the team prefers to use `accumulate_metric` with unit "blob" to log the full peer card content for debugging purposes. The content size is acceptable, and the logging infrastructure is designed to handle blob metrics separately from numeric metrics.
Applied to files:
src/deriver/deriver.py
📚 Learning: 2025-11-19T18:39:52.512Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 274
File: src/schemas.py:139-139
Timestamp: 2025-11-19T18:39:52.512Z
Learning: In the Honcho project, message token counting uses o200k_base (for gpt-4o-mini LLM) while embedding token counting uses cl100k_base (for text-embedding-3-small/gemini-embedding-001). These do not need to align because they serve different purposes: LLM consumption tracking vs embedding API input validation.
Applied to files:
src/utils/tokens.py
📚 Learning: 2025-09-25T19:50:53.145Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/config.py:223-224
Timestamp: 2025-09-25T19:50:53.145Z
Learning: In src/config.py, the MAX_INPUT_TOKENS field has le=23000 constraint which is intentional due to model limitations, not a configuration restriction. The underlying AI model has a hard cap of 23000 tokens for input.
Applied to files:
src/utils/tokens.py
📚 Learning: 2025-09-25T19:52:22.185Z
Learnt from: Rajat-Ahuja1997
Repo: plastic-labs/honcho PR: 218
File: src/deriver/deriver.py:14-14
Timestamp: 2025-09-25T19:52:22.185Z
Learning: The Honcho project uses a fine-tuned model with a distinct tokenizer that differs from standard tiktoken encodings, so they use cl100k encoding as a consistent approximation for token counting rather than model-specific encodings.
Applied to files:
src/utils/tokens.py
🧬 Code graph analysis (6)
src/dialectic/chat.py (3)
src/dialectic/prompts.py (2)
- `dialectic_prompt` (7-144)
- `estimate_dialectic_prompt_tokens` (148-167)

src/utils/tokens.py (1)
- `estimate_tokens` (8-17)

src/prometheus.py (1)
- `labels` (30-34)
src/deriver/prompts.py (2)
src/utils/representation.py (1)
- `Representation` (101-319)

src/utils/tokens.py (1)
- `estimate_tokens` (8-17)
src/dialectic/prompts.py (1)
src/utils/tokens.py (1)
- `estimate_tokens` (8-17)
src/deriver/deriver.py (4)
src/utils/tokens.py (2)
- `estimate_tokens` (8-17)
- `track_input_tokens` (20-33)

src/deriver/prompts.py (3)
- `critical_analysis_prompt` (16-146)
- `estimate_critical_analysis_prompt_tokens` (216-234)
- `estimate_peer_card_prompt_tokens` (238-252)

src/utils/representation.py (2)
- `str_no_timestamps` (77-79)
- `str_no_timestamps` (200-229)

src/prometheus.py (1)
- `labels` (30-34)
src/utils/summarizer.py (6)
src/utils/clients.py (5)
- `HonchoLLMCallResponse` (80-93)
- `honcho_llm_call` (114-130)
- `honcho_llm_call` (134-149)
- `honcho_llm_call` (153-168)
- `honcho_llm_call` (172-324)

src/utils/logging.py (3)
- `conditional_observe` (36-38)
- `conditional_observe` (42-45)
- `conditional_observe` (48-80)

src/utils/tokens.py (2)
- `estimate_tokens` (8-17)
- `track_input_tokens` (20-33)

src/schemas.py (2)
- `Message` (154-168)
- `Summary` (236-250)

src/models.py (1)
- `Message` (204-277)

src/prometheus.py (1)
- `labels` (30-34)
src/utils/tokens.py (1)
src/prometheus.py (1)
- `labels` (30-34)
🪛 Ruff (0.14.5)
src/dialectic/chat.py
104-104: Possible hardcoded password assigned to argument: "token_type"
(S106)
108-108: Possible hardcoded password assigned to argument: "token_type"
(S106)
182-182: Possible hardcoded password assigned to argument: "token_type"
(S106)
186-186: Missing return type annotation for private function log_streaming_response
(ANN202)
191-191: Possible hardcoded password assigned to argument: "token_type"
(S106)
src/deriver/prompts.py
232-232: Do not catch blind exception: Exception
(BLE001)
250-250: Do not catch blind exception: Exception
(BLE001)
src/dialectic/prompts.py
165-165: Do not catch blind exception: Exception
(BLE001)
src/deriver/deriver.py
73-73: Possible hardcoded password assigned to argument: "token_type"
(S106)
119-119: Possible hardcoded password assigned to argument: "token_type"
(S106)
228-229: Explicitly concatenated string should be implicitly concatenated
Remove redundant '+' operator to implicitly concatenate
(ISC003)
228-229: Logging statement uses +
(G003)
src/utils/summarizer.py
161-161: Do not catch blind exception: Exception
(BLE001)
177-177: Do not catch blind exception: Exception
(BLE001)
394-394: Possible hardcoded password assigned to argument: "token_type"
(S106)
src/utils/tokens.py
31-31: Possible hardcoded password assigned to argument: "token_type"
(S106)
🔇 Additional comments (14)
src/dialectic/chat.py (2)
139-195: Streaming dialectic token metrics wiring looks correct and consistent

The streaming path now mirrors the non-streaming estimation (`estimate_dialectic_prompt_tokens` + `estimate_tokens` on dynamic inputs) and increments input tokens once before streaming, which is consistent with how you handle deriver token accounting. The wrapper that checks `if chunk.is_done and chunk.output_tokens is not None` before incrementing output tokens avoids the earlier zero/None edge case, and the TODO about Groq documents the current provider gap without affecting behavior.

No further changes needed here.
63-109: The review comment incorrectly assumes `honcho_llm_call()` can return a response with `output_tokens=None`. Examining the actual implementation confirms the opposite: all non-streaming code paths in `honcho_llm_call()` return `output_tokens` as an `int`, either the real token count or 0 as a fallback. The type annotation correctly declares `output_tokens: int` (non-optional) for `HonchoLLMCallResponse`. Only streaming chunks, typed separately as `HonchoLLMCallStreamChunk`, have optional `output_tokens` (`int | None`), and the streaming path in `dialectic_stream` already guards correctly. The non-streaming path in `dialectic_call` (line 109) is safe and follows the same pattern used in `deriver.py`.

Likely an incorrect or invalid review comment.
src/prometheus.py (1)
84-126: Remove unfounded breaking-change concern; label schemas are correct

All calls to `DERIVER_TOKENS_PROCESSED` correctly include the `component` label (`component="total"` in deriver.py and summarizer.py, dynamic in tokens.py), and `DIALECTIC_TOKENS_PROCESSED` calls correctly provide only `token_type`. No evidence of a metric rename exists in the codebase; this appears to be a new metric addition, not a rename of an existing one. The operational warning about breaking dashboards/alerts applies only to actual renames; remove that concern from the review.

Likely an incorrect or invalid review comment.
src/utils/summarizer.py (6)
84-113: LGTM! Well-structured prompt builder. The `short_summary_prompt` function is clear, properly parameterized, and uses `cleandoc` for clean formatting.
116-147: LGTM! Long summary prompt appropriately comprehensive. The `long_summary_prompt` includes additional contextual factors (emotional state, personality traits, themes) that are well-suited for thorough summaries.
182-207: LGTM! Summary creation logic is sound. The `create_short_summary` function correctly handles the case where no previous summary exists and appropriately calculates output word targets based on input tokens.

Note: A past review flagged that the default "no previous summary" text at line 198 isn't counted in token estimates. This is a known concern that applies to both short and long summary functions.
209-229: LGTM! Consistent with short summary approach. The `create_long_summary` function follows the same pattern as `create_short_summary` with appropriate adjustments for long-form summaries.
366-396: Excellent fallback handling for token metrics! The conditional token tracking (`if not is_fallback`) correctly prevents recording metrics when summaries fall back to the hardcoded 50-token placeholder. This addresses the concern from previous reviews about misleading metrics during fallback scenarios.

The granular component tracking (prompt, messages, previous_summary) provides good observability for token usage analysis.
Based on learnings about the project's token tracking approach.
427-486: LGTM! Return type change enables proper fallback detection. Returning `tuple[Summary, bool]` allows the caller to distinguish between real LLM-generated summaries and fallback placeholders, which is essential for accurate token metrics. The `is_fallback` flag is correctly set only in the exception path.

src/deriver/deriver.py (5)
71-75: LGTM! Output-only tracking at call site is appropriate. Moving input token tracking to the batch level (where all components are available) while keeping output tracking at the call site (where `response.output_tokens` is available) is a sensible separation of concerns.
106-122: Excellent granular token tracking for peer card generation. The component breakdown (prompt, old_peer_card, new_observations) provides good observability. The approach follows the established pattern of using cached base prompt estimates plus actual content token counts.
Based on learnings about the deriver token estimation approach.
187-225: LGTM! Comprehensive token estimation across all input components. The token estimation logic correctly accounts for:
- Peer card tokens
- Working representation tokens
- Base prompt tokens (using cached estimator)
- New conversation turns tokens
- Session context tokens (added after retrieval)
The two-phase calculation (initial estimate + session context) is appropriate since session context depends on available token budget.
Based on learnings about the deriver token estimation approach.
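The two-phase calculation can be sketched as follows (the component numbers are hypothetical; `MAX_INPUT_TOKENS = 23000` reflects the hard model cap noted in src/config.py):

```python
def estimate_initial_tokens(peer_card: int, working_rep: int, prompt: int, new_turns: int) -> int:
    # Phase 1: sum everything known before session context is retrieved.
    return peer_card + working_rep + prompt + new_turns

MAX_INPUT_TOKENS = 23000  # hard model input cap

initial = estimate_initial_tokens(peer_card=80, working_rep=1200, prompt=500, new_turns=300)
budget = MAX_INPUT_TOKENS - initial  # remaining room for session context

# Phase 2: session context is fetched within the remaining budget,
# then added to produce the total estimate fed to the metrics.
session_context = min(4000, budget)  # hypothetical retrieved size
total = initial + session_context
print(total)  # prints 6080
```

Phase 2 cannot run first because the amount of session context to fetch depends on how much budget the fixed components leave behind.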
239-248: Excellent granular component breakdown for representation task metrics. The component tracking provides the detailed breakdown requested in earlier discussions (peer_card, working_representation, prompt, new_turns, session_context), enabling precise analysis of token usage patterns.
340-377: LGTM! Removal of `estimated_input_tokens` parameter simplifies the interface. The refactoring to track input tokens at the batch level (where all components are known) rather than passing them through the call chain is cleaner. The `critical_analysis_call` can focus on its core responsibility of generating representations.
```diff
 @cache
-def estimate_base_prompt_tokens() -> int:
-    """Estimate base prompt tokens by calling critical_analysis_prompt with empty values.
+def estimate_critical_analysis_prompt_tokens() -> int:
+    """Estimate critical analysis prompt tokens by calling critical_analysis_prompt with empty values.

     This value is cached since it only changes on redeploys when the prompt template changes.
     """

     try:
-        base_prompt = critical_analysis_prompt(
+        prompt = critical_analysis_prompt(
             peer_id="",
             peer_card=None,
             message_created_at=datetime.datetime.now(datetime.timezone.utc),
             working_representation=Representation(),
             history="",
             new_turns=[],
         )
-        return estimate_tokens(base_prompt)
+        return estimate_tokens(prompt)
     except Exception:
         # Return a conservative estimate if estimation fails
         return 500


+@cache
+def estimate_peer_card_prompt_tokens() -> int:
+    """Estimate peer card prompt tokens by calling peer_card_prompt with empty values.
+
+    This value is cached since it only changes on redeploys when the prompt template changes.
+    """
+
+    try:
+        prompt = peer_card_prompt(
+            old_peer_card=None,
+            new_observations="",
+        )
+        return estimate_tokens(prompt)
+    except Exception:
+        # Return a conservative estimate if estimation fails
+        return 400
```
🧹 Nitpick | 🔵 Trivial
New deriver prompt estimators match the base‑prompt pattern; maybe log or annotate broad exceptions
The two estimators follow the same “call prompt with empty values → estimate_tokens → cache” approach you already use elsewhere, so they should avoid double‑counting template overhead and stay stable across redeploys. Based on learnings.
Given these are heuristics, catching Exception and returning conservative fallbacks (500/400) is reasonable, but it does hide any real failures in prompt construction or tokenization. Two small tweaks you might consider:
- Log the exception at debug/info once before returning the fallback so template issues don’t go completely silent.
- If Ruff's BLE001 is enforced, either narrow the exception type or explicitly mark this as intentional with `# noqa: BLE001`.
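The logging tweak could look roughly like this (the logger usage and chars-per-token heuristic are illustrative; the real code uses `estimate_tokens` and the actual prompt builders):

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

def estimate_prompt_tokens_safe(build_prompt: Callable[[], str], fallback: int = 500) -> int:
    """Best-effort estimator: fall back conservatively on any failure, but
    leave a debug trail so systematic template/tokenizer breakage is visible."""
    try:
        prompt = build_prompt()
        return len(prompt) // 4  # crude stand-in for estimate_tokens
    except Exception:  # noqa: BLE001 - intentional broad catch for a heuristic
        logger.debug("prompt token estimation failed", exc_info=True)
        return fallback

print(estimate_prompt_tokens_safe(lambda: "x" * 400))  # prints 100
print(estimate_prompt_tokens_safe(lambda: 1 / 0))  # logs at debug, prints 500
```

Logging at debug keeps the fallback silent in normal operation while still surfacing repeated failures when verbose logging is enabled.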
🧰 Tools
🪛 Ruff (0.14.5)
232-232: Do not catch blind exception: Exception
(BLE001)
250-250: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In src/deriver/prompts.py around lines 215 to 252, the two estimator functions
swallow all Exceptions and return conservative defaults without any logging or
lint-safety; modify the except blocks to catch Exception as e and log the error
at debug (e.g., process_logger.debug or logging.getLogger(__name__).debug)
including the exception info before returning the fallback, and if Ruff’s BLE001
would complain about broad excepts either narrow the exception type if possible
or keep Exception but append a comment like "# noqa: BLE001" to indicate
intentional broad handling.
```python
@cache
def estimate_dialectic_prompt_tokens() -> int:
    """Estimate base dialectic prompt tokens by calling dialectic_prompt with empty values.

    This value is cached since it only changes on redeploys when the prompt template changes.
    """
    try:
        prompt = dialectic_prompt(
            query="",
            working_representation="",
            recent_conversation_history=None,
            observer_peer_card=None,
            observed_peer_card=None,
            observer="",
            observed="",
        )

        return estimate_tokens(prompt)
    except Exception:
        # Return a conservative estimate if estimation fails
        return 750
```
🧹 Nitpick | 🔵 Trivial
Cached base dialectic prompt estimator fits the overall token-accounting pattern
estimate_dialectic_prompt_tokens() cleanly centralizes the “empty-input prompt → estimate_tokens → cached” pattern, which lines up with how the deriver handles base prompt overhead and keeps dialectic token estimates consistent between call/stream paths. The 750 fallback also matches the existing hardcoded system‑prompt allowance in chat(), so behavior stays conservative even on failure. Based on learnings.
As with the deriver estimators, the blind except Exception is acceptable for a best‑effort heuristic, but if Ruff’s BLE001 is enforced you may want to either:
- Log the exception once (e.g., at debug) before returning 750, or
- Add an explicit `# noqa: BLE001` to mark the broad catch as intentional.
🧰 Tools
🪛 Ruff (0.14.5)
165-165: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In src/dialectic/prompts.py around lines 147 to 167 the function uses a bare
except Exception which can trigger Ruff BLE001; fix by either (A) logging the
caught exception once (e.g., import a module logger and call
logger.debug("estimate_dialectic_prompt_tokens failed", exc_info=True) before
returning 750) or (B) explicitly mark the broad catch as intentional by
appending "# noqa: BLE001" to the except line; if you choose logging ensure the
logger is imported/available in the module.
Summary by CodeRabbit
New Features
Improvements
Other