feat(skills): smart ranking, usage tracking, and lifecycle management…#4406

Open
fathah wants to merge 3 commits into NousResearch:main from fathah:skills-overflow-fix

Conversation


@fathah fathah commented Apr 1, 2026

What does this PR do?

Skills in the system prompt are now ranked by usage frequency + keyword relevance to the user's message, replacing the alphabetical dump that buried the right skills.

Also adds usage tracking, opt-in token budgets, auto-archival of stale skills, and CLI commands to manage skill health.

Problem

Every skill is injected into the system prompt alphabetically with no limits. With 98 skills, ml-paper-writing sits at position 86 and systematic-debugging at 95. The LLM scans through dozens of irrelevant skills before finding the one that matches — or gives up and improvises.

The system prompt is immune to context compression, so this gets worse over time as skills accumulate.

How it works

  1. Usage tracking — skill_usage table (schema v7) records every view, invoke, and slash command. Scored with recency-weighted frequency in a single SQL query.
  2. Keyword relevance — Jaccard similarity between user message and skill metadata (name, description, tags), expanded with suffix stemming and a domain synonym map (tweet -> twitter, bug -> debug).
  3. Normalized merge — both signals normalized to 0-1 before combining. Relevance weighted 3x so query-relevant skills beat daily-driver habits.
  4. Flat output — when scores are active, skills listed in score order instead of grouped by category.
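Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names, the stemmer's suffix list, and the synonym entries (taken from the two examples in the description) are all assumptions.

```python
import re

# Illustrative stand-ins for the PR's synonym map and suffix stemmer.
SYNONYMS = {"tweet": "twitter", "bug": "debug"}
SUFFIXES = ("ing", "ed", "es", "s")

def normalize(token: str) -> str:
    """Strip a common suffix, then apply the domain synonym map."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            token = token[: -len(suf)]
            break
    return SYNONYMS.get(token, token)

def tokens(text: str) -> set[str]:
    return {normalize(t) for t in re.findall(r"[a-z0-9]+", text.lower())}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def rank(skills: dict[str, str], usage_scores: dict[str, float],
         user_message: str, relevance_weight: float = 3.0) -> list[str]:
    """Merge usage and relevance signals, each normalized to 0-1,
    with relevance weighted 3x as described above."""
    msg = tokens(user_message)
    max_usage = max(usage_scores.values(), default=0) or 1
    scored = []
    for name, metadata in skills.items():
        usage = usage_scores.get(name, 0) / max_usage
        relevance = jaccard(msg, tokens(metadata))
        scored.append((usage + relevance_weight * relevance, name))
    return [name for _, name in sorted(scored, reverse=True)]
```

With this weighting, a skill that matches the query can outrank a heavily used but irrelevant one: a "post a tweet" message ranks a hypothetical twitter skill above a daily-driver skill with all the usage history.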

Related Issue

#4356 #4379 #4319 #4391 #4404

Type of Change

  • ✨ New feature (non-breaking change that adds functionality)
  • ✅ Tests (adding or improving test coverage)

Changes Made

  • agent/prompt_builder.py — keyword relevance scoring, suffix stemmer, synonym map, token budget, normalized merge, flat ranked output
  • hermes_state.py — schema v7 migration with skill_usage table, ranking/stats/last-used queries, self-cleaning purge
  • tools/skill_manager_tool.py — archive/restore, bundled skill detection, dedup check on create, find_archivable_skills()
  • tools/skills_tool.py — usage tracking on skill_view, .archive exclusion, include_archived param, archive fallback with restore hint
  • agent/skill_commands.py — usage tracking on slash command invocations
  • agent/skill_utils.py — .archive added to EXCLUDED_SKILL_DIRS
  • hermes_cli/config.py — skills config block (token_budget, max_prompt_skills, pinned_skills, auto_archive_days)
  • hermes_cli/main.py — argparse for stats/archive/restore/prune subcommands
  • hermes_cli/skills_config.py — CLI implementations for stats, archive, restore, prune
  • run_agent.py — loads skills config, computes usage scores, passes user_message to prompt builder, background auto-archive
  • tests/test_skills_overflow.py — 47 tests covering all new features

All config defaults preserve existing behavior (0 = unlimited/disabled). No breaking changes.
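The "recency-weighted frequency in a single SQL query" idea from the hermes_state.py change can be sketched like this. The skill_usage columns, the 30-day half-life, and the function names are assumptions for illustration, not the actual schema v7 migration.

```python
import sqlite3
import time

HALF_LIFE_DAYS = 30.0  # assumed decay constant, not from the PR

SCHEMA = """
CREATE TABLE IF NOT EXISTS skill_usage (
    skill_name TEXT NOT NULL,
    event      TEXT NOT NULL,   -- e.g. 'view' | 'invoke' | 'slash'
    ts         REAL NOT NULL    -- unix timestamp
)
"""

def record(conn, skill_name, event, ts=None):
    conn.execute("INSERT INTO skill_usage VALUES (?, ?, ?)",
                 (skill_name, event, ts or time.time()))

def usage_scores(conn, now=None):
    """Sum exponentially decayed event weights per skill in one query:
    an event from today counts ~1.0, one from 30 days ago ~0.5."""
    now = now or time.time()
    # SQLite's pow() builtin is not guaranteed to be compiled in,
    # so register the decay curve as a user-defined function.
    conn.create_function(
        "decay", 1,
        lambda age_s: 0.5 ** (age_s / 86400.0 / HALF_LIFE_DAYS))
    rows = conn.execute(
        "SELECT skill_name, SUM(decay(? - ts)) FROM skill_usage "
        "GROUP BY skill_name", (now,)).fetchall()
    return dict(rows)
```

Decaying per event (rather than keeping a raw count) means a skill used twice yesterday outscores one used many times three months ago, which is what lets prune and auto-archival share the same signal.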

How to Test

  1. pytest tests/test_skills_overflow.py -v — 47 tests, all pass
  2. pytest tests/ -k skill -q — full skill test suite, 0 new regressions
  3. Start hermes with default config — all skills appear as before
  4. Set skills.token_budget: 4000 — skills section capped, footer shows omitted count
  5. hermes skills stats — shows usage data after interacting with skills
  6. hermes skills archive <name> then hermes skills restore <name>
  7. hermes skills prune --days 90 — lists unused skills, prompts for confirmation
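Step 4 assumes a skills block in the hermes config. A hypothetical sketch of the full block, assuming a YAML config file (the key names come from the hermes_cli/config.py change above; the values and comments are illustrative):

```yaml
skills:
  token_budget: 4000     # cap on skills-section tokens; 0 = unlimited (default)
  max_prompt_skills: 0   # max skills injected into the prompt; 0 = no cap (default)
  pinned_skills: []      # skills that survive budget cuts
  auto_archive_days: 0   # archive skills unused this long; 0 = disabled (default)
```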

Benchmark (98 real skills)

| Query | Before | After |
| --- | --- | --- |
| "write a research paper for NeurIPS" | ml-paper-writing 86, arxiv 82 | 2, 3 |
| "set up a vector database for RAG" | qdrant 73, pinecone 72, chroma 70 | 5, 7, 8 |
| "post a tweet about my project" | xitter 90 | 2 |
| "debug my python code that crashes" | systematic-debugging 95 | 9 |
| "find a restaurant nearby" | find-nearby 27 | 1 |

Right skill in top 20: 29% -> 93%

End-to-end with gemma-3-4b: LLM picked the correct skill 6/6 vs 4/6 on alphabetical ordering.

fathah added 3 commits April 1, 2026 10:05
… Skills in the system prompt are now ranked by a combination of usage frequency and keyword relevance to the user's message, replacing the previous alphabetical dump. Adds a skill_usage table (schema v7) that tracks views, invocations, and management actions — feeding a normalized scoring system that surfaces the right skill for the task.

New capabilities:
- Token budget and max_prompt_skills caps (opt-in, defaults unchanged)
- Pinned skills that survive budget cuts
- Suffix stemming and domain synonym expansion for keyword matching
- Auto-archival of stale skills (background thread, opt-in)
- CLI: hermes skills stats/archive/restore/prune
- Deduplication warnings on skill creation
- Archived skills discoverable via skills_list(include_archived=True)

Benchmark on 98 real skills: correct skill in top 20 improved from 29% to 93%. Verified end-to-end with LLM picking the right skill 6/6 vs 4/6 on alphabetical ordering.