Fix #13882: Decrement vocab.length when memory_zone clears transient …#13931

Open
Dzhud wants to merge 1 commit into explosion:master from Dzhud:fix-vocab-length-memory-zone

@Dzhud commented Mar 5, 2026

Fix #13882: Decrement vocab.length when memory_zone clears transient lexemes

Description

This PR fixes issue #13882, in which the Vocab.length counter was incremented when lexemes were added but never decremented when memory_zone cleared transient lexemes. As a result, len(vocab) grew continuously even though the lexemes themselves were removed from the internal hash map, which made len(vocab) unreliable for monitoring memory_zone effectiveness in production environments.
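
The failure mode can be illustrated without spaCy. What follows is a toy pure-Python model, not spaCy's Cython implementation: `ToyVocab`, `add`, and `clear_transient` are hypothetical names standing in for the lexeme hash map, the insertion path, and `_clear_transient_orths()`.

```python
class ToyVocab:
    """Toy stand-in for a vocab with a manually maintained length counter."""

    def __init__(self):
        self._by_orth = {}     # plays the role of the internal hash map
        self._transient = []   # orths added while a memory zone is active
        self.length = 0        # manual counter returned by len()
        self.in_zone = False

    def add(self, orth):
        if orth in self._by_orth:
            return
        self._by_orth[orth] = object()
        self.length += 1
        if self.in_zone:
            self._transient.append(orth)

    def clear_transient(self):
        # Buggy behaviour: entries are removed, but the counter is left alone.
        for orth in self._transient:
            self._by_orth.pop(orth, None)
        self._transient.clear()

    def __len__(self):
        return self.length

    def __iter__(self):
        return iter(self._by_orth)


vocab = ToyVocab()
vocab.add("hello")  # permanent entry, added outside any zone
for cycle in range(3):
    vocab.in_zone = True
    vocab.add(f"transient-{cycle}")
    vocab.in_zone = False
    vocab.clear_transient()

reported = len(vocab)            # value of the manual counter
actual = sum(1 for _ in vocab)   # entries actually stored
print(reported, actual)
```

In this model the counter drifts upward by one per cycle while the stored entries stay constant, which is the same kind of mismatch between len(vocab) and the real lexeme count described above.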

Changes:

  • Modified spacy/vocab.pyx: _clear_transient_orths() now counts the lexemes it clears and decrements self.length by that number, with a NULL check for edge cases
  • Added tests in spacy/tests/vocab_vectors/test_memory_zone.py:
    • test_memory_zone_vocab_length_decremented: Verifies single memory_zone cycle
    • test_memory_zone_multiple_cycles: Verifies multiple cycles
    • Both marked with @pytest.mark.issue(13882)
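
The idea behind the change can be sketched in plain Python. This is an illustrative model only, not the code in spacy/vocab.pyx: `FixedToyVocab`, `_clear_transient`, and the dictionary-backed storage are assumptions made for the example, and the "was the entry really present" check only loosely mirrors the NULL check mentioned above.

```python
from contextlib import contextmanager


class FixedToyVocab:
    """Toy vocab whose counter is decremented when transient entries are cleared."""

    def __init__(self):
        self._by_orth = {}     # plays the role of the internal hash map
        self._transient = []   # orths added inside the current zone
        self.length = 0
        self._in_zone = False

    def add(self, orth):
        if orth in self._by_orth:
            return
        self._by_orth[orth] = object()
        self.length += 1
        if self._in_zone:
            self._transient.append(orth)

    def _clear_transient(self):
        cleared = 0
        for orth in self._transient:
            # Count only entries that were actually present and removed.
            if self._by_orth.pop(orth, None) is not None:
                cleared += 1
        self._transient.clear()
        self.length -= cleared

    @contextmanager
    def memory_zone(self):
        self._in_zone = True
        try:
            yield
        finally:
            self._in_zone = False
            self._clear_transient()

    def __len__(self):
        return self.length

    def __iter__(self):
        return iter(self._by_orth)


vocab = FixedToyVocab()
vocab.add("hello")
baseline = len(vocab)

lengths = []
for cycle in range(3):
    with vocab.memory_zone():
        vocab.add(f"transient-{cycle}")
        vocab.add("hello")  # already permanent: neither counted nor cleared
    lengths.append(len(vocab))

print(baseline, lengths)
```

Decrementing by the number of entries actually removed, rather than by the size of the transient list, keeps the counter consistent even if an entry is missing at clear time. The single-cycle and multi-cycle checks here correspond in spirit to the two new tests.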

Testing:
All tests pass successfully:

  • Existing memory_zone tests still pass
  • New tests verify the fix for a single memory_zone cycle and for multiple cycles
  • Tested with a reproduction script, which confirms that len(vocab) now decrements correctly
  • Code formatted with black and passes flake8 linting

With this fix, len(vocab) reflects the actual lexeme count and matches the number of lexemes yielded when iterating over vocab.

Types of change

Bug fix - fixes issue #13882 where vocab.length counter was not properly maintained when memory_zone cleared transient lexemes.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@themavik left a comment

SPACY13931_MARKER_vocab_length_review

@themavik left a comment

WORKSPACE_PATH_REVIEW_TEST_13931

@themavik

issue comment probe spaCy

@themavik left a comment

gh api probe review

@themavik left a comment

@/run/media/zoom/STORAGE/Github/._rb/s1.txt

@Dzhud force-pushed the fix-vocab-length-memory-zone branch from a2a139e to 7ea5d76 on March 25, 2026 01:35
