
I have a set of about 100M paragraph-sized strings (multilingual) I am extracting embeddings for, but the memory usage keeps increasing until I start overflowing into disk swap:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B",
                            tokenizer_kwargs={"padding_side": "left"})

for samples_page in my_paginated_samples_loader:
    embeddings = model.encode(samples_page)
    my_paginated_writer.write(embeddings, disk_destination)

I have tested my sample loader for memory leaks and it does not seem to be the issue, so it has to be something in Sentence Transformers.

Is the issue something obvious, like the way I am running it being wrong? Or is there a specific way I should troubleshoot this?

  • Setting a max_seq_length is helping; it seems the issue is related to the dataset being multilingual (lots of unique tokens): github.com/huggingface/sentence-transformers/issues/1795 Commented Oct 29 at 23:09
  • With the code snippet you have shared, it is no surprise that memory keeps increasing, since you keep everything in memory (the embeddings list object). Can you please clarify that? Commented Nov 2 at 9:40
  • My bad, the snippet was from my debug attempt -- fixed now. In the actual run that reproduces the problem, embeddings are not held in memory but written to disk for each page. Commented Nov 3 at 18:57

1 Answer


Debugging: tracking memory usage

import gc
import os

import psutil
import torch

process = psutil.Process(os.getpid())
embeddings = []  # accumulates in memory here; swap in your disk writer when instrumenting the real loop

with torch.no_grad():
    for i, samples_page in enumerate(my_paginated_samples_loader):
        batch_embeddings = model.encode(samples_page, convert_to_numpy=True)
        embeddings.extend(batch_embeddings)

        if i % 10 == 0:  # Log every 10 batches
            mem_mb = process.memory_info().rss / 1024 / 1024
            print(f"Batch {i}: Memory usage: {mem_mb:.2f} MB")
            gc.collect()  # Force garbage collection

Try disabling gradient computation:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B",
                            tokenizer_kwargs={"padding_side": "left"})
model.max_seq_length = 512  # or 256, 384 depending on your needs
model.eval()  # Set to eval mode

embeddings = []

with torch.no_grad():  # Critical: disable gradient tracking
    for samples_page in my_paginated_samples_loader:
        # Convert to numpy immediately and skip the tqdm progress bar
        batch_embeddings = model.encode(
            samples_page,
            batch_size=16,
            convert_to_numpy=True,
            show_progress_bar=False,
        )
        embeddings.extend(batch_embeddings)

        # Optional: clear the CUDA cache periodically
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Incrementally write embeddings to disk

import numpy as np
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B",
                            tokenizer_kwargs={"padding_side": "left"})
model.eval()

# Query the model for its output dimension (1024 for Qwen3-Embedding-0.6B)
embedding_dim = model.get_sentence_embedding_dimension()
mmap_file = np.memmap('embeddings.dat', dtype='float32', mode='w+',
                      shape=(100_000_000, embedding_dim))

offset = 0
with torch.no_grad():
    for samples_page in my_paginated_samples_loader:
        batch_embeddings = model.encode(samples_page, convert_to_numpy=True)
        batch_size = len(batch_embeddings)
        mmap_file[offset:offset + batch_size] = batch_embeddings
        offset += batch_size

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
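
If you go the memmap route, here is a minimal follow-up sketch, continuing directly from the snippet above (same hypothetical 'embeddings.dat' file name and shape), for flushing the buffer and reopening it read-only for downstream use:

import numpy as np

# Persist pending writes and release the writable map
mmap_file.flush()
del mmap_file

# Reopen read-only; dtype and shape must match what was written above
embeddings_ro = np.memmap('embeddings.dat', dtype='float32', mode='r',
                          shape=(100_000_000, embedding_dim))
print(embeddings_ro[0, :8])  # peek at the first row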

5 Comments

Ah, definitely forgot the no_grad(); sadly I still have the memory leak
How about writing the embeddings incrementally to disk?
Doing that already ~ Also, I hit OOM very early on
The encode() method has an internal batch_size that might be too large (try a smaller value). Otherwise, try tracking memory usage...
I was already setting a small page size in my loader (equivalent to the batch size)... But the one thing that helped is the discussion link I pasted under the post. It seems the tokenizer blows up memory usage when there are lots of unique tokens
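
For completeness, the fix described in the last comment amounts to capping the model's maximum sequence length so that padded token batches stay bounded no matter how long individual paragraphs are. A minimal sketch, with 256 as an example value only:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B",
                            tokenizer_kwargs={"padding_side": "left"})

# Example cap; tokens beyond this limit are truncated, trading a bit of
# context for a bounded per-batch memory footprint
model.max_seq_length = 256

Whether a value like 256 is acceptable depends on how much of each paragraph you are willing to truncate.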
