
I'm deploying a FastAPI backend using HuggingFace Transformers with the mistralai/Mistral-7B-Instruct-v0.1 model, quantized to 4-bit using BitsAndBytesConfig. I’m running this inside an NVIDIA GPU container (CUDA 12.1, A10G GPU with 22GB VRAM), and I keep hitting this error during model loading:

ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. 
Please use the model as it is...

What I’ve Done So Far:

  • I'm not calling .to(...) anywhere; I explicitly removed all such lines. ✅

  • I'm using quantization_config=BitsAndBytesConfig(...) with load_in_4bit=True. ✅

  • I removed device_map="auto", as per the transformers GitHub issue. ✅

  • I'm calling .cuda() only once on the model, after .from_pretrained(...), as suggested. ✅

  • Model and tokenizer are loaded from the Hugging Face Hub with HF_TOKEN properly set. ✅

  • The system detects CUDA correctly: torch.cuda.is_available() is True.

And last, I cleared the Hugging Face cache (~/.cache/huggingface) and re-ran everything. ✅

Here’s the relevant part of the code that triggers the error:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quant_config,
    device_map=None,  # I explicitly removed this
    token=hf_token
).cuda()  # This is the only use of `.cuda()`

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)

Yet I still get the same ValueError.

1 Answer

The ValueError persists because calling .cuda() on a 4-bit quantized model is not allowed with the bitsandbytes integration in Hugging Face Transformers. When you pass load_in_4bit=True in BitsAndBytesConfig, the model is placed on the GPU (if one is available) during loading, and any subsequent call to .cuda() or .to() is unsupported and raises exactly this error.

Since you've already removed device_map="auto" and confirmed CUDA is detected (torch.cuda.is_available() == True), the issue lies in the .cuda() call after from_pretrained(). For 4-bit models, you should avoid manually moving the model to the GPU since BitsAndBytes handles this internally.

Update your code to remove the .cuda() call entirely, and the error should go away.
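A minimal corrected loading sketch (assuming the same quant_config and hf_token from your question); the only change is dropping the trailing .cuda():

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)

# No .cuda() and no .to(): bitsandbytes places the 4-bit weights on the
# current CUDA device during from_pretrained() itself.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quant_config,
    token=hf_token,  # assumes hf_token is set as in your snippet
)

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", token=hf_token
)

# Verify placement by inspection rather than by moving the model:
print(model.device)  # should report a CUDA device when one is available
```

If you later want multi-GPU sharding or CPU offload, pass device_map="auto" back into from_pretrained() rather than calling .to() or .cuda(); bitsandbytes and accelerate coordinate placement through that argument.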
