I am trying to build a Gradio chatbot in Hugging Face Spaces using the Mistral-7B-v0.1 model. Because the model is large, I have to quantize it; otherwise the free 50 GB of storage fills up. I am using bitsandbytes for this, but I get an ImportError.
This is the HF Space URL: https://huggingface.co/spaces/AnishHF/Mistral-7B
Traceback:
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
Traceback (most recent call last):
  File "/home/user/app/app.py", line 15, in <module>
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", quantization_config=quantization_config, device_map="auto", token=access_token)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3165, in from_pretrained
    hf_quantizer.validate_environment(
  File "/usr/local/lib/python3.10/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 62, in validate_environment
    raise ImportError(
ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`
Note: I am using the free CPU hardware with 16 GB of RAM, so there is no GPU and torch is not built with CUDA support.
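The missing GPU can be confirmed from inside the Space with a check like the one below. This is only a minimal sketch: `cuda_available` is a hypothetical helper name chosen for illustration, and it deliberately returns False when torch is not installed at all rather than raising.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if torch is installed and reports a usable CUDA device."""
    # Hypothetical helper: avoid an ImportError when torch is absent.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return bool(torch.cuda.is_available())


print("CUDA available:", cuda_available())
```

On CPU-only hardware this is expected to print False, which matches the bitsandbytes warning at the top of the traceback.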
I have added both accelerate and bitsandbytes to requirements.txt (huggingface.co/spaces/AnishHF/Mistral-7B/blob/main/requirements.txt).
I have also tried pinning bitsandbytes to bitsandbytes==0.43.1 (which I think is the latest version), but that didn't solve the problem.
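To rule out a stale or missing install, the versions actually present in the running environment can be printed at startup. The sketch below uses only the standard library; `installed_versions` and the `PACKAGES` tuple are hypothetical names introduced here for illustration, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages involved in loading the quantized model (assumed list).
PACKAGES = ("torch", "transformers", "accelerate", "bitsandbytes")


def installed_versions(names=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

If accelerate and bitsandbytes both show up with the expected versions, the ImportError is probably not caused by a missing package, despite what its message suggests.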
Below is the full code (app.py):
import os
import bitsandbytes as bnb
import torch
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

access_token = os.environ["GATED_ACCESS_TOKEN"]

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", quantization_config=quantization_config, device_map="auto", token=access_token)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Function to generate text using the model
def generate_text(prompt):
    text = prompt
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface
iface = gr.Interface(
    fn=generate_text,
    inputs=[
        gr.inputs.Textbox(lines=5, label="Input Prompt"),
    ],
    outputs=gr.outputs.Textbox(label="Generated Text"),
    title="MisTRALText Generation",
    description="Use this interface to generate text using the MisTRAL language model.",
)

# Launch the Gradio interface
iface.launch()
Edit: I tried running the same code locally on a Raspberry Pi and got the same error, so I don't think the problem is with Hugging Face Spaces; it is more likely in the library or in my code.
Any solution, including another way to perform 4-bit quantization without bitsandbytes, would help.