All Questions
2 questions
-1 votes · 1 answer · 2k views
I load a float32 Hugging Face model, cast it to float16, and save it. How can I load it as float16?
Example:
# pip install transformers
from transformers import ...
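The cast-save-reload round trip the question asks about can be sketched as follows. This is a minimal sketch, not the asker's actual code: it builds a tiny randomly initialized BERT from a config so it runs without downloading a checkpoint, and the config sizes are arbitrary assumptions. The key parts are `model.half()` before saving and `torch_dtype=torch.float16` when reloading, so the checkpoint is not upcast back to the default float32.

```python
import tempfile

import torch
from transformers import AutoModel, BertConfig

# Tiny randomly initialized model so the sketch runs without a download;
# in practice you would start from AutoModel.from_pretrained("your-checkpoint").
config = BertConfig(hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
model = AutoModel.from_config(config)

model = model.half()  # cast the float32 weights to float16

with tempfile.TemporaryDirectory() as save_dir:
    model.save_pretrained(save_dir)  # weights are stored in float16
    # torch_dtype=torch.float16 keeps the checkpoint's dtype on load
    # instead of upcasting back to the default float32
    reloaded = AutoModel.from_pretrained(save_dir, torch_dtype=torch.float16)

print(reloaded.dtype)  # torch.float16
```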
1 vote · 1 answer · 948 views
Can language model inference on a CPU save memory by quantizing?
For example, according to https://cocktailpeanut.github.io/dalai/#/ the relevant figures for LLaMA-65B are:
Full: The model takes up 432.64GB
Quantized: 5.11GB * 8 = 40.88GB
The full model won't fit ...
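A back-of-envelope sketch of why quantization shrinks the memory footprint: the weight storage scales with bytes per parameter. The numbers below are rough estimates for a 65-billion-parameter model; actual on-disk sizes (such as the dalai figures quoted above) differ because of metadata and per-block quantization overhead.

```python
# Approximate weight memory for a 65B-parameter model at various precisions.
params = 65e9

bytes_per_param = {
    "float32": 4,    # full precision
    "float16": 2,    # half precision
    "int8": 1,       # 8-bit quantization
    "int4": 0.5,     # 4-bit quantization (as used by llama.cpp-style tools)
}

for dtype, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1024**3
    print(f"{dtype}: {gb:.1f} GB")
```

Dropping from float32 to 4-bit cuts the weight memory by roughly 8x, which is what makes CPU inference on a single machine plausible.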