477 questions
-1 votes · 0 answers · 20 views
YOLOv8s TensorRT INT8 engine produces wrong bounding boxes with saturated confidence scores on Jetson Orin
I'm trying to quantize a YOLOv8s model to INT8 using TensorRT on a Jetson Orin (JetPack, TensorRT 8.6.2, Ultralytics 8.2.83, CUDA 12.2). The FP16 engine works correctly but the INT8 engine produces ...
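Boxes that go wrong only in the INT8 engine, with confidences pinned at the top of the range, usually point at calibration: if the calibrator underestimates a tensor's dynamic range, everything above that range saturates. A minimal NumPy sketch of symmetric INT8 quantization (just the arithmetic, not TensorRT's implementation) makes the effect visible:

```python
import numpy as np

def quantize_int8(x, amax):
    """Symmetric INT8 quantization: scale derived from the calibrated amax."""
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.1, 0.5, 0.9, 3.0], dtype=np.float32)

# Well-calibrated range covers the data: values round-trip closely.
q_good, s_good = quantize_int8(x, amax=3.0)

# Underestimated range (calibration never saw the large activations):
# everything above amax saturates to 127, flattening the output.
q_bad, s_bad = quantize_int8(x, amax=0.5)
```

If the calibration set is not representative of inference-time inputs, this kind of saturation is a plausible cause of the symptom described above.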
3 votes · 0 answers · 32 views
How to quantize the MLP in an MoE to 4 bits?
I'm doing some research on information encoding with LLMs and need a way to quantize the weights of the MLP layers (MoE experts) to 4 bits, or even to customized mixed precision. Consider
from ...
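For reference, per-channel symmetric k-bit weight quantization is only a few lines; a hedged NumPy sketch (function names are mine, not from any library) where varying `bits` per layer gives a simple form of mixed precision:

```python
import numpy as np

def quantize_kbit(w, bits=4):
    """Per-output-channel symmetric quantization of a weight matrix.

    With bits=4 the integer grid is [-8, 7]; passing a different `bits`
    value per layer is a simple way to get mixed precision.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0             # avoid div-by-zero for all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)   # one expert's MLP weight
q, scale = quantize_kbit(w, bits=4)
w_hat = dequantize(q, scale)
```

For real 4-bit storage you would additionally pack two nibbles per byte; libraries like bitsandbytes handle that packing for you.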
0 votes · 0 answers · 31 views
Post-training quantized model gets the error "Copying from quantized Tensor to non-quantized Tensor is not allowed" even though I'm not copying tensors
I got a pretrained ResNet-18 model from this lane detection repo in order to use it as an ADAS (advanced driver assistance system) function for an electric-car competition. My current goal is ...
0 votes · 1 answer · 124 views
Apply quantization to a CNN
I want to apply a quantization function to a deep CNN. The CNN is used for an image classification task (4 classes), and my data consists of 224×224 images. When I run this code, I get an error. ...
2 votes · 0 answers · 99 views
Issue Replicating TF-Lite Conv2D Quantized Inference Output
I am trying to reproduce the exact layer-wise output of a quantized EfficientNet model (TFLite model, TensorFlow 2.17) by re-implementing Conv2D, DepthwiseConv2D, FullyConnected, Add, Mul, Sub and ...
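When replicating TFLite's integer kernels, the usual stumbling block is requantization: TFLite multiplies the int32 accumulator by a fixed-point multiplier M0 · 2^shift with M0 in [0.5, 1) stored as a Q31 integer, rather than by a float. A simplified sketch of that decomposition (using plain round-half-up instead of gemmlowp's rounding-doubling high-mul, so it can differ from TFLite by 1 LSB in tie cases):

```python
import numpy as np

def quantize_multiplier(real_multiplier):
    """Decompose a positive real multiplier into M0 * 2**shift with
    M0 in [0.5, 1), stored as a Q31 fixed-point integer."""
    m0, shift = np.frexp(real_multiplier)      # real = m0 * 2**shift
    q_m0 = int(round(float(m0) * (1 << 31)))
    shift = int(shift)
    if q_m0 == (1 << 31):                      # m0 rounded up to 1.0
        q_m0 //= 2
        shift += 1
    return q_m0, shift

def multiply_by_quantized_multiplier(acc, q_m0, shift):
    """Integer-only equivalent of round(acc * real_multiplier)."""
    prod = int(acc) * q_m0                     # wide integer product
    total_shift = 31 - shift
    rounding = 1 << (total_shift - 1)
    return (prod + rounding) >> total_shift    # arithmetic shift floors
```

Matching the reference kernels exactly also requires applying the per-channel multipliers, zero-point offsets, and saturating casts in the same order TFLite does.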
0 votes · 2 answers · 237 views
Why does TFLite INT8 quantization decompose BatchMatMul (from Einsum) into many FullyConnected layers?
I’m debugging a model-conversion (onnx2tf) and post-training quantization issue involving Einsum, BatchMatMul, and FullyConnected layers across different model formats.
Pipeline:
ONNX → TF ...
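One way to see why such a decomposition is value-preserving: a BatchMatMul is exactly a stack of independent matmuls, which a converter can lower to one FullyConnected op per slice (each then carrying its own quantization parameters). A small NumPy check of that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3, 5))   # batch of 4 [3x5] matrices
b = rng.normal(size=(4, 5, 2))

# One BatchMatMul, expressed as an einsum.
batched = np.einsum('bij,bjk->bik', a, b)

# The same result as 4 independent matmuls, one per batch element --
# the shape a converter can emit as separate FullyConnected layers.
unrolled = np.stack([a[i] @ b[i] for i in range(a.shape[0])])
```

This only illustrates the numerics; whether onnx2tf actually unrolls a given Einsum this way depends on its lowering rules for that pattern.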
0 votes · 0 answers · 58 views
Error while converting quantized Torch model to ONNX
I’m applying QAT to a YOLOv8n model with the following configuration:
QConfig(
    activation=FakeQuantize.with_args(
        observer=MovingAverageMinMaxObserver,
        quant_min=0,
        quant_max=...
1 vote · 0 answers · 42 views
Quantization in TensorFlow 2: instance error
I am trying to quantize a model in TensorFlow using tfmot.
This is a sample model,
inputs = keras.layers.Input(shape=(512, 512, 1))
x = keras.layers.Conv2D(3, kernel_size=1, padding='same')(inputs)
x =...
0 votes · 1 answer · 313 views
RuntimeError: CUDA error: named symbol not found when using TorchAoConfig with Qwen2.5-VL-7B-Instruct model
I'm trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to how it's described in the documentation here), but I'm getting ...
1 vote · 0 answers · 152 views
Fine-tuned LLaMA 2–7B with QLoRA, but reloading fails: missing 4-bit metadata. Likely saved after LoRA + resize. Need the proper 4-bit save method
I’ve been working on fine-tuning LLaMA 2–7B using QLoRA with bitsandbytes 4-bit quantization and ran into a weird issue. I did adaptive pretraining on Arabic data with a custom tokenizer (vocab size ~...
0 votes · 2 answers · 69 views
Straight-through estimation for vector quantization inside a recurrent neural network
In my model, I use vector quantization (VQ) inside a recurrent neural network. The VQ is trained using straight-through estimation, with that particular code being identical to [1]:
...
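For context, the forward pass of a VQ layer is just a nearest-neighbour codebook lookup; the straight-through estimator only changes the backward pass. A NumPy sketch of the forward (the STE trick `z + stop_gradient(q - z)` is noted in the docstring, since NumPy has no autograd):

```python
import numpy as np

def vq_forward(z, codebook):
    """Nearest-neighbour vector quantization (forward pass only).

    z:        (batch, d) encoder outputs
    codebook: (K, d) learned code vectors

    In training, the straight-through estimator replaces q with
    z + stop_gradient(q - z), so gradients flow to z as if the
    quantizer were the identity.
    """
    # Squared distances between each z and each code vector.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]], dtype=np.float32)
z = np.array([[0.1, -0.2], [0.9, 1.2]], dtype=np.float32)
q, idx = vq_forward(z, codebook)
```

Inside an RNN, the subtlety is that the STE is applied at every timestep, so gradient bias from the estimator can accumulate across the unrolled sequence.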
0 votes · 0 answers · 246 views
Cannot use bitsandbytes for quantization of an LLM
I am using an LLM and want to use quantization to speed up inference. I am running on an NVIDIA Jetson AGX Orin, which has an ARM-based architecture. I use this code
model_name = "tiiuae/...
0 votes · 0 answers · 41 views
Mismatch between PyTorch inference and manual implementation
I’m trying to manually reproduce the inference forward pass to understand exactly how quantized inference works. To do so, I trained and quantized a model in PyTorch using QAT, then manually simulated the ...
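As a baseline for this kind of comparison, the standard affine-quantized linear layer can be written in a few lines of NumPy. All names and qparams below are made up for illustration; PyTorch's fbgemm/qnnpack kernels use fixed-point requantization internally, so an exact bit-match additionally requires reproducing their rounding:

```python
import numpy as np

def quantized_linear(q_x, q_w, s_x, zp_x, s_w, s_y, zp_y):
    """Affine-quantized linear layer: real = scale * (q - zero_point).
    Weights symmetric (zero_point 0); activations asymmetric uint8."""
    # int32 accumulation, zero points subtracted up front
    acc = (q_x.astype(np.int32) - zp_x) @ q_w.astype(np.int32).T
    # requantize with a single float multiplier, then round and re-offset
    y = np.round(acc * (s_x * s_w / s_y)) + zp_y
    return np.clip(y, 0, 255).astype(np.uint8)

# toy qparams, chosen by hand for the example
q_x = np.array([[12, 8]], dtype=np.uint8)            # one input row
q_w = np.array([[3, -4], [2, 6]], dtype=np.int8)     # 2 out-features x 2 in
y = quantized_linear(q_x, q_w, s_x=0.1, zp_x=10, s_w=0.05, s_y=0.05, zp_y=20)
```

Off-by-one mismatches against PyTorch almost always come from the rounding mode or the order in which zero points and clamping are applied.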
1 vote · 0 answers · 109 views
How to convert a QAT (quantization-aware trained) TensorFlow graph into a TFLite model?
I am quantizing a neural network using QAT and I want to convert it to TFLite.
Quantization nodes get added to the skeleton graph and we get a new graph.
I am able to load the trained QAT ...
0 votes · 0 answers · 47 views
Stable Diffusion v1.4 PTQ on both weights and activations
I'm currently working on quantizing the Stable Diffusion v1.4 checkpoint without relying on external libraries such as torch.quantization or other quantization toolkits. I’m exploring two scenarios:
...
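For the activation side, a from-scratch PTQ pipeline usually starts with a calibration observer that tracks the running range of each tensor. A minimal min/max observer sketch in NumPy (asymmetric uint8; class and method names are mine, and real calibrators such as percentile or entropy-based ones are more robust to outliers):

```python
import numpy as np

class MinMaxObserver:
    """Running min/max observer for asymmetric uint8 activation quantization."""

    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def observe(self, x):
        """Update the running range from one calibration batch."""
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

    def qparams(self, qmin=0, qmax=255):
        """Derive (scale, zero_point); the range must include 0 so that
        real zero is exactly representable."""
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (qmax - qmin)
        zp = int(round(qmin - lo / scale))
        return scale, zp

obs = MinMaxObserver()
obs.observe(np.array([-1.0, 0.5]))   # calibration batch 1
obs.observe(np.array([2.0]))         # calibration batch 2
scale, zp = obs.qparams()
```

For weight-only quantization the observer is unnecessary (weights are static); it is the activation scenario that needs a calibration pass over representative inputs.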