
Inspired by "Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model", I decided to reproduce the quantization procedure described in the paper. However, I am confused about how to set the offset variable during quantization.

INPUT: A (FP32 tensor of shape [1, 4, 1024, 256])

import torch
from scipy.stats import norm

# Quantization (offset set to zero for now; this is the part I am unsure about)
offset = torch.zeros_like(A)
scale = 255 / (torch.max(A) - torch.min(A))
A_int8 = (A - offset) * scale

# Probability Distribution
# keepdim=True so the per-channel mean/std broadcast against the full tensor;
# scipy's norm.pdf returns numpy arrays
P = norm.pdf(A, torch.mean(A, dim=[2, 3], keepdim=True), torch.std(A, dim=[2, 3], keepdim=True))
Q = norm.pdf(A_int8, torch.mean(A_int8, dim=[2, 3], keepdim=True), torch.std(A_int8, dim=[2, 3], keepdim=True))
P = torch.from_numpy(P)
Q = torch.from_numpy(Q)

# KLD
kld = (P * (P / Q).log()).sum()
print(kld)    

# After this, I'm going to apply self-attention operation.
# B_int8 = A_int8.clone()
# AB = A_int8.matmul(B_int8.transpose(-1, -2))
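For comparison, this is the asymmetric (min-max) scheme I have seen in other references, where the offset is the tensor minimum rather than zero. The helper names below are my own, and I am not sure this is what the paper intends:

import torch

def quantize_minmax_uint8(x):
    # Asymmetric min-max quantization: map [min(x), max(x)] onto [0, 255]
    x_min, x_max = torch.min(x), torch.max(x)
    scale = 255 / (x_max - x_min)
    offset = x_min                      # offset = tensor minimum, not zero
    q = torch.clamp(torch.round((x - offset) * scale), 0, 255).to(torch.uint8)
    return q, scale, offset

def dequantize(q, scale, offset):
    # Map quantized values back to the FP32 range for comparison against the original
    return q.to(torch.float32) / scale + offset

# A is the FP32 input tensor described above
A_q, scale, offset = quantize_minmax_uint8(A)
A_dq = dequantize(A_q, scale, offset)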

I get a positive KLD value for now, but I'm not sure I did this the right way. Any help or advice is appreciated.

1 Answer


KL divergence can be calculated as the negative sum, over each event x in X, of the probability of the event under P multiplied by the log of the probability of the event under Q over the probability of the event under P:

KL(P || Q) = -sum_{x in X} P(x) * log(Q(x) / P(x))

This is the same as the positive sum, over each event, of the probability under P multiplied by the log of the probability under P over the probability under Q:

KL(P || Q) = sum_{x in X} P(x) * log(P(x) / Q(x))

"The K-L divergence is only defined if P and Q both sum to 1 and if Q(i) > 0 for any i such that P(i) > 0."
