384 questions
0 votes · 1 answer · 72 views
PyTorch + Optuna causes random segmentation fault inside TransformerEncoderLayer (PyTorch 2.6, CUDA 12)
I'm running into a segmentation fault when training a Transformer model with PyTorch 2.6.0 and Optuna on CUDA (12.4).
The exact same code used to work fine; the issue appeared only after I introduced Optuna.
...
2 votes · 1 answer · 194 views
Llama_cookbook: why are labels not shifted for CausalLM?
I'm studying the llama_cookbook repo, in particular their finetuning example.
This example uses the LlamaForCausalLM model and the samsum_dataset (input: dialog, output: summary). Now, looking at how they ...
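For reference, Hugging Face causal LM classes such as LlamaForCausalLM shift the labels internally when they compute the loss, which is why a dataset can pass labels identical to input_ids. A minimal sketch of that internal shift (the function name here is mine, not from the repo):

    import torch
    import torch.nn.functional as F

    def causal_lm_loss(logits, labels):
        # logits: (batch, seq, vocab); labels: (batch, seq); -100 marks ignored positions
        shift_logits = logits[..., :-1, :].contiguous()   # position t predicts token t+1
        shift_labels = labels[..., 1:].contiguous()
        return F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=-100,
        )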
0 votes · 0 answers · 113 views
Why is attention scaled by sqrt(d_k) in Transformer architectures?
I have this code in a transformer model:
keys = x @ W_key
queries = x @ W_query
values = x @ W_value
attention_scores = queries @ keys.T
# keys.shape[-1]**0.5: used to scale the attention scores before ...
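A self-contained continuation of that snippet with made-up shapes, showing where the sqrt(d_k) divisor goes and the usual rationale: a dot product of d_k roughly unit-variance terms has variance about d_k, so dividing by sqrt(d_k) keeps the scores' variance near 1 and the softmax out of its saturated region.

    import torch

    torch.manual_seed(0)
    num_tokens, d_in, d_k = 5, 16, 8
    x = torch.randn(num_tokens, d_in)
    W_query, W_key, W_value = (torch.randn(d_in, d_k) for _ in range(3))

    queries, keys, values = x @ W_query, x @ W_key, x @ W_value
    attention_scores = queries @ keys.T / keys.shape[-1] ** 0.5   # scale by sqrt(d_k)
    attention_weights = torch.softmax(attention_scores, dim=-1)
    context = attention_weights @ values                          # (num_tokens, d_k)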
0 votes · 1 answer · 39 views
Anomalous behavior of the attention layer for different input vectors
I am currently trying to implement the attention layer from the transformer architecture, but it is not working as I expect. I have been unable to figure out what the problem is for several days now. ...
4 votes · 1 answer · 307 views
Load a Phi-3 model, extract the attention layer, and visualize it
I would like to visualize the attention layer of a Phi-3-medium-4k-instruct (or mini) model downloaded from Hugging Face. In particular, I am using the following model and tokenizer:
import torch
from ...
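A hedged sketch of one way to get per-layer attention maps out of a Hugging Face model: request output_attentions=True and force the eager attention implementation, since fused/SDPA kernels usually do not materialize the weights. The attn_implementation kwarg needs a reasonably recent transformers release, and older ones may also need trust_remote_code=True for Phi-3.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"   # mini variant to keep memory modest
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float32,
        attn_implementation="eager",   # eager path returns attention weights
    )

    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)

    # out.attentions: one tensor per layer, each (batch, num_heads, seq_len, seq_len)
    attn_layer0_head0 = out.attentions[0][0, 0]
    print(attn_layer0_head0.shape)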
0 votes · 1 answer · 293 views
Trouble understanding the formula for estimating dense self-attention FLOPs per token, given as 6LH(2QT)
Appendix B of the PaLM paper (https://arxiv.org/pdf/2204.02311) describes a metric called "Model FLOPs Utilization (MFU)" and the formula for estimating it. Its computation makes ...
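My hedged reading of the 6LH(2QT) term, with (as I read the appendix) L = layers, H = heads, Q = head dimension, T = sequence length: per layer and per head, the QK^T score matmul and the attention-weighted sum over V each cost about 2QT FLOPs per token (a multiply-add counted as 2 FLOPs), and forward plus backward is roughly 3x the forward cost, so 3 · 2 · (2QT) = 6(2QT) per layer-head. Illustrative numbers only:

    def attention_flops_per_token(L, H, Q, T):
        qk_scores = 2 * Q * T        # one query dotted against all T keys
        weighted_v = 2 * Q * T       # attention-weighted sum over the T value vectors
        forward = qk_scores + weighted_v
        return L * H * 3 * forward   # forward + backward ~ 3x forward = 6LH(2QT)

    print(attention_flops_per_token(L=32, H=16, Q=128, T=2048))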
1 vote · 1 answer · 322 views
Masked self-attention not working as expected when each token also masks itself
I was developing a self-attention module using PyTorch's nn.MultiheadAttention (MHA). My goal was to implement a causal mask that forces each token to attend only to the tokens before itself, ...
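A common cause of this symptom is building the mask with diagonal=0, which also hides each token from itself (and leaves the first query row fully masked, so its softmax produces NaN). A minimal sketch with made-up sizes, using diagonal=1 so the diagonal stays visible:

    import torch
    import torch.nn as nn

    seq_len, d_model, nhead = 6, 16, 4
    mha = nn.MultiheadAttention(d_model, nhead, batch_first=True)
    x = torch.randn(1, seq_len, d_model)

    # True = "not allowed to attend"; diagonal=1 masks only strictly-future positions
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    out, weights = mha(x, x, x, attn_mask=causal_mask)
    print(weights[0])   # lower-triangular pattern: token i attends to tokens 0..i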
2 votes · 0 answers · 193 views
Multimodal cross-attention
I am dealing with two embeddings, text and image; both are the last_hidden_state of transformer models (BERT and ViT), so the shapes are (batch, seq, emb_dim). I want to feed text information into the image using ...
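One common pattern for feeding text into the image stream is cross-attention with the image tokens as queries and the text tokens as keys/values. A minimal sketch with made-up sequence lengths (the 768/197 sizes below just mirror typical BERT/ViT dimensions):

    import torch
    import torch.nn as nn

    batch, text_len, img_len, emb_dim = 2, 20, 197, 768
    text_hidden = torch.randn(batch, text_len, emb_dim)    # e.g. BERT last_hidden_state
    image_hidden = torch.randn(batch, img_len, emb_dim)    # e.g. ViT last_hidden_state

    cross_attn = nn.MultiheadAttention(embed_dim=emb_dim, num_heads=8, batch_first=True)
    fused_image, attn_weights = cross_attn(
        query=image_hidden, key=text_hidden, value=text_hidden
    )
    print(fused_image.shape)   # (batch, img_len, emb_dim): image stream, now text-conditioned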
2 votes · 2 answers · 343 views
Normalization of token embeddings in BERT encoder blocks
Following the multi-headed attention layer in a BERT encoder block, is layer normalization done separately on the embedding of each token (i.e., one mean and variance per token embedding), or on the ...
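For what it's worth, nn.LayerNorm(hidden_size), the layer BERT-style encoders use, normalizes over the last dimension only, i.e. one mean and one variance per token embedding. A quick check:

    import torch
    import torch.nn as nn

    batch, seq, hidden = 2, 4, 768
    x = torch.randn(batch, seq, hidden)

    ln = nn.LayerNorm(hidden)                      # normalized_shape = (hidden,) -> per-token statistics
    y = ln(x)
    print(y.mean(dim=-1).abs().max())              # ~0 for every (batch, position) pair
    print(y.var(dim=-1, unbiased=False).mean())    # ~1 per token (before the affine gain/bias)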
3 votes · 1 answer · 589 views
Get the attention scores of a pretrained transformer in PyTorch
I've been trying to look at the attention scores of a pretrained transformer when I pass specific data in. It's specifically a PyTorch Transformer. I've tried using forward hooks, but I'm only able to ...
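One likely reason forward hooks come back empty is that nn.TransformerEncoderLayer calls its internal MultiheadAttention with need_weights=False, so the weights are never computed. A hedged workaround sketch that patches each layer's self_attn to request weights (keep the module in training mode, the default, so the fused fast path that bypasses self_attn is not taken):

    import torch
    import torch.nn as nn

    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2,
    )

    attn_maps = {}

    def patch_attention(name, mha):
        original_forward = mha.forward

        def wrapped(*args, **kwargs):
            kwargs["need_weights"] = True            # the encoder layer passes False by default
            kwargs["average_attn_weights"] = False   # keep per-head maps
            out, weights = original_forward(*args, **kwargs)
            attn_maps[name] = weights.detach()
            return out, weights

        mha.forward = wrapped

    for i, layer in enumerate(encoder.layers):
        patch_attention(f"layer{i}", layer.self_attn)

    x = torch.randn(1, 10, 64)
    _ = encoder(x)
    print({k: v.shape for k, v in attn_maps.items()})   # (batch, heads, seq, seq) per layer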
0 votes · 1 answer · 68 views
PyTorch Linear operations vary widely after reshaping
Here's an example function that illustrates this problem. It's an attempt to get the Q matrix (for KV caching) of a concatenated QKV matrix (from a transformer large language model). Before the QKV ...
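For the KV-caching case in particular, slicing the fused output and slicing the fused weight should give identical Q matrices, provided the weight layout really is [W_q; W_k; W_v] stacked along dim 0 (this is how nn.MultiheadAttention.in_proj_weight is laid out, but some checkpoints interleave heads instead, which is a frequent source of post-reshape mismatches). A small self-contained check:

    import torch

    torch.manual_seed(0)
    seq, d_model = 5, 8
    x = torch.randn(seq, d_model)
    w_qkv = torch.randn(3 * d_model, d_model)      # assumed layout: [W_q; W_k; W_v] stacked on dim 0

    qkv = x @ w_qkv.T                              # fused projection, (seq, 3 * d_model)
    q_from_output = qkv[:, :d_model]               # route 1: slice the fused output

    w_q = w_qkv[:d_model, :]                       # route 2: slice the weight, then project
    q_from_weight = x @ w_q.T

    print(torch.allclose(q_from_output, q_from_weight))   # True if the layout assumption holds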
0 votes · 1 answer · 151 views
Output of a custom attention mechanism implementation does not match torch.nn.MultiheadAttention
I was trying to create my own attention function for a project I'm working on. However, when I compared the output and weights from my code with those from torch.nn.MultiheadAttention, I noticed that ...
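The usual culprits are the learned in_proj/out_proj inside nn.MultiheadAttention and the per-head scaling by sqrt(head_dim) rather than sqrt(embed_dim). A hedged sketch that reuses the module's own projection weights so a manual implementation can be compared like-for-like:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d_model, nhead, L = 8, 2, 5
    head_dim = d_model // nhead
    mha = nn.MultiheadAttention(d_model, nhead, batch_first=True, bias=False)
    x = torch.randn(1, L, d_model)

    ref_out, ref_w = mha(x, x, x, average_attn_weights=False)

    # Manual path using the exact same projection weights as the module.
    w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)
    q = (x @ w_q.T).view(1, L, nhead, head_dim).transpose(1, 2)
    k = (x @ w_k.T).view(1, L, nhead, head_dim).transpose(1, 2)
    v = (x @ w_v.T).view(1, L, nhead, head_dim).transpose(1, 2)
    weights = ((q @ k.transpose(-2, -1)) / head_dim ** 0.5).softmax(dim=-1)
    out = (weights @ v).transpose(1, 2).reshape(1, L, d_model) @ mha.out_proj.weight.T

    print(torch.allclose(out, ref_out, atol=1e-5), torch.allclose(weights, ref_w, atol=1e-5))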
1 vote · 0 answers · 86 views
Multi-head self-attention for sentiment analysis gives inaccurate results
I am trying to implement a model for sentiment analysis on text data using self-attention. In this example, I am using multi-head attention but cannot be sure whether the results are accurate or not. It ...
1 vote · 0 answers · 37 views
Shape of Data2Vec output dimensions
I'm using Data2Vec from the Huggingface hub to feature extract on three modalities of a dataset. After processing I have tensors the shape of [1,768] for text, [1,499,768] for audio, and [1,197,768] ...
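Those shapes read as (batch, sequence positions, hidden): the text output appears to have been pooled to a single 768-dim vector already, while audio keeps 499 frames and image keeps 197 patch tokens (196 patches plus a CLS token). If the goal is one vector per modality, one hedged option is to mean-pool the sequence dimension:

    import torch

    text_feat = torch.randn(1, 768)          # apparently already pooled
    audio_feat = torch.randn(1, 499, 768)    # (batch, time frames, hidden)
    image_feat = torch.randn(1, 197, 768)    # (batch, patches + CLS token, hidden)

    audio_pooled = audio_feat.mean(dim=1)    # -> (1, 768)
    image_pooled = image_feat.mean(dim=1)    # -> (1, 768)
    print(text_feat.shape, audio_pooled.shape, image_pooled.shape)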
1 vote · 0 answers · 129 views
Attention Tensor Shape meaning
While extracting attention from a model, I see that there is a step where the attention changes shape after matmul() with v (value).
The shape goes from:
attention_probs 2 shape: torch....
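That shape change is expected: attention probabilities are (batch, heads, query_len, key_len), and the matmul with v contracts the key dimension, leaving one head_dim-sized context vector per query position. A toy illustration with invented sizes:

    import torch

    batch, heads, seq, head_dim = 2, 12, 10, 64
    attention_probs = torch.randn(batch, heads, seq, seq).softmax(dim=-1)  # (B, H, T_q, T_k)
    v = torch.randn(batch, heads, seq, head_dim)                           # (B, H, T_k, head_dim)

    context = attention_probs @ v            # key dimension is summed away
    print(context.shape)                     # (batch, heads, seq, head_dim)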