384 questions
0 votes · 1 answer · 72 views
PyTorch + Optuna causes random segmentation fault inside TransformerEncoderLayer (PyTorch 2.6, CUDA 12)
I'm running into a segmentation fault when training a Transformer model with PyTorch 2.6.0 and Optuna on CUDA (12.4).
The exact same code used to work fine; the issue appeared only after I introduced Optuna.
...
2 votes · 1 answer · 194 views
Llama_cookbook: why are labels not shifted for CausalLM?
I'm studying the llama_cookbook repo, in particular their finetuning example.
This example uses the LlamaForCausalLM model and the samsum_dataset (input: dialog, output: summary). Now, looking at how they ...
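For reference, Hugging Face causal LM classes such as LlamaForCausalLM shift the labels internally when they compute the loss, which is why a dataset can pass labels identical to input_ids. A minimal sketch of that internal shift (the function name here is mine, not from the repo):

    import torch
    import torch.nn.functional as F

    def causal_lm_loss(logits, labels):
        # logits: (batch, seq, vocab); labels: (batch, seq); -100 marks ignored positions
        shift_logits = logits[..., :-1, :].contiguous()   # position t predicts token t+1
        shift_labels = labels[..., 1:].contiguous()
        return F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=-100,
        )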
0 votes · 0 answers · 113 views
Why is attention scaled by sqrt(d_k) in Transformer architectures?
I have this code in a transformer model:
keys = x @ W_key
queries = x @ W_query
values = x @ W_value
attention_scores = queries @ keys.T
# keys.shape[-1]**0.5: used to scale the attention scores before ...
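A self-contained continuation of that snippet with made-up shapes, showing where the sqrt(d_k) divisor goes and the usual rationale: a dot product of d_k roughly unit-variance terms has variance about d_k, so dividing by sqrt(d_k) keeps the scores' variance near 1 and the softmax out of its saturated region.

    import torch

    torch.manual_seed(0)
    num_tokens, d_in, d_k = 5, 16, 8
    x = torch.randn(num_tokens, d_in)
    W_query, W_key, W_value = (torch.randn(d_in, d_k) for _ in range(3))

    queries, keys, values = x @ W_query, x @ W_key, x @ W_value
    attention_scores = queries @ keys.T / keys.shape[-1] ** 0.5   # scale by sqrt(d_k)
    attention_weights = torch.softmax(attention_scores, dim=-1)
    context = attention_weights @ values                          # (num_tokens, d_k)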
0 votes · 1 answer · 39 views
Anomalous behavior of the attention layer for different input vectors
I am currently trying to implement the attention layer from the transformer architecture, but it is not working as I expect. I have been unable to figure out what the problem is for several days now. ...
4 votes · 1 answer · 307 views
Load a Phi-3 model, extract the attention layer, and visualize it
I would like to visualize the attention layer of a Phi-3-medium-4k-instruct (or mini) model downloaded from Hugging Face. In particular, I am using the following model and tokenizer:
import torch
from ...
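A hedged sketch of one way to get per-layer attention maps out of a Hugging Face model: request output_attentions=True and force the eager attention implementation, since fused/SDPA kernels usually do not materialize the weights. The attn_implementation kwarg needs a reasonably recent transformers release, and older ones may also need trust_remote_code=True for Phi-3.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"   # mini variant to keep memory modest
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float32,
        attn_implementation="eager",   # eager path returns attention weights
    )

    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)

    # out.attentions: one tensor per layer, each (batch, num_heads, seq_len, seq_len)
    attn_layer0_head0 = out.attentions[0][0, 0]
    print(attn_layer0_head0.shape)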
0 votes · 1 answer · 293 views
Trouble understanding the formula for estimating dense self-attention FLOPs per token, given as 6LH(2QT)
Appendix B of the PaLM paper (https://arxiv.org/pdf/2204.02311) describes a metric called "Model FLOPs Utilization (MFU)" and the formula for estimating it. Its computation makes ...
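My hedged reading of the 6LH(2QT) term, with (as I read the appendix) L = layers, H = heads, Q = head dimension, T = sequence length: per layer and per head, the QK^T score matmul and the attention-weighted sum over V each cost about 2QT FLOPs per token (a multiply-add counted as 2 FLOPs), and forward plus backward is roughly 3x the forward cost, so 3 · 2 · (2QT) = 6(2QT) per layer-head. Illustrative numbers only:

    def attention_flops_per_token(L, H, Q, T):
        qk_scores = 2 * Q * T        # one query dotted against all T keys
        weighted_v = 2 * Q * T       # attention-weighted sum over the T value vectors
        forward = qk_scores + weighted_v
        return L * H * 3 * forward   # forward + backward ~ 3x forward = 6LH(2QT)

    print(attention_flops_per_token(L=32, H=16, Q=128, T=2048))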
1 vote · 1 answer · 322 views
Masked self-attention not working as expected when each token also masks itself
I was developing a self-attention module using PyTorch's nn.MultiheadAttention (MHA). My goal was to implement a causal mask that forces each token to attend only to the tokens before itself, ...
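A common cause of this symptom is building the mask with diagonal=0, which also hides each token from itself (and leaves the first query row fully masked, so its softmax produces NaN). A minimal sketch with made-up sizes, using diagonal=1 so the diagonal stays visible:

    import torch
    import torch.nn as nn

    seq_len, d_model, nhead = 6, 16, 4
    mha = nn.MultiheadAttention(d_model, nhead, batch_first=True)
    x = torch.randn(1, seq_len, d_model)

    # True = "not allowed to attend"; diagonal=1 masks only strictly-future positions
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    out, weights = mha(x, x, x, attn_mask=causal_mask)
    print(weights[0])   # lower-triangular pattern: token i attends to tokens 0..i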
2 votes · 0 answers · 193 views
Multimodal cross-attention
I am dealing with two embeddings, text and image; both are the last_hidden_state of transformer models (BERT and ViT), so the shapes are (batch, seq, emb_dim). I want to feed text information into the image using ...
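One common pattern for feeding text into the image stream is cross-attention with the image tokens as queries and the text tokens as keys/values. A minimal sketch with made-up sequence lengths (the 768/197 sizes below just mirror typical BERT/ViT dimensions):

    import torch
    import torch.nn as nn

    batch, text_len, img_len, emb_dim = 2, 20, 197, 768
    text_hidden = torch.randn(batch, text_len, emb_dim)    # e.g. BERT last_hidden_state
    image_hidden = torch.randn(batch, img_len, emb_dim)    # e.g. ViT last_hidden_state

    cross_attn = nn.MultiheadAttention(embed_dim=emb_dim, num_heads=8, batch_first=True)
    fused_image, attn_weights = cross_attn(
        query=image_hidden, key=text_hidden, value=text_hidden
    )
    print(fused_image.shape)   # (batch, img_len, emb_dim): image stream, now text-conditioned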
2 votes · 2 answers · 343 views
Normalization of token embeddings in BERT encoder blocks
Following the multi-headed attention layer in a BERT encoder block, is layer normalization done separately on the embedding of each token (i.e., one mean and variance per token embedding), or on the ...
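For what it's worth, nn.LayerNorm(hidden_size), the layer BERT-style encoders use, normalizes over the last dimension only, i.e. one mean and one variance per token embedding. A quick check:

    import torch
    import torch.nn as nn

    batch, seq, hidden = 2, 4, 768
    x = torch.randn(batch, seq, hidden)

    ln = nn.LayerNorm(hidden)                      # normalized_shape = (hidden,) -> per-token statistics
    y = ln(x)
    print(y.mean(dim=-1).abs().max())              # ~0 for every (batch, position) pair
    print(y.var(dim=-1, unbiased=False).mean())    # ~1 per token (before the affine gain/bias)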
3 votes · 1 answer · 589 views
Get the attention scores of a pretrained transformer in PyTorch
I've been trying to look at the attention scores of a pretrained transformer when I pass specific data in. It's specifically a PyTorch Transformer. I've tried using forward hooks, but I'm only able to ...
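One likely reason forward hooks come back empty is that nn.TransformerEncoderLayer calls its internal MultiheadAttention with need_weights=False, so the weights are never computed. A hedged workaround sketch that patches each layer's self_attn to request weights (keep the module in training mode, the default, so the fused fast path that bypasses self_attn is not taken):

    import torch
    import torch.nn as nn

    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2,
    )

    attn_maps = {}

    def patch_attention(name, mha):
        original_forward = mha.forward

        def wrapped(*args, **kwargs):
            kwargs["need_weights"] = True            # the encoder layer passes False by default
            kwargs["average_attn_weights"] = False   # keep per-head maps
            out, weights = original_forward(*args, **kwargs)
            attn_maps[name] = weights.detach()
            return out, weights

        mha.forward = wrapped

    for i, layer in enumerate(encoder.layers):
        patch_attention(f"layer{i}", layer.self_attn)

    x = torch.randn(1, 10, 64)
    _ = encoder(x)
    print({k: v.shape for k, v in attn_maps.items()})   # (batch, heads, seq, seq) per layer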
0 votes · 1 answer · 68 views
PyTorch Linear operations vary widely after reshaping
Here's an example function that illustrates this problem. It's an attempt to get the Q matrix (for KV caching) of a concatenated QKV matrix (from a transformer large language model). Before the QKV ...
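For the KV-caching case in particular, slicing the fused output and slicing the fused weight should give identical Q matrices, provided the weight layout really is [W_q; W_k; W_v] stacked along dim 0 (this is how nn.MultiheadAttention.in_proj_weight is laid out, but some checkpoints interleave heads instead, which is a frequent source of post-reshape mismatches). A small self-contained check:

    import torch

    torch.manual_seed(0)
    seq, d_model = 5, 8
    x = torch.randn(seq, d_model)
    w_qkv = torch.randn(3 * d_model, d_model)      # assumed layout: [W_q; W_k; W_v] stacked on dim 0

    qkv = x @ w_qkv.T                              # fused projection, (seq, 3 * d_model)
    q_from_output = qkv[:, :d_model]               # route 1: slice the fused output

    w_q = w_qkv[:d_model, :]                       # route 2: slice the weight, then project
    q_from_weight = x @ w_q.T

    print(torch.allclose(q_from_output, q_from_weight))   # True if the layout assumption holds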
0 votes · 1 answer · 151 views
Output of a custom attention mechanism implementation does not match torch.nn.MultiheadAttention
I was trying to create my own attention function for a project I'm working on. However, when I compared the output and weights from my code with those from torch.nn.MultiheadAttention, I noticed that ...
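The usual culprits are the learned in_proj/out_proj inside nn.MultiheadAttention and the per-head scaling by sqrt(head_dim) rather than sqrt(embed_dim). A hedged sketch that reuses the module's own projection weights so a manual implementation can be compared like-for-like:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d_model, nhead, L = 8, 2, 5
    head_dim = d_model // nhead
    mha = nn.MultiheadAttention(d_model, nhead, batch_first=True, bias=False)
    x = torch.randn(1, L, d_model)

    ref_out, ref_w = mha(x, x, x, average_attn_weights=False)

    # Manual path using the exact same projection weights as the module.
    w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)
    q = (x @ w_q.T).view(1, L, nhead, head_dim).transpose(1, 2)
    k = (x @ w_k.T).view(1, L, nhead, head_dim).transpose(1, 2)
    v = (x @ w_v.T).view(1, L, nhead, head_dim).transpose(1, 2)
    weights = ((q @ k.transpose(-2, -1)) / head_dim ** 0.5).softmax(dim=-1)
    out = (weights @ v).transpose(1, 2).reshape(1, L, d_model) @ mha.out_proj.weight.T

    print(torch.allclose(out, ref_out, atol=1e-5), torch.allclose(weights, ref_w, atol=1e-5))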
1 vote · 0 answers · 86 views
Multi-head self-attention for sentiment analysis gives inaccurate results
I am trying to implement a model for sentiment analysis on text data using self-attention. In this example, I am using multi-head attention but cannot be sure whether the results are accurate or not. It ...
1 vote · 0 answers · 37 views
Shape of Data2Vec output dimensions
I'm using Data2Vec from the Huggingface hub to feature extract on three modalities of a dataset. After processing I have tensors the shape of [1,768] for text, [1,499,768] for audio, and [1,197,768] ...
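Those shapes read as (batch, sequence positions, hidden): the text output appears to have been pooled to a single 768-dim vector already, while audio keeps 499 frames and image keeps 197 patch tokens (196 patches plus a CLS token). If the goal is one vector per modality, one hedged option is to mean-pool the sequence dimension:

    import torch

    text_feat = torch.randn(1, 768)          # apparently already pooled
    audio_feat = torch.randn(1, 499, 768)    # (batch, time frames, hidden)
    image_feat = torch.randn(1, 197, 768)    # (batch, patches + CLS token, hidden)

    audio_pooled = audio_feat.mean(dim=1)    # -> (1, 768)
    image_pooled = image_feat.mean(dim=1)    # -> (1, 768)
    print(text_feat.shape, audio_pooled.shape, image_pooled.shape)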
1 vote · 0 answers · 129 views
Attention Tensor Shape meaning
While extracting attention from a model, I see that there is a step where the attention changes shape after matmul() with v (value).
The shape goes from:
attention_probs 2 shape: torch....
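That shape change is expected: attention probabilities are (batch, heads, query_len, key_len), and the matmul with v contracts the key dimension, leaving one head_dim-sized context vector per query position. A toy illustration with invented sizes:

    import torch

    batch, heads, seq, head_dim = 2, 12, 10, 64
    attention_probs = torch.randn(batch, heads, seq, seq).softmax(dim=-1)  # (B, H, T_q, T_k)
    v = torch.randn(batch, heads, seq, head_dim)                           # (B, H, T_k, head_dim)

    context = attention_probs @ v            # key dimension is summed away
    print(context.shape)                     # (batch, heads, seq, head_dim)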