0 votes
1 answer
72 views

I'm running into a segmentation fault when training a Transformer model with PyTorch 2.6.0 and Optuna on CUDA 12.4. The exact same code used to work fine; the issue appeared only after adding Optuna. ...
Angelo • 665
2 votes
1 answer
194 views

I'm studying the llama_cookbook repo, in particular its finetuning example. This example uses the LlamaForCausalLM model and the samsum_dataset (input: dialog, output: summary). Now, looking at how they ...
Dmitry • 340
0 votes
0 answers
113 views

I have this code in a transformer model:
keys = x @ W_key
queries = x @ W_query
values = x @ W_value
attention_scores = queries @ keys.T
# keys.shape[-1]**0.5: used to scale the attention scores before ...
Yilmaz • 51.4k
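A minimal sketch of the scaling step that question is working toward, assuming single-head self-attention with the projection names from the snippet (W_query, W_key, W_value); dividing by sqrt(d_k) keeps the softmax inputs from growing with the key dimension.

import torch

torch.manual_seed(0)
d_in, d_out, seq_len = 8, 4, 5
x = torch.randn(seq_len, d_in)          # token embeddings
W_query = torch.randn(d_in, d_out)
W_key = torch.randn(d_in, d_out)
W_value = torch.randn(d_in, d_out)

queries = x @ W_query                   # (seq_len, d_out)
keys = x @ W_key
values = x @ W_value

attention_scores = queries @ keys.T     # (seq_len, seq_len)
# scale by sqrt(d_k) before the softmax so the dot products stay well-behaved
attention_weights = torch.softmax(attention_scores / keys.shape[-1] ** 0.5, dim=-1)
context = attention_weights @ values    # (seq_len, d_out)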
0 votes
1 answer
39 views

I am currently trying to implement the attention layer from the Transformer architecture, but it is not working as I expect. I have been unable to figure out what the problem is for several days now. ...
RB2k • 1
4 votes
1 answer
307 views

I would like to visualize the attention layer of a Phi-3-medium-4k-instruct (or mini) model downloaded from Hugging Face. In particular, I am using the following model and tokenizer: import torch from ...
Jose Ramon • 5,374
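One way to get per-layer attention maps out of a Hugging Face causal LM is to request them in the forward pass. A minimal sketch, assuming the standard transformers API and the mini variant of the model (the exact model id, prompt, and whether trust_remote_code is needed depend on your transformers version):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"   # assumption: mini variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
# eager attention is needed because fused/SDPA kernels do not return weights
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len)
layer0_head0 = out.attentions[0][0, 0]
print(layer0_head0.shape)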
0 votes
1 answer
293 views

In Appendix B of the PaLM paper (https://arxiv.org/pdf/2204.02311) a metric called "Model FLOPs Utilization (MFU)" is described, along with a formula for estimating it. Its computation makes ...
cangozpi • 159
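For reference, a back-of-the-envelope version of that estimate, following the spirit of PaLM Appendix B: FLOPs per token are approximated as 6N for the dense matmuls plus 12·L·H·Q·T for attention, and MFU is observed throughput divided by the theoretical peak throughput at peak hardware FLOPs. All numbers below are made-up placeholders, not measurements.

# L = layers, H = heads, Q = head dim, T = sequence length, N = parameter count
N = 8e9
L, H, Q, T = 32, 32, 128, 2048
peak_flops = 312e12               # e.g. A100 bf16 peak, per the datasheet
observed_tokens_per_sec = 60_000  # measured training throughput (placeholder)

flops_per_token = 6 * N + 12 * L * H * Q * T
theoretical_tokens_per_sec = peak_flops / flops_per_token
mfu = observed_tokens_per_sec / theoretical_tokens_per_sec
print(f"MFU ~ {mfu:.2%}")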
1 vote
1 answer
322 views

I was developing a self-attention module using PyTorch's nn.MultiheadAttention (MHA). My goal was to implement a causal mask that forces each token to attend only to the tokens before itself, ...
jackjack4468
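A minimal sketch of a causal mask with nn.MultiheadAttention, using the built-in helper that produces an upper-triangular -inf mask (shapes and sizes here are arbitrary):

import torch
import torch.nn as nn

seq_len, d_model, num_heads = 6, 16, 4
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
x = torch.randn(2, seq_len, d_model)

# Float mask with -inf above the diagonal, so position i only attends to j <= i
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

out, weights = mha(x, x, x, attn_mask=causal_mask, need_weights=True)
# The averaged attention weights should be zero above the diagonal
print(weights[0].round(decimals=3))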
2 votes
0 answers
193 views

I am dealing with two embeddings, text and image; both are the last_hidden_state of transformer models (BERT and ViT), so the shapes are (batch, seq, emb_dim). I want to feed text information into the image using ...
m sh • 21
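One common pattern for "feeding text into image" is cross-attention with the image tokens as queries and the text tokens as keys/values. A sketch with placeholder tensors of the shapes described in the question (this is one design choice, not the only one):

import torch
import torch.nn as nn

batch, text_len, img_len, dim = 2, 20, 197, 768
text_hidden = torch.randn(batch, text_len, dim)    # e.g. BERT last_hidden_state
image_hidden = torch.randn(batch, img_len, dim)    # e.g. ViT last_hidden_state

cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

# Each image patch gathers information from the text tokens; the output keeps
# the image sequence length.
fused, attn_weights = cross_attn(query=image_hidden,
                                 key=text_hidden,
                                 value=text_hidden)
print(fused.shape)          # (batch, img_len, dim)
print(attn_weights.shape)   # (batch, img_len, text_len)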
2 votes
2 answers
343 views

Following the multi-headed attention layer in a BERT encoder block, is layer normalization done separately on the embedding of each token (i.e., one mean and variance per token embedding), or on the ...
Fijoy Vadakkumpadan
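In standard BERT-style implementations, layer normalization is applied over the hidden (embedding) dimension, so each token position gets its own mean and variance. A small check with nn.LayerNorm illustrating that:

import torch
import torch.nn as nn

batch, seq_len, hidden = 2, 4, 768
x = torch.randn(batch, seq_len, hidden)

ln = nn.LayerNorm(hidden)   # normalized_shape covers only the last dimension
y = ln(x)

# One mean/variance per token: every (batch, position) slice is normalized
# over the hidden dimension.
print(y.mean(dim=-1)[0])                   # ~0 for each of the 4 tokens
print(y.var(dim=-1, unbiased=False)[0])    # ~1 for each of the 4 tokens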
3 votes
1 answer
589 views

I've been trying to look at the attention scores of a pretrained transformer when I pass specific data in. It's specifically a PyTorch Transformer. I've tried using forward hooks, but I'm only able to ...
Thomas • 31
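Hooks alone often come back empty here because nn.TransformerEncoderLayer calls its internal nn.MultiheadAttention with need_weights=False. One possible workaround is to wrap each layer's self_attn.forward so it always returns per-head weights; this is a hack, sketched below under the assumption of a stock nn.TransformerEncoder kept in training mode (the default) so PyTorch's fused fast path, which bypasses the Python forward, is not taken. Behavior can vary across PyTorch versions.

import torch
import torch.nn as nn
from functools import partial

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

captured = {}

def patched_forward(orig_forward, idx, *args, **kwargs):
    # Override the need_weights=False that TransformerEncoderLayer passes in
    kwargs["need_weights"] = True
    kwargs["average_attn_weights"] = False
    out, weights = orig_forward(*args, **kwargs)
    captured[idx] = weights.detach()
    return out, weights

for i, enc_layer in enumerate(encoder.layers):
    enc_layer.self_attn.forward = partial(patched_forward, enc_layer.self_attn.forward, i)

x = torch.randn(1, 10, 32)
_ = encoder(x)
print(captured[0].shape)   # (batch, nhead, seq_len, seq_len)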
0 votes
1 answer
68 views

Here's an example function that illustrates this problem. It's an attempt to get the Q matrix (for KV caching) of a concatenated QKV matrix (from a transformer large language model). Before the QKV ...
genjong • 137
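A sketch of the Q extraction being described, assuming the fused projection stacks the Q, K, and V blocks along the output dimension (as nn.MultiheadAttention's in_proj_weight and many GPT-style checkpoints do); the Q weights are then just the first d_model rows.

import torch

d_model = 64
x = torch.randn(10, d_model)                  # (seq_len, d_model)
W_qkv = torch.randn(3 * d_model, d_model)     # fused projection, rows = [Q; K; V]
b_qkv = torch.randn(3 * d_model)

qkv = x @ W_qkv.T + b_qkv                     # (seq_len, 3*d_model)
q, k, v = qkv.split(d_model, dim=-1)          # slice the concatenated output

# Equivalent: project with only the Q block of the fused weight, which is what
# a KV-cache-aware decode step would do for the newest token.
W_q, b_q = W_qkv[:d_model], b_qkv[:d_model]
q_direct = x @ W_q.T + b_q
print(torch.allclose(q, q_direct, atol=1e-6))  # True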
0 votes
1 answer
151 views

I was trying to create my own attention function for a project I'm working on. However, when I compared the output and weights from my code with those from torch.nn.MultiheadAttention, I noticed that ...
user26811297
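A frequent reason manual attention disagrees with nn.MultiheadAttention is that the module also applies learned input/output projections and, by default, averages the returned weights over heads. The bare math itself is easy to verify against PyTorch's reference kernel; a sketch with random tensors:

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq, d_head = 2, 4, 6, 8
q = torch.randn(batch, heads, seq, d_head)
k = torch.randn(batch, heads, seq, d_head)
v = torch.randn(batch, heads, seq, d_head)

# Manual scaled dot-product attention
scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
weights = scores.softmax(dim=-1)
manual_out = weights @ v

# PyTorch's reference kernel on the same tensors
ref_out = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(manual_out, ref_out, atol=1e-5))  # True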
1 vote
0 answers
86 views

I am trying to implement a model for sentiment analysis of text data using self-attention. In this example, I am using multi-head attention but cannot be sure whether the results are accurate or not. It ...
phd Mom • 11
1 vote
0 answers
37 views

I'm using Data2Vec from the Hugging Face Hub to extract features from three modalities of a dataset. After processing I have tensors of shape [1,768] for text, [1,499,768] for audio, and [1,197,768] ...
Patrick Wu
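One simple way to bring those sequence outputs down to the same [1, 768] shape as the text vector is mean pooling over the time/patch dimension; this is an assumption about the intended fusion, not the only option (a CLS token or attention pooling would also work).

import torch

# Placeholder tensors with the shapes from the question
text_feat = torch.randn(1, 768)            # already pooled
audio_feat = torch.randn(1, 499, 768)      # (batch, time_steps, hidden)
image_feat = torch.randn(1, 197, 768)      # (batch, patches + CLS, hidden)

# Mean-pool the sequence dimension so every modality ends up as (1, 768)
audio_pooled = audio_feat.mean(dim=1)
image_pooled = image_feat.mean(dim=1)

fused = torch.cat([text_feat, audio_pooled, image_pooled], dim=-1)  # (1, 2304)
print(audio_pooled.shape, image_pooled.shape, fused.shape)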
1 vote
0 answers
129 views

While trying to extract attention from a model, I see that there is a part where the attention changes its shape after matmul() with v (value). The shape goes from: attention_probs 2 shape: torch....
Wassim Jaoui
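That shape change is expected: the attention probabilities are (batch, heads, seq, seq), and multiplying by v contracts the last (key) axis, leaving one context vector of head_dim per query position. A quick shape check with arbitrary sizes:

import torch

batch, heads, seq, head_dim = 2, 12, 16, 64
attention_probs = torch.softmax(torch.randn(batch, heads, seq, seq), dim=-1)
v = torch.randn(batch, heads, seq, head_dim)

# (.., seq, seq) @ (.., seq, head_dim) -> (.., seq, head_dim)
context = attention_probs @ v
print(attention_probs.shape)  # torch.Size([2, 12, 16, 16])
print(context.shape)          # torch.Size([2, 12, 16, 64])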
