
Does it make sense to use a positional encoding in attention when the input tokens do not go through an embedding layer?

In NLP models, an embedding layer maps each word to a vector of real numbers. "hello" might map to [a, b, c]. Then, after adding a positional encoding [e1, e2, e3], the attention layers see [a + e1, b + e2, c + e3]. Since the network has seen this "hello" embedding before, it can separate [e1, e2, e3] from [a, b, c], understanding both the token itself and the token's position.
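For concreteness, here is roughly the additive setup I mean (the token ids, vocabulary size, and dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 8, 100, 32   # toy sizes, not from any particular model

# standard sinusoidal positional encoding (Vaswani et al., 2017)
pos = torch.arange(max_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

embed = nn.Embedding(vocab_size, d_model)   # maps a token id to a vector like [a, b, c, ...]
tokens = torch.tensor([[5, 17, 42]])        # made-up ids for a 3-token sentence
x = embed(tokens) + pe[: tokens.size(1)]    # attention sees the sum, i.e. [a + e1, b + e2, c + e3, ...]
```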

Now imagine we are doing something like detecting particles, where an embedding layer does not make sense. Rather than a set of possible words, the input to the attention layer comes from some continuous domain (like the positions of said particles). Can the attention layer still effectively factor out [e1, e2, e3] when it is added to some vector in $\mathbb{R}^3$? How does it know the value of e1 if a could be any value in $\mathbb{R}$?
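In the continuous case I have in mind something like the following sketch, where the linear projection and all sizes are placeholders I picked just to make the question concrete:

```python
import torch
import torch.nn as nn

def sinusoidal_pe(n, d):
    # same sinusoidal encoding as in the snippet above
    pos = torch.arange(n).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d, 2).float() * (-torch.log(torch.tensor(10000.0)) / d))
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_model, n_particles = 8, 5                  # toy sizes
coords = torch.rand(1, n_particles, 3)       # continuous inputs, e.g. particle positions in R^3

proj = nn.Linear(3, d_model)                 # no vocabulary, so a linear projection instead of an embedding lookup
x = proj(coords) + sinusoidal_pe(n_particles, d_model)
# x is what the attention layers see; the question is whether they can still
# separate the e_i terms when proj(coords) can take almost any value
```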

I know there are some papers that use transformers without embeddings, but do any of them show that the positional encoding amounts to anything more than sinusoidal noise?


1 Answer


It does not make a difference whether the inputs to the transformer are embeddings or vectors of some other kind. My guess is that attention probably can factor out the positional vectors, but I do not think that is the right question. The questions you should ask are:

  • Does the transformer need to know the order/positions of the inputs to do the task well, or is the input an unordered set?

  • If it does, is that positional information already encoded in the input features (e.g. as coordinates)?

  • If the task needs positions and they are not already in the input, adding position embeddings might help; otherwise, probably not.

The setup you describe might be similar to vision-and-language models from NLP, such as UNITER, where continuous image-region representations are used as input to the transformer. These models do not use the traditional additive position embeddings; instead, they concatenate the image-region representations with metadata describing each region's position in the image (x and y coordinates, width, and height).
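Roughly, that concatenation scheme could look like the sketch below. This is not UNITER's actual code, and the feature sizes and variable names are placeholders; it only illustrates feeding position metadata as extra input features instead of adding a position embedding:

```python
import torch
import torch.nn as nn

d_model, n_regions, feat_dim = 8, 4, 16             # toy sizes, not UNITER's real dimensions

region_feats = torch.rand(1, n_regions, feat_dim)   # continuous image-region features
boxes = torch.rand(1, n_regions, 4)                 # x, y, width, height per region

fused = torch.cat([region_feats, boxes], dim=-1)    # concatenate features with position metadata
to_model = nn.Linear(feat_dim + 4, d_model)         # project to the transformer's input size
x = to_model(fused)                                 # no additive position embedding needed here
```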

Pre-trained language models such as BERT use learned position embeddings instead of the original sinusoidal ones. This might also be an option for you.
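A minimal sketch of that option, assuming your continuous inputs have already been projected to the model dimension (sizes are again placeholders):

```python
import torch
import torch.nn as nn

d_model, max_len, seq_len = 8, 32, 5            # toy sizes

pos_embed = nn.Embedding(max_len, d_model)      # learned position embeddings, trained with the rest of the model
inputs = torch.rand(1, seq_len, d_model)        # your (already projected) continuous inputs

positions = torch.arange(seq_len).unsqueeze(0)  # 0, 1, ..., seq_len - 1
x = inputs + pos_embed(positions)               # same additive scheme, but the offsets are learned
```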
