
Does it make sense to use a positional encoding in attention when the input tokens do not go through an embedding layer?

In NLP models, an embedding layer maps each word to a vector of real numbers. "hello" might map to [a, b, c]. Then, after adding a positional encoding [e1, e2, e3], the attention layers see [a + e1, b + e2, c + e3]. Since the network has seen this "hello" embedding before, it can separate [e1, e2, e3] from [a, b, c], understanding both the token itself and the token's position.
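For concreteness, here is roughly the additive setup I mean (the token ids, vocabulary size, and dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 8, 100, 32   # toy sizes, not from any particular model

# standard sinusoidal positional encoding (Vaswani et al., 2017)
pos = torch.arange(max_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

embed = nn.Embedding(vocab_size, d_model)   # maps a token id to a vector like [a, b, c, ...]
tokens = torch.tensor([[5, 17, 42]])        # made-up ids for a 3-token sentence
x = embed(tokens) + pe[: tokens.size(1)]    # attention sees the sum, i.e. [a + e1, b + e2, c + e3, ...]
```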

Now imagine we are doing something like detecting particles, where an embedding layer does not make sense. Rather than a set of possible words, the input to the attention layer comes from some continuous domain (like the positions of said particles). Can the attention layer still effectively factor out [e1, e2, e3] when it is added to some vector in $\mathbb{R}^3$? How does it know the value of e1 if a could be any value in $\mathbb{R}$?
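In the continuous case I have in mind something like the following sketch, where the linear projection and all sizes are placeholders I picked just to make the question concrete:

```python
import torch
import torch.nn as nn

def sinusoidal_pe(n, d):
    # same sinusoidal encoding as in the snippet above
    pos = torch.arange(n).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d, 2).float() * (-torch.log(torch.tensor(10000.0)) / d))
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_model, n_particles = 8, 5                  # toy sizes
coords = torch.rand(1, n_particles, 3)       # continuous inputs, e.g. particle positions in R^3

proj = nn.Linear(3, d_model)                 # no vocabulary, so a linear projection instead of an embedding lookup
x = proj(coords) + sinusoidal_pe(n_particles, d_model)
# x is what the attention layers see; the question is whether they can still
# separate the e_i terms when proj(coords) can take almost any value
```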

I know there are some papers that use transformers without embeddings, but do any of them show that the positional encoding amounts to anything more than sinusoidal noise?


1 Answer


It does not make a difference whether the inputs to the transformer are embeddings or vectors of some other kind. My guess is that attention probably can factor out the positional vectors, but I do not think that is the right question. The questions you should ask are:

  • Does the transformer need to know the order/positions of the inputs to do the task well, or is the input an unordered set?

  • If it does, is that positional information already encoded in the input features (e.g. as coordinates)?

  • If the task needs positions and they are not already in the input, adding position embeddings might help; otherwise, probably not.

The setup you describe might be similar to vision-and-language models from NLP, such as UNITER, where continuous image-region representations are used as input to the transformer. These models do not use the traditional additive position embeddings; instead, they concatenate the image-region representations with metadata describing each region's position in the image (x and y coordinates, width, and height).
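Roughly, that concatenation scheme could look like the sketch below. This is not UNITER's actual code, and the feature sizes and variable names are placeholders; it only illustrates feeding position metadata as extra input features instead of adding a position embedding:

```python
import torch
import torch.nn as nn

d_model, n_regions, feat_dim = 8, 4, 16             # toy sizes, not UNITER's real dimensions

region_feats = torch.rand(1, n_regions, feat_dim)   # continuous image-region features
boxes = torch.rand(1, n_regions, 4)                 # x, y, width, height per region

fused = torch.cat([region_feats, boxes], dim=-1)    # concatenate features with position metadata
to_model = nn.Linear(feat_dim + 4, d_model)         # project to the transformer's input size
x = to_model(fused)                                 # no additive position embedding needed here
```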

Pre-trained language models such as BERT use learned position embeddings instead of the original sinusoidal ones. This might also be an option for you.
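A minimal sketch of that option, assuming your continuous inputs have already been projected to the model dimension (sizes are again placeholders):

```python
import torch
import torch.nn as nn

d_model, max_len, seq_len = 8, 32, 5            # toy sizes

pos_embed = nn.Embedding(max_len, d_model)      # learned position embeddings, trained with the rest of the model
inputs = torch.rand(1, seq_len, d_model)        # your (already projected) continuous inputs

positions = torch.arange(seq_len).unsqueeze(0)  # 0, 1, ..., seq_len - 1
x = inputs + pos_embed(positions)               # same additive scheme, but the offsets are learned
```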
