Does it make sense to use a positional encoding in attention when the input tokens do not go through an embedding layer?
In NLP models, the embedding layer maps each token to a vector of real numbers: "hello" might map to [a, b, c]. Then, after adding the positional encoding [e1, e2, e3], the attention layers see [a + e1, b + e2, c + e3]. Since the network has seen the "hello" embedding many times during training, it can separate [e1, e2, e3] from [a, b, c], recovering both the token's identity and its position.
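For concreteness, here is a minimal sketch of the NLP case I mean (assuming PyTorch; `sinusoidal_pe` is just my own helper implementing the standard sinusoidal encoding):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(seq_len, d_model):
    # Standard sinusoidal positional encoding from "Attention Is All You Need".
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

vocab_size, d_model, seq_len = 10_000, 64, 8
embed = nn.Embedding(vocab_size, d_model)              # learned lookup table
token_ids = torch.randint(0, vocab_size, (seq_len,))
x = embed(token_ids) + sinusoidal_pe(seq_len, d_model)  # what the attention layers see
```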
Now imagine we are doing something like detecting particles, where an embedding layer does not make sense. Rather than a set of possible words, the input to the attention layer comes from a continuous domain (say, the positions of the particles). Can the attention layer still effectively factor out [e1, e2, e3] when it is added to an arbitrary vector in $\mathbb{R}^3$? How does it know the value of e1 if a could be any value in $\mathbb{R}$?
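Continuing the snippet above (so the imports and `sinusoidal_pe` carry over), the continuous-input case I have in mind looks roughly like this, with a hypothetical linear projection in place of the embedding lookup:

```python
# Hypothetical particle setup: each "token" is a raw 3D position, no vocabulary.
n_particles, d_model = 8, 64
particle_xyz = torch.rand(n_particles, 3)       # continuous inputs, not token ids

proj = nn.Linear(3, d_model)                    # lift R^3 features to model width
x = proj(particle_xyz) + sinusoidal_pe(n_particles, d_model)

attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
out, _ = attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))
```

Here the values being added to the positional encoding are not drawn from a fixed, finite set the network has memorized, which is exactly what makes me doubt the encoding can be disentangled.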
I know there are some papers that use transformers without an embedding layer, but do any of them show that the positional encoding ends up being anything more than sinusoidal noise?