🚨 [v5] Remove relative position embeddings (for bert like models)#41170
Conversation
|
run-slow: flava, instructblib, mra |
There was a problem hiding this comment.
This is mostly due to me forgetting to update them in my bert refactor PR --> big diff because the whole refactor is included (same for the roberta example)
There was a problem hiding this comment.
Updated: Only includes the changes here now
|
This comment contains run-slow, running the specified jobs: models: ['models/flava', 'models/mra'] |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
run-slow: instructblip |
|
This comment contains run-slow, running the specified jobs: models: ['models/instructblip'] |
6046d27 to
0dbd18b
Compare
|
run-slow: bert, roberta, albert, mra, instructblip, blip_2, flava |
|
This comment contains run-slow, running the specified jobs: models: ['models/albert', 'models/bert', 'models/blip_2', 'models/flava', 'models/instructblip', 'models/mra', 'models/roberta'] |
|
Failing slow tests are the same as in main 👀 |
zucchini-nlp
left a comment
There was a problem hiding this comment.
Thanks, super nice clean-up! 🧼
| return embeddings | ||
|
|
||
|
|
||
| def eager_attention_forward( |
There was a problem hiding this comment.
i think now we can copy bert from llama or another big model group? 👀 Keeping less sources of truth makes it easier to submit PRs
There was a problem hiding this comment.
Let me make a follow-up PR for that, would like to sync bert and bart instead tho since llama would indicate causal masks which is not the case here + unnecessary gqa dependency from llama
| if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": | ||
| seq_length = hidden_states.size()[1] | ||
| position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1) | ||
| position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1) | ||
| distance = position_ids_l - position_ids_r | ||
| positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1) | ||
| positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility | ||
|
|
||
| if self.position_embedding_type == "relative_key": | ||
| relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) | ||
| attention_scores = attention_scores + relative_position_scores | ||
| elif self.position_embedding_type == "relative_key_query": | ||
| relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) | ||
| relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding) | ||
| attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key | ||
|
|
There was a problem hiding this comment.
happy to see it, I assumed that BLIP models use relative positions haha. Now it can support attention implementation API in qformer 🙌🏻
There was a problem hiding this comment.
yea, it's honestly a bit baffling how many models have this while not using it at all 👀
|
[For maintainers] Suggested jobs to run (before merge) run-slow: albert, align, altclip, bert, bert_generation, big_bird, blip, blip_2, bridgetower, bros, camembert, canine, chinese_clip, clap, data2vec, dpr |
|
Merging as nothing has changed, just forgot to merge 😅 |
…uggingface#41170) * remove from modeling files * remaining changes * style / copies * revert deprecated models and fixup some models * oops
These embedding types are barely used and make the modeling files just more complex without justifying their existence. Position embedding types still exist in a few models; this PR just addresses the
relative_key(_query)ones.Some stats:
cc @hmellor this should remove any clashes with the kwargs you encountered in vLLM :D