Add JinaBERT model #35320
Conversation
```diff
@@ -147,6 +147,7 @@
         ("instructblipvideo", "InstructBlipVideoConfig"),
         ("jamba", "JambaConfig"),
         ("jetmoe", "JetMoeConfig"),
+        ("jina_bert", "JinaBertConfig"),
```
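For context, the auto classes resolve a checkpoint's `model_type` string to its config class through an ordered name mapping like the one being patched here. A simplified sketch of that lookup (not the actual transformers code, just the shape of the mechanism):

```python
from collections import OrderedDict

# Simplified stand-in for transformers' CONFIG_MAPPING_NAMES: the
# model_type string from a checkpoint's config.json is the key, the
# config class name is the value. The diff above adds the jina_bert entry.
CONFIG_MAPPING_NAMES = OrderedDict(
    [
        ("jamba", "JambaConfig"),
        ("jetmoe", "JetMoeConfig"),
        ("jina_bert", "JinaBertConfig"),  # entry added in this PR
    ]
)


def config_class_for(model_type: str) -> str:
    # AutoConfig.from_pretrained does (roughly) this lookup, then imports
    # the named class from the model's configuration module.
    try:
        return CONFIG_MAPPING_NAMES[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type!r}")


print(config_class_for("jina_bert"))  # JinaBertConfig
```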
Looking at the naming of other models, we could rename this to `jinabert`?
I followed this guide. I didn't delete any of the files generated in the first step, so I think there is a lot to clean up. Moreover, I left some TODOs.
I've updated the PR where I was somewhat certain. I would still need help, especially regarding the tests and other checks. One thing I've noticed is that quite a few tests fail with:

```python
class JinaBertLMPredictionHead(BertLMPredictionHead):
    def _tie_weights(self):
        raise AttributeError("Not needed for JinaBert")
```

This override prevents the error `TypeError: unsupported operand type(s) for +: 'Tensor' and 'NoneType'`, which occurred on line 251 of modeling_jina_bert.py in `attention_probs = nn.functional.softmax(attention_scores + bias, dim=-1)`.
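To illustrate the crash: it comes from adding a `None` bias to the attention scores. A common guard (sketched here with NumPy instead of torch, and not necessarily the PR's actual fix) is to only add the bias when it is present:

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_probs(scores, bias=None):
    # Adding a None bias is exactly what raises
    # "TypeError: unsupported operand type(s) for +" in the modeling code,
    # so guard the addition instead of assuming the bias exists.
    if bias is not None:
        scores = scores + bias
    return softmax(scores, axis=-1)


scores = np.array([[0.1, 0.2, 0.3]])
probs_no_bias = attention_probs(scores)  # works even with bias=None
probs_zero_bias = attention_probs(scores, np.zeros_like(scores))
```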
Thanks! It would also be helpful to add a code snippet in the docstrings showing how to generate the embeddings.
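A minimal sketch of the kind of snippet being asked for. The pooling step is shown on dummy tensors so it is self-contained; mean pooling is an assumption here, and the actual model's `encode` path may pool differently:

```python
import numpy as np


def mean_pool(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padding positions, then
    # L2-normalize - a common way to turn BERT-style token states
    # into a sentence embedding.
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1e-9)
    emb = summed / counts
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)


# Dummy stand-ins for model(**tokenizer(...)).last_hidden_state and the
# tokenizer's attention_mask.
rng = np.random.default_rng(0)
hidden = rng.random((2, 4, 8))                 # (batch, seq, hidden)
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # 0 marks padding
embeddings = mean_pool(hidden, mask)           # (2, 8), unit-norm rows
```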
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Thanks for your review! I took the docs from the original repository but still applied your suggestions.
Hey, sorry for the delay! Having a look in a bit!
Wow, the "in a bit" became 3 months...
My main comment is that the tokenizer should not be in the modeling file; this never happens in transformers!
Apart from that, for the modular file a lot of blocks seem very similar, so inheritance should help you!
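To make the inheritance suggestion concrete, here is a purely illustrative sketch (class and method names are hypothetical, not the actual transformers modular API): subclass the BERT block and override only the piece that differs, instead of copying the whole implementation.

```python
class BertLayerSketch:
    """Stand-in for a BERT-style layer in a modular file."""

    def attention_bias(self, seq_len):
        # Plain BERT adds no extra attention bias.
        return None

    def forward(self, seq_len):
        # Shared logic lives once, here; only the bias hook differs.
        bias = self.attention_bias(seq_len)
        return f"attend(scores{'' if bias is None else ' + bias'})"


class JinaBertLayerSketch(BertLayerSketch):
    # Only the differing piece is overridden; everything else is inherited.
    def attention_bias(self, seq_len):
        # Illustrative distance-based bias (ALiBi-like shape, values arbitrary).
        return [[-abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
```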
What does this PR do?
This PR adds JinaBERT to transformers. This enables running `jinaai/jina-embeddings-v2-base-code` without `trust_remote_code`.
Relevant issue and discussion is here: #27035.
Note that there are two implementations in use for Jina Embeddings v2:
- This PR covers `jinaai/jina-embeddings-v2-base-code`, which uses this implementation.
- Additionally, there is `jinaai/jina-embeddings-v2-base-en` (and variants `small-en`, `base-zh`, `base-de`, `base-es`), which use a different implementation.

I'm not sure if we can make a single JinaBERT implementation that works for both `base-code` and `base-en`. Otherwise, we would probably want to have two JinaBERT implementations, for instance `JinaBERT` for `base-en` and its variants and `JinaBERTv2` for `base-code`.
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Tests
Quite a few of the generated tests fail, and I'd need help there to judge what we need or what must be updated.
I've updated the `test_inference_no_head_absolute_embedding` integration test to assert on the output that I get from the original implementation. Moreover, I added a `test_encode` integration test to assert that we get the same results as in the example provided by Jina AI.
Who can review?
@ArthurZucker
@bwanglzu might be interested too