
Add JinaBERT model #35320


Draft · wants to merge 27 commits into main
Conversation

@joelpaulkoch (Contributor) commented Dec 18, 2024

What does this PR do?

This PR adds JinaBERT to transformers.
This enables running jinaai/jina-embeddings-v2-base-code without trust_remote_code.
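For context, a minimal sketch (not part of the PR) of what this enables, assuming the checkpoint resolves to the new in-library JinaBERT classes via the auto mappings:

import torch
from transformers import AutoModel, AutoTokenizer

# Today this checkpoint needs trust_remote_code=True; once JinaBERT is part of
# transformers, the plain calls below should be enough.
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-code")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-base-code")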

The relevant issue and discussion can be found here: #27035.

Note that there are two implementations in use for Jina Embeddings v2:
This PR covers jinaai/jina-embeddings-v2-base-code, which uses this implementation.
Additionally, there is jinaai/jina-embeddings-v2-base-en (and the variants small-en, base-zh, base-de, base-es), which uses a different implementation.

I'm not sure whether we can make a single JinaBERT implementation that works for both base-code and base-en.
If not, we would probably want two JinaBERT implementations, for instance JinaBERT for base-en and its variants and JinaBERTv2 for base-code.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Tests

Quite a few of the generated tests fail, and I'd need help there to judge what we need or what must be updated.

I've updated the test_inference_no_head_absolute_embedding integration test to assert on the output that I get from the original implementation.
Moreover, I added a test_encode integration test to assert that we get the same results as in the example provided by Jina AI.
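Roughly, such an integration test could look like the sketch below. This is illustrative only: the test name test_encode and model id come from the PR, but the example sentence, the mean-pooling step, and the shape check are assumptions, and the real test would compare against reference values from the original implementation rather than the shape alone.

import torch
from transformers import AutoModel, AutoTokenizer


def test_encode():
    model_id = "jinaai/jina-embeddings-v2-base-code"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)

    inputs = tokenizer(
        ["How do I access the index in a for loop?"],
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state

    # Mean pooling over non-padding tokens, as in the Jina AI usage example.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

    assert embeddings.shape == (1, model.config.hidden_size)
    # The real test would assert a slice of `embeddings` against reference values
    # produced by the original (trust_remote_code) implementation, e.g.
    # torch.testing.assert_close(embeddings[0, :3], expected_slice, atol=1e-4, rtol=1e-4)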

Who can review?

@ArthurZucker

@bwanglzu might be interested too

@@ -147,6 +147,7 @@
("instructblipvideo", "InstructBlipVideoConfig"),
("jamba", "JambaConfig"),
("jetmoe", "JetMoeConfig"),
("jina_bert", "JinaBertConfig"),
@joelpaulkoch (Contributor, Author) commented on this diff:
Looking at the naming of other models, we could rename this to jinabert?

@joelpaulkoch (Contributor, Author):
I followed this guide and first ran transformers-cli add-new-model-like, then created modular_jina_bert.py to structure the code according to the modular transformers concept.

I didn't delete any of the files generated in the first step, so I think there is a lot to clean up.

Also, modular_jina_bert.py is mostly copy-paste from the original implementation.
I only included functions that changed in comparison to BERT.
There are probably things that could be improved or removed.

Moreover, I left some TODOs in modular_jina_bert.py for open points.
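For readers unfamiliar with the modular concept, a very rough sketch of the shape of such a file follows. This is not the actual content of modular_jina_bert.py; the class bodies and the ALiBi-style bias mention are assumptions, and only the idea of inheriting from the BERT classes and redefining what differs is what matters here.

from transformers.models.bert.configuration_bert import BertConfig
from transformers.models.bert.modeling_bert import BertModel, BertSelfAttention


class JinaBertConfig(BertConfig):
    model_type = "jina_bert"


class JinaBertSelfAttention(BertSelfAttention):
    # Only the pieces that differ from BERT (e.g. an ALiBi-style attention bias)
    # would be redefined here; everything else is inherited, and the full
    # modeling_jina_bert.py is generated from this modular file.
    ...


class JinaBertModel(BertModel):
    config_class = JinaBertConfig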

@joelpaulkoch (Contributor, Author) commented Jan 3, 2025

I've updated the PR where I was somewhat certain. I would still need help, especially regarding the tests and other checks.

One thing I've noticed is that quite a few tests fail with AttributeError: Not needed for JinaBert, which is a result of what I did here (following the modular transformers guide, since JinaBertLMPredictionHead does not define _tie_weights but BertLMPredictionHead does):

class JinaBertLMPredictionHead(BertLMPredictionHead):
    def _tie_weights(self):
        # Per the modular transformers guide, raising AttributeError is how an
        # inherited method is marked as removed in the generated modeling file.
        raise AttributeError("Not needed for JinaBert")

@joelpaulkoch joelpaulkoch marked this pull request as ready for review January 10, 2025 10:03
@joelpaulkoch joelpaulkoch marked this pull request as draft January 10, 2025 10:05
Prevents this error: TypeError: unsupported operand type(s) for +: 'Tensor' and 'NoneType'

This occurred on line 251 in modeling_jina_bert.py:
`attention_probs = nn.functional.softmax(attention_scores + bias, dim=-1)`
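For illustration, the kind of guard that avoids this error is sketched below; the helper name softmax_with_optional_bias is hypothetical and the PR's actual change may look different.

from typing import Optional

import torch
from torch import nn


def softmax_with_optional_bias(attention_scores: torch.Tensor, bias: Optional[torch.Tensor]) -> torch.Tensor:
    # Only add the (ALiBi-style) bias when it is actually provided, so a missing
    # bias no longer triggers `Tensor + None` at the softmax call.
    if bias is not None:
        attention_scores = attention_scores + bias
    return nn.functional.softmax(attention_scores, dim=-1)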
@stevhliu (Member) left a comment:

Thanks! It would also be helpful to add a code snippet in the docstrings showing how to generate the embeddings

joelpaulkoch and others added 4 commits January 28, 2025 20:35
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@joelpaulkoch (Contributor, Author):

Thanks for your review! I took the docs from the original repository but still applied your suggestions.
I will add a snippet to the docstrings.

@ArthurZucker (Collaborator):

Hey, sorry for the delay! Having a look in a bit!

@ArthurZucker (Collaborator) left a comment:

Wow, the "in a bit" became 3 months...
My main comment is that the tokenizer should not be in the modeling file; this never happens in transformers!

Apart from that, for the modular file a lot of blocks seem to be very similar, so inheritance should help you!
