
Add support for BERT embedding models #5423


Merged: 21 commits, merged on Feb 11, 2024
Changes from 1 commit: "add in wordpiece tokenizer"
iamlemec committed Feb 8, 2024
commit 59c1829b0c39a3b74817c9a14b7deb0c74f0ad5e
31 changes: 21 additions & 10 deletions convert-hf-to-gguf.py
@@ -1594,26 +1594,37 @@ def set_vocab(self):
         path = self.dir_model
         added_tokens_path = self.dir_model if self.dir_model.exists() else None

         # use huggingface vocab to get all tokens
         vocab = HfVocab(path, added_tokens_path)
         tokens, scores, toktypes = zip(*vocab.all_tokens())

         assert len(tokens) == vocab.vocab_size

-        # for some reason set(toktypes) = {1, 3} so we need to compress it
-        all_types, toktypes1 = np.unique(toktypes, return_inverse=True)
-        n_token_types, toktypes1 = len(all_types), toktypes1.tolist()
+        # we need this to validate the size of the token_type embeddings
+        # though currently we are passing all zeros to the token_type embeddings
+        n_token_types = len(set(toktypes))
         self.gguf_writer.add_uint32("tokenizer.ggml.token_type_count", n_token_types)

-        # convert tokens to SPM style
-        tokens = [
-            (t[2:] if t.startswith(b"##") else b"\xe2\x96\x81" + t) for t in tokens
-        ]
+        # convert to phantom space vocab
+        def phantom(tok, typ):
+            if tok.startswith(b'[') and tok.endswith(b']'):
+                return tok
+            elif tok.startswith(b"##"):
+                return tok[2:]
+            else:
+                return b"\xe2\x96\x81" + tok
+        tokens = [phantom(t, y) for t, y in zip(tokens, toktypes)]

-        self.gguf_writer.add_tokenizer_model("llama")
+        # set up bos and eos tokens (cls and sep)
+        self.gguf_writer.add_bos_token_id(vocab.tokenizer.cls_token_id)
+        self.gguf_writer.add_eos_token_id(vocab.tokenizer.sep_token_id)
+
+        # add vocab to gguf
+        self.gguf_writer.add_tokenizer_model("bert")
         self.gguf_writer.add_token_list(tokens)
         self.gguf_writer.add_token_scores(scores)
-        self.gguf_writer.add_token_types(toktypes) # ignore types for now (all zero)
+        self.gguf_writer.add_token_types(toktypes)

         # handle special tokens
         special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
         special_vocab.add_to_gguf(self.gguf_writer)
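For context on the vocab change above: llama.cpp's SPM-style tokenizers mark word-initial pieces with the "phantom space" character U+2581 (UTF-8 bytes `\xe2\x96\x81`), while BERT's WordPiece vocab instead marks word-internal continuations with a `##` prefix. The new `phantom` helper bridges the two conventions. A minimal standalone sketch of the same mapping (it drops the unused `typ` parameter, and the sample tokens are illustrative, not taken from the PR):

```python
# Sketch of the WordPiece -> phantom-space mapping used in the diff above.
# b"\xe2\x96\x81" is UTF-8 for U+2581, the "phantom space" that SPM-style
# vocabs prepend to word-initial tokens.

def phantom(tok: bytes) -> bytes:
    if tok.startswith(b"[") and tok.endswith(b"]"):
        return tok                    # special tokens ([CLS], [SEP], ...) pass through
    elif tok.startswith(b"##"):
        return tok[2:]                # continuation piece: strip the WordPiece marker
    else:
        return b"\xe2\x96\x81" + tok  # word-initial piece: add the phantom space

# illustrative WordPiece tokenization of "embedding" -> "em", "##bed", "##ding"
wordpiece = [b"[CLS]", b"em", b"##bed", b"##ding", b"[SEP]"]
print([phantom(t) for t in wordpiece])
# [b'[CLS]', b'\xe2\x96\x81em', b'bed', b'ding', b'[SEP]']
```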
12 changes: 11 additions & 1 deletion examples/embedding/embedding.cpp
@@ -87,7 +87,17 @@ int main(int argc, char ** argv) {
     }

     const int n_embd = llama_n_embd(model);
-    const auto * embeddings = llama_get_embeddings(ctx);
+    auto * embeddings = llama_get_embeddings(ctx);
+
+    // l2-normalize embeddings
+    float norm = 0;
+    for (int i = 0; i < n_embd; i++) {
+        norm += embeddings[i] * embeddings[i];
+    }
+    norm = sqrt(norm);
+    for (int i = 0; i < n_embd; i++) {
+        embeddings[i] /= norm;
+    }

     for (int i = 0; i < n_embd; i++) {
         printf("%f ", embeddings[i]);
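A note on the normalization added above: dividing by the Euclidean (L2) norm gives each embedding unit length, so a plain dot product between two normalized embeddings equals their cosine similarity. A small numpy sketch of the same idea (the vectors are illustrative, not from the PR):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale v to unit Euclidean length, mirroring the loop in embedding.cpp."""
    return v / np.linalg.norm(v)

# two illustrative embedding vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_normalized = np.dot(l2_normalize(a), l2_normalize(b))
assert np.isclose(cosine, dot_normalized)  # identical up to rounding
```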