NLPOptimize/awesome-tokenizers

awesome_tokenizers

Repositories marked with 🔥 are tokenizers that are significantly faster than the others listed.

🔹 WordPiece Tokenizer Implementations

  • 🔥 FlashTokenizer (C++/Python)
    • Billed by the project as the world's fastest CPU tokenizer library.

🔹 BPE (Byte Pair Encoding) Implementations

  • OpenAI TikToken (Rust/Python)
    • Official BPE tokenizer from OpenAI (used in GPT models), highly optimized.
  • huggingface/tokenizers (Rust/Python)
    • General-purpose tokenizer library from Hugging Face, with BPE among its supported algorithms.
  • bpe-tokenizer (Rust)
    • Rust BPE tokenizer library that efficiently identifies frequent symbol pairs.
  • YouTokenToMe (C++/Python)
    • Efficient BPE tokenizer with fast training and inference, developed by VK.com.
  • 🔥 fastBPE (C++/Python)
    • Facebook’s fast and memory-efficient BPE tokenizer, widely used in NLP research.
  • 🔥 sentencepiece (C++/Python)
    • Google's SentencePiece also provides BPE as one of its algorithms.
  • Subword-nmt (Python)
    • Python implementation commonly used in machine translation research; simple, but slower than the other libraries listed here.
  • 🔥 rs-bpe (Rust)
    • A ridiculously fast BPE (byte pair encoding) implementation for Python, written in Rust.
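
For readers new to BPE, the idea of "identifying frequent pairs" mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the training loop, assuming whitespace-split words and character-level starting symbols; it is not how any of the listed libraries is implemented, and the names `count_pairs`, `merge_pair`, and `learn_bpe` are hypothetical.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Return the list of merges learned from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges
```

Calling `learn_bpe("aab aab aab ab", 2)` first merges `a` and `b`, then `a` and `ab`. The libraries above typically differ in how they make this loop fast (for example, by avoiding a full recount of pairs after every merge) rather than in the basic idea.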

🔹 SentencePiece Implementations


Contributing

Your contributions are always welcome! Please take a look at the contribution guidelines first.

Questions

If you have any questions, please send a message directly via WeChat, LINE, or Telegram using the links below.

💬 LINE

💬 Telegram

💬 WeChat

About

A curated list of tokenizer libraries for blazing-fast NLP processing.
