
Repositories marked with the 🔥 symbol are tokenizers that are significantly faster than the others.
- 🔥 FlashTokenizer (C++/Python)
  - The world's fastest CPU tokenizer library!
- huggingface/tokenizers (Rust/Python)
  - Official Hugging Face tokenizer; fast Rust implementation with Python bindings (see the usage sketch below).
- 🔥 FastBertTokenizer (C#)
  - Highly optimized for speed; reduced accuracy on non-English inputs.
- BertTokenizers (C#)
  - Microsoft's original C# tokenizer implementation (slower than FastBertTokenizer).
- 🔥 rust-tokenizers (Rust/Python)
  - Rust tokenizer library; faster than pure Python but slower than BlingFire or FlashTokenizer.
- tokenizers-cpp (C++)
  - Wrapper around SentencePiece and Hugging Face's tokenizers; not a standalone implementation.
- bertTokenizer (Java)
  - Java-based BERT tokenizer implementation.
- ZhuoruLin/fast-wordpiece (Rust)
  - Rust implementation using LinMaxMatching; likely comparable to or slower than optimized C++ versions.
- huggingface_tokenizer_cpp (C++)
  - Naive pure C++ implementation; slow performance.
- SeanLee97/BertWordPieceTokenizer.jl (Julia)
  - Julia implementation, not widely benchmarked.
- 🔥 BlingFire (C++/Python)
  - Microsoft's high-speed tokenizer optimized for batch processing, with Python bindings.
- tensorflow-text WordpieceTokenizer (C++/Python)
  - Google's WordPiece tokenizer integrated into TensorFlow Text, optimized for use in TensorFlow pipelines.
- transformers BertTokenizer (Python)
  - Hugging Face's pure Python implementation; easy to use but slower than the Rust-backed tokenizers (see the comparison sketch below).
- Deep Java Library (DJL) BertTokenizer (Java)
  - Amazon's Java implementation, integrated into the DJL framework.
- tokenizers.net (C#)
  - .NET binding of Hugging Face tokenizers, optimized for .NET runtimes.
- Tokenizers.jl (Julia)
  - Julia tokenizer library inspired by the Hugging Face implementations.
- fast-bert-tokenizer-py (Python/Cython)
  - Python tokenizer accelerated with Cython.
- ml-commons/tokenizer (C++)
  - High-performance C++ tokenizer supporting WordPiece and other algorithms.
- OpenAI TikToken (Rust/Python)
  - Official BPE tokenizer from OpenAI (used in GPT models), highly optimized (see the BPE encoding sketch below).
- huggingface/tokenizers (Rust/Python)
  - General-purpose tokenizer from Hugging Face with BPE support.
- bpe-tokenizer (Rust)
  - Rust BPE tokenizer library built around efficient frequent-pair merging.
- YouTokenToMe (C++/Python)
  - Efficient BPE tokenizer with fast training and inference, developed by VK.com.
- 🔥 fastBPE (C++/Python)
  - Facebook's fast and memory-efficient BPE tokenizer, widely used in NLP research.
- 🔥 sentencepiece (C++/Python)
  - Google's SentencePiece also provides BPE as one of its algorithms.
- subword-nmt (Python)
  - Python implementation commonly used in machine translation research; simple but slower.
- 🔥 rs-bpe (Rust)
  - A very fast BPE (Byte Pair Encoding) implementation written in Rust, exposed to Python.
- google/sentencepiece (C++/Python)
  - Google's official, language-independent subword tokenizer for neural text processing (see the training/encoding sketch below).
- sentencepiece-rs (Rust)
  - Rust binding for Google's SentencePiece.
- huggingface/tokenizers (Rust/Python)
  - Hugging Face tokenizer library supporting SentencePiece.
- TensorFlow Text SentencepieceTokenizer (C++/Python)
  - TensorFlow Text's SentencePiece tokenizer, optimized for TensorFlow environments.
- sentencepiece.NET (C#)
  - .NET binding for the SentencePiece tokenizer.
- sentencepiece-jni (Java)
  - JNI bindings for Google's SentencePiece tokenizer for Java applications.
- sentencepiece-swift (Swift)
  - Swift bindings for Google's SentencePiece tokenizer.
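As a quick usage sketch for the Rust-backed huggingface/tokenizers library listed above, the snippet below loads a pretrained WordPiece vocabulary and encodes a sentence. The model name `bert-base-uncased` and the example sentence are placeholders, not part of this list.

```python
from tokenizers import Tokenizer

# Load a pretrained WordPiece tokenizer from the Hugging Face Hub
# ("bert-base-uncased" is only an example model name).
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer.encode("FlashTokenizer makes BERT inference faster.")
print(encoding.tokens)  # WordPiece strings
print(encoding.ids)     # integer token ids
```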
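For comparison, here is the same kind of encoding done with the pure Python `transformers` BertTokenizer mentioned above; it produces equivalent ids but is noticeably slower on large batches. Again, the model name and input text are placeholders.

```python
from transformers import BertTokenizer

# Pure Python WordPiece tokenizer ("bert-base-uncased" is only an example).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["FlashTokenizer makes BERT inference faster."],
    padding=True,
    truncation=True,
    max_length=128,
)
print(batch["input_ids"])
print(tokenizer.tokenize("FlashTokenizer makes BERT inference faster."))
```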
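A minimal BPE encoding sketch for OpenAI's tiktoken listed above; `cl100k_base` is one of the published encoding names and the input string is arbitrary.

```python
import tiktoken

# "cl100k_base" is the BPE encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Byte Pair Encoding splits text into subword units.")
print(ids)
print(enc.decode(ids))  # round-trips back to the original string
```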
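Finally, a training/encoding sketch for google/sentencepiece. The corpus file `corpus.txt`, the model prefix, and the vocabulary size are placeholder values you would replace with your own.

```python
import sentencepiece as spm

# Train a small BPE model; corpus.txt is a placeholder for your own text file,
# and vocab_size may need to be lowered for very small corpora.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_demo",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("SentencePiece works directly on raw text.", out_type=str))  # pieces
print(sp.encode("SentencePiece works directly on raw text."))                # ids
```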
Your contributions are always welcome! Please take a look at the contribution guidelines first.
Also, if you have any questions, please send a message directly via WeChat, LINE, or Telegram below.