
Repositories marked with the 🔥 symbol are tokenizers that are significantly faster than the others.
- 🔥 FlashTokenizer (C++/Python)
  - The world's fastest CPU tokenizer library!
- huggingface/tokenizers (Rust/Python)
  - Official Hugging Face tokenizer; fast Rust implementation with Python bindings (see the usage sketch below).
- 🔥 FastBertTokenizer (C#)
  - Highly optimized for speed; reduced accuracy on non-English inputs.
- BertTokenizers (C#)
  - Microsoft's original C# tokenizer implementation (slower than FastBertTokenizer).
- 🔥 rust-tokenizers (Rust/Python)
  - Rust tokenizer library; faster than pure Python but slower than BlingFire or FlashTokenizer.
- tokenizers-cpp (C++)
  - Wrapper around SentencePiece and Hugging Face's tokenizers; not a standalone implementation.
- bertTokenizer (Java)
  - Java-based BERT tokenizer implementation.
- ZhuoruLin/fast-wordpiece (Rust)
  - Rust implementation using LinMaxMatching; likely comparable to or slower than optimized C++ versions.
- huggingface_tokenizer_cpp (C++)
  - Naive pure C++ implementation; slow performance.
- SeanLee97/BertWordPieceTokenizer.jl (Julia)
  - Julia implementation, not widely benchmarked.
- 🔥 BlingFire (C++/Python)
  - Microsoft's high-speed tokenizer optimized for batch processing, with Python bindings.
- tensorflow-text WordpieceTokenizer (C++/Python)
  - Google's WordPiece tokenizer integrated into TensorFlow Text, optimized for use in TensorFlow pipelines.
- transformers BertTokenizer (Python)
  - Hugging Face's pure Python implementation; easy to use but slower than the Rust-backed tokenizers (see the comparison sketch below).
- Deep Java Library (DJL) BertTokenizer (Java)
  - Amazon's Java implementation, integrated into the DJL framework.
- tokenizers.net (C#)
  - .NET binding of Hugging Face tokenizers, optimized for .NET runtimes.
- Tokenizers.jl (Julia)
  - Julia tokenizer library inspired by the Hugging Face implementations.
- fast-bert-tokenizer-py (Python/Cython)
  - Python tokenizer accelerated with Cython.
- ml-commons/tokenizer (C++)
  - High-performance C++ tokenizer supporting WordPiece and other algorithms.
- OpenAI TikToken (Rust/Python)
  - Official BPE tokenizer from OpenAI (used in GPT models), highly optimized (see the BPE encoding sketch below).
- huggingface/tokenizers (Rust/Python)
  - General-purpose tokenizer from Hugging Face with BPE support.
- bpe-tokenizer (Rust)
  - Rust BPE tokenizer library built around efficient frequent-pair merging.
- YouTokenToMe (C++/Python)
  - Efficient BPE tokenizer with fast training and inference, developed by VK.com.
- 🔥 fastBPE (C++/Python)
  - Facebook's fast and memory-efficient BPE tokenizer, widely used in NLP research.
- 🔥 sentencepiece (C++/Python)
  - Google's SentencePiece also provides BPE as one of its algorithms.
- subword-nmt (Python)
  - Python implementation commonly used in machine translation research; simple but slower.
- 🔥 rs-bpe (Rust)
  - A very fast BPE (Byte Pair Encoding) implementation written in Rust, exposed to Python.
- google/sentencepiece (C++/Python)
  - Google's official, language-independent subword tokenizer for neural text processing (see the training/encoding sketch below).
- sentencepiece-rs (Rust)
  - Rust binding for Google's SentencePiece.
- huggingface/tokenizers (Rust/Python)
  - Hugging Face tokenizer library supporting SentencePiece.
- TensorFlow Text SentencepieceTokenizer (C++/Python)
  - TensorFlow Text's SentencePiece tokenizer, optimized for TensorFlow environments.
- sentencepiece.NET (C#)
  - .NET binding for the SentencePiece tokenizer.
- sentencepiece-jni (Java)
  - JNI bindings for Google's SentencePiece tokenizer for Java applications.
- sentencepiece-swift (Swift)
  - Swift bindings for Google's SentencePiece tokenizer.
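As a quick usage sketch for the Rust-backed huggingface/tokenizers library listed above, the snippet below loads a pretrained WordPiece vocabulary and encodes a sentence. The model name `bert-base-uncased` and the example sentence are placeholders, not part of this list.

```python
from tokenizers import Tokenizer

# Load a pretrained WordPiece tokenizer from the Hugging Face Hub
# ("bert-base-uncased" is only an example model name).
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer.encode("FlashTokenizer makes BERT inference faster.")
print(encoding.tokens)  # WordPiece strings
print(encoding.ids)     # integer token ids
```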
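For comparison, here is the same kind of encoding done with the pure Python `transformers` BertTokenizer mentioned above; it produces equivalent ids but is noticeably slower on large batches. Again, the model name and input text are placeholders.

```python
from transformers import BertTokenizer

# Pure Python WordPiece tokenizer ("bert-base-uncased" is only an example).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["FlashTokenizer makes BERT inference faster."],
    padding=True,
    truncation=True,
    max_length=128,
)
print(batch["input_ids"])
print(tokenizer.tokenize("FlashTokenizer makes BERT inference faster."))
```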
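A minimal BPE encoding sketch for OpenAI's tiktoken listed above; `cl100k_base` is one of the published encoding names and the input string is arbitrary.

```python
import tiktoken

# "cl100k_base" is the BPE encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Byte Pair Encoding splits text into subword units.")
print(ids)
print(enc.decode(ids))  # round-trips back to the original string
```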
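Finally, a training/encoding sketch for google/sentencepiece. The corpus file `corpus.txt`, the model prefix, and the vocabulary size are placeholder values you would replace with your own.

```python
import sentencepiece as spm

# Train a small BPE model; corpus.txt is a placeholder for your own text file,
# and vocab_size may need to be lowered for very small corpora.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_demo",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("SentencePiece works directly on raw text.", out_type=str))  # pieces
print(sp.encode("SentencePiece works directly on raw text."))                # ids
```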
Your contributions are always welcome! Please take a look at the contribution guidelines first.
Also, if you have any questions, please send a message directly via WeChat, LINE, or Telegram below.