#tokenizer #token #counting #fun #different #count #text

app token_trekker_rs

A fun and efficient Rust library to count tokens in text files using different tokenizers

4 releases

0.1.3 Mar 22, 2023
0.1.2 Mar 22, 2023
0.1.1 Mar 22, 2023
0.1.0 Mar 22, 2023

#28 in #counting

43 downloads per month

MIT/Apache

15KB
88 lines

token_trekker_rs

token_trekker_rs is a command-line tool for counting the total number of tokens in all files within a directory or matching a glob pattern, using various tokenizers.

Features

  • Supports multiple tokenizer options
  • Parallel processing for faster token counting
  • Outputs results in a colorized table

Installation

To install token_trekker_rs from crates.io, run:

cargo install token_trekker_rs

Building from Source

To build token_trekker_rs from the source code, first clone the repository:

git clone https://github.com/1rgs/token_trekker_rs.git
cd token_trekker_rs

Then build the project using cargo:

cargo build --release

The compiled binary will be available at ./target/release/token-trekker.

Usage

To count tokens in a directory or for files matching a glob pattern, run the following command:

token-trekker --path <path_or_glob_pattern> <tokenizer>

Replace <path_or_glob_pattern> with the path to the directory or the glob pattern of the files to process, and <tokenizer> with one of the available tokenizer options:

  • p50k-base
  • p50k-edit
  • r50k-base
  • cl100k-base
  • gpt2

For example:

token-trekker --path "path/to/files/*.txt" p50k-base
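For readers curious how the tokenizer argument might be validated, here is a minimal sketch that maps the five option names listed above onto an enum via `FromStr`. The `Tokenizer` type, its variant names, and the error message are assumptions made for illustration; the crate's real argument parsing may look quite different.

```rust
use std::str::FromStr;

/// Hypothetical enum covering the tokenizer options listed in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tokenizer {
    P50kBase,
    P50kEdit,
    R50kBase,
    Cl100kBase,
    Gpt2,
}

impl FromStr for Tokenizer {
    type Err = String;

    /// Accepts exactly the option names shown in the usage section.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "p50k-base" => Ok(Tokenizer::P50kBase),
            "p50k-edit" => Ok(Tokenizer::P50kEdit),
            "r50k-base" => Ok(Tokenizer::R50kBase),
            "cl100k-base" => Ok(Tokenizer::Cl100kBase),
            "gpt2" => Ok(Tokenizer::Gpt2),
            other => Err(format!("unknown tokenizer: {other}")),
        }
    }
}
```

With this in place, `"cl100k-base".parse::<Tokenizer>()` yields `Ok(Tokenizer::Cl100kBase)`, while a misspelled name such as `"gpt-2"` produces an error instead of silently falling back to a default.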

Dependencies

~26–44MB
~476K SLoC
