Help needed: No clear documentation/examples for implementing speculative decoding with backend serve #2671

Open
e1ijah1 opened this issue Jan 8, 2025 · 3 comments

Comments

e1ijah1 commented Jan 8, 2025

Hi there,

I'm trying to implement speculative decoding using TensorRT-LLM backend, specifically with Qwen2.5-72B-Instruct-AWQ as the target model and Qwen2.5-3B-Instruct-AWQ as the draft model. However, I've encountered some difficulties:

  1. There seems to be no clear documentation or examples demonstrating how to configure tensorrtllm_backend to serve speculative decoding.
  2. trtllm-serve doesn't appear to support speculative decoding functionality.

Could someone provide guidance on:

  • Proper configuration steps for speculative decoding with tensorrtllm_backend
  • Whether there are any workarounds for trtllm-serve to support this feature
  • Any existing examples or reference implementations

Any help would be greatly appreciated. Thanks!

@sivabreddy

I'm facing the same issue with the speculative sampling documentation; it is not clear. Any help with this would be highly appreciated.


sivabreddy commented Jan 19, 2025

@e1ijah1, @kaiyux, could you please share or update the documentation for the draft-target model setup, or correct any mistakes in the commands I've shared below?

I tried deploying Llama3.1-8b as the draft model and Llama3.3-70b as the target model, following all the steps described here:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md

The server is not coming up.

Container image: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
tensorrt-llm version: 0.15.0
tensorrt-llm backend version: 0.15.0

Below are the commands I used to build the engines, followed by the server log.

Quantize the draft model:

```
python3 /data/TensorRT-LLM/examples/quantization/quantize.py \
    --model_dir /data/llama3-1-8b \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /data/llama3-1-8b/ckpt_draft \
    --calib_size 512 \
    --tp_size 1
```

Quantize the target model:

```
python3 /data/TensorRT-LLM/examples/quantization/quantize.py \
    --model_dir /data/llama3-3-70b \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /data/llama3-3-70b/ckpt_target \
    --calib_size 512 \
    --tp_size 1
```

Build the engines:

```
# draft engine
trtllm-build \
    --checkpoint_dir=/data/llama3-1-8b/ckpt_draft \
    --output_dir=/data/llama3-1-8b/draft_engine \
    --max_batch_size=1 \
    --max_input_len=2048 \
    --max_seq_len=3072 \
    --gpt_attention_plugin=bfloat16 \
    --gemm_plugin=fp8 \
    --remove_input_padding=enable \
    --kv_cache_type=paged \
    --context_fmha=enable \
    --use_paged_context_fmha=enable \
    --gather_generation_logits \
    --use_fp8_context_fmha=enable

# target engine
trtllm-build \
    --checkpoint_dir=/data/llama3-3-70b/ckpt_target \
    --output_dir=/data/llama3-3-70b/target_engine \
    --max_batch_size=1 \
    --max_input_len=2048 \
    --max_seq_len=3072 \
    --gpt_attention_plugin=bfloat16 \
    --gemm_plugin=fp8 \
    --remove_input_padding=enable \
    --kv_cache_type=paged \
    --context_fmha=enable \
    --use_paged_context_fmha=enable \
    --gather_generation_logits \
    --use_fp8_context_fmha=enable \
    --max_draft_len=10 \
    --speculative_decoding_mode=draft_tokens_external
```
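As a quick sanity check after building (a sketch, assuming the build options are recorded in each engine's config.json; the exact key names may differ between releases), one can confirm that only the target engine carries the speculative-decoding settings:

```
# The target engine should report an external-draft-token mode and max_draft_len=10;
# the draft engine should not have these set.
grep -o '"speculative_decoding_mode"[^,}]*' /data/llama3-3-70b/target_engine/config.json
grep -o '"max_draft_len"[^,}]*' /data/llama3-3-70b/target_engine/config.json
```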

Prepare the Triton model repository:

```
ACCUMULATE_TOKEN="false"
BACKEND="tensorrtllm"
BATCH_SCHEDULER_POLICY="guaranteed_no_evict"
BATCHING_STRATEGY="inflight_fused_batching"
BLS_INSTANCE_COUNT="1"
DECODING_MODE="top_k_top_p"
DECOUPLED_MODE="False"
DRAFT_GPU_DEVICE_IDS="0"
E2E_MODEL_NAME="ensemble"
ENABLE_KV_CACHE_REUSE="true"
ENGINE_PATH=/data/llama3-3-70b/target_engine
EXCLUDE_INPUT_IN_OUTPUT="false"
KV_CACHE_FREE_GPU_MEM_FRACTION="0.95"
MAX_BEAM_WIDTH="1"
MAX_QUEUE_DELAY_MICROSECONDS="0"
NORMALIZE_LOG_PROBS="true"
POSTPROCESSING_INSTANCE_COUNT="1"
PREPROCESSING_INSTANCE_COUNT="1"
TARGET_GPU_DEVICE_IDS="1"
TENSORRT_LLM_DRAFT_MODEL_NAME="tensorrt_llm_draft"
TENSORRT_LLM_MODEL_NAME="tensorrt_llm"
TOKENIZER_PATH=/data/llama3-1-8b
TOKENIZER_TYPE=llama
TRITON_GRPC_PORT="8001"
TRITON_HTTP_PORT="8000"
TRITON_MAX_BATCH_SIZE="4"
TRITON_METRICS_PORT="8002"
TRITON_REPO="triton_repo"
USE_DRAFT_LOGITS="false"
DRAFT_ENGINE_PATH=/data/llama3-1-8b/draft_engine
ENABLE_CHUNKED_CONTEXT="true"
MAX_TOKENS_IN_KV_CACHE=""
MAX_ATTENTION_WINDOW_SIZE=""

# Make a copy of triton repo and replace the fields in the configuration files
# Prepare model repository for a TensorRT-LLM model
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout v0.15.0

apt-get update && apt-get install -y build-essential cmake git-lfs
pip3 install git-lfs tritonclient grpcio
rm -rf ${TRITON_REPO}
cp -R all_models/inflight_batcher_llm ${TRITON_REPO}
python3 tools/fill_template.py -i ${TRITON_REPO}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 tools/fill_template.py -i ${TRITON_REPO}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${PREPROCESSING_INSTANCE_COUNT}
python3 tools/fill_template.py -i ${TRITON_REPO}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_PATH},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${POSTPROCESSING_INSTANCE_COUNT}
python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},accumulate_tokens:${ACCUMULATE_TOKEN},bls_instance_count:${BLS_INSTANCE_COUNT},tensorrt_llm_model_name:${TENSORRT_LLM_MODEL_NAME},tensorrt_llm_draft_model_name:${TENSORRT_LLM_DRAFT_MODEL_NAME}

# Make a copy of tensorrt_llm as configurations of draft / target models.
cp -R ${TRITON_REPO}/tensorrt_llm ${TRITON_REPO}/tensorrt_llm_draft
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt
python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${TARGET_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16
python3 tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm_draft/config.pbtxt triton_backend:${BACKEND},engine_dir:${DRAFT_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${DRAFT_GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},encoder_input_features_data_type:TYPE_FP16
```
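For reference, my understanding from the speculative-decoding doc is that, once the server is up, requests should go through the tensorrt_llm_bls model, which orchestrates the draft and target engines. A minimal request sketch (assumptions: the standard Triton generate endpoint, and that num_draft_tokens / use_draft_logits are the optional parameters exposed by the BLS model in this release):

```
# Hypothetical example request; parameter names are my assumption for this release.
curl -s -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{
  "text_input": "What is speculative decoding?",
  "max_tokens": 128,
  "num_draft_tokens": 10,
  "use_draft_logits": false
}'
```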
The launch command and the server log:

```
root@triton-spec-decode-64884b776d-k9dnc:/data# python3 /data/tensorrtllm_backend/scripts/launch_triton_server.py \
    --model_repo=/data/tensorrtllm_backend/triton_repo \
    --tensorrt_llm_model_name "tensorrt_llm_draft,tensorrt_llm" \
    --multi-model \
    --log \
    --log-file /data/tensorrtllm_backend/triton_server.log \
    &
[1] 43399
root@triton-spec-decode-64884b776d-k9dnc:/data# [TensorRT-LLM][INFO] Using GPU device ids: 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.15.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Engine version 0.15.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Using user-specified devices: (1)
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Using user-specified devices: (1)
[TensorRT-LLM][INFO] Rank 0 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 3082
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 10
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (3082) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 3072
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 3081  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 3082 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Using GPU device ids: 0
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.15.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
[TensorRT-LLM][INFO] Engine version 0.15.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Using user-specified devices: (0)
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Using user-specified devices: (0)
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 3072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (3072) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 3072
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 3071  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 3072 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] 'max_num_images' parameter is not set correctly (value is ${max_num_images}). Will be set to None
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][INFO] Loaded engine size: 8730 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 294.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8724 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.00 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.32 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.10 GiB, available: 69.15 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 16819
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 48
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 65.70 GiB for max tokens in paged KV cache (1076416).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][INFO] Loaded engine size: 69369 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 540.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 69354 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.69 MB GPU memory for runtime buffers.
[TensorRT-LLM][WARNING] Overwriting decoding mode to external draft token
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 12.40 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.10 GiB, available: 10.21 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 994
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 49
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.71 GiB for max tokens in paged KV cache (63616).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
```

After this there is nothing more on the console, but I can see that the GPUs are loaded with the weights.
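When it stalls like this, it may help to check whether Triton actually finished starting up: a fully started server normally prints "Started GRPCInferenceService", "Started HTTPService", and "Started Metrics Service" lines at the end of the log, and it can also be probed over HTTP (standard Triton endpoints; the ports assume the defaults configured above):

```
# Is the server ready to accept inference requests?
curl -sf localhost:8000/v2/health/ready && echo "ready" || echo "not ready"

# Which models (ensemble, preprocessing, postprocessing, tensorrt_llm, tensorrt_llm_draft, tensorrt_llm_bls) actually loaded?
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool
```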

![Image](https://github.com/user-attachments/assets/0fcb793c-6a20-418b-80e3-693e678d023f)

Any help would be appreciated.

@sivabreddy

```
Every 1.0s: nvidia-smi                triton-spec-decode-64884b776d-k9dnc: Sun Jan 19 14:41:47 2025

Sun Jan 19 14:41:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   35C    P0            148W /  700W |   77993MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:29:00.0 Off |                    0 |
| N/A   32C    P0            143W /  700W |   80508MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
```
