Difference in quantization implementation between quantize.py and convert_checkpoint.py #2681
Labels: triaged (Issue has been triaged by maintainers)
I've noticed that I can apply SmoothQuant to models with quantize.py using the command:
python quantize.py --model_dir $MODEL_PATH --qformat int8_sq --kv_cache_dtype int8 --output_dir $OUTPUT_PATH
Additionally, I can achieve this with convert_checkpoint.py by running:
python3 convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_sq \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel
The latter approach seems more flexible, since I can adjust parameters such as the SmoothQuant ratio, --per_token, and --per_channel.
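For context, my understanding of what the SmoothQuant ratio (the --smoothquant value, alpha) controls is roughly the per-channel scaling sketched below. This is only a rough illustration of the idea from the SmoothQuant paper, not the actual TensorRT-LLM implementation, and the array values are made up:

# Rough sketch of SmoothQuant per-channel smoothing (not TensorRT-LLM's code).
# alpha is the value passed via --smoothquant; 0.5 splits the quantization
# difficulty evenly between activations and weights.
import numpy as np

def smooth_scales(act_abs_max, weight_abs_max, alpha=0.5, eps=1e-5):
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), computed per input channel.
    a = np.clip(act_abs_max, eps, None)
    w = np.clip(weight_abs_max, eps, None)
    return np.clip(a**alpha / w**(1.0 - alpha), eps, None)

# Activations are divided by s and weights multiplied by s, so X @ W is
# unchanged, but activation outliers are "smoothed" into the weights before
# INT8 quantization.
act_abs_max = np.array([10.0, 0.5, 3.0])    # made-up per-channel activation maxima
weight_abs_max = np.array([0.2, 0.4, 0.1])  # made-up per-channel weight maxima
print(smooth_scales(act_abs_max, weight_abs_max, alpha=0.5))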
Does the first command offer broader model compatibility, while the latter is restricted to models that have a dedicated convert_checkpoint.py? In other words, when a model has a corresponding convert_checkpoint.py, should I prioritize using it?
Furthermore, I noticed that both commands generate safetensors files and a config.json. Is it possible to use quantize.py to generate the config.json and then manually modify the quantization-related fields afterward?
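To make the question concrete, the kind of manual edit I have in mind is sketched below. It assumes the generated config.json contains a "quantization" block with fields such as quant_algo and kv_cache_quant_algo (that is what my generated checkpoints show, but the exact field names may differ between TensorRT-LLM versions, so please treat them as assumptions):

import json

ckpt_dir = "./tllm_checkpoint_1gpu_sq"  # output_dir from the convert_checkpoint.py run above

with open(f"{ckpt_dir}/config.json") as f:
    cfg = json.load(f)

# Inspect the quantization-related fields written by the conversion tool.
print(cfg.get("quantization", {}))

# Hypothetical manual edit; note that the per-channel/per-token scaling tensors
# live in the safetensors shards, so changing only these JSON fields would
# probably not be enough to actually switch quantization modes.
cfg.setdefault("quantization", {})["kv_cache_quant_algo"] = "INT8"

with open(f"{ckpt_dir}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)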