tensorrt-llm

Here are 28 public repositories matching this topic...

xlite-dev / Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

mla vllm llm-inference awesome-llm flash-attention tensorrt-llm paged-attention deepseek flash-attention-3 deepseek-v3 minimax-01 deepseek-r1 flash-mla qwen3

Updated Jul 14, 2025
Python

collabora / WhisperLive

A nearly-live implementation of OpenAI's Whisper.

text-to-speech translation voice-recognition openai obs dictation whisper tensorrt openvino openvino-intel tensorrt-llm whisper-tensorrt

Updated Jul 21, 2025
Python

shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines

deep-learning speech-recognition vad speech-to-text whisper asr tensorrt voice-activity-detection tensorrt-llm

Updated Aug 27, 2024
Jupyter Notebook

huggingface / optimum-benchmark

🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.

benchmark pytorch openvino onnxruntime text-generation-inference neural-compressor tensorrt-llm

Updated May 28, 2025
Python

coderonion / awesome-cuda-and-hpc

🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

Updated Jul 11, 2025

npuichigo / openai_trtllm

OpenAI compatible API for TensorRT LLM triton backend

triton-inference-server openai-api llm langchain tensorrt-llm

Updated Aug 1, 2024
Rust

Deep Learning Deployment Framework: supports tf/torch/trt/trtllm/vllm and other NN frameworks, with dynamic batching and streaming modes. It is compatible with both Python and C++, offering scalability, extensibility, and high performance. It helps users quickly deploy models and provide services through HTTP/RPC interfaces.

tensorflow torch tensorrt serving triton-inference-server dynamic-batching vllm tensorrt-llm

Updated May 8, 2025
C++

NetEase-Media / grps_trtllm

Higher performance OpenAI LLM service than vLLM serve: A pure C++ high-performance OpenAI LLM service implemented with GRPS+TensorRT-LLM+Tokenizers.cpp, supporting chat and function call, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.

Updated May 14, 2025
Python

openhackathons-org / End-to-End-LLM

This repository is AI Bootcamp material that consists of a workflow for LLM

natural-language-processing deep-learning question-answering prompt-tuning p-tuning llm genai nemo-guardrails tensorrt-llm nemo-megatron

Updated Jul 21, 2025
Jupyter Notebook

vossr / Chat-With-RTX-python-api

Chat With RTX Python API

tensorrt llm llm-inference tensorrt-llm mistral-7b llama2-13b chat-with-rtx nvidia-chat-with-rtx

Updated May 11, 2025
Python

guidance-ai / llgtrt

TensorRT-LLM server with Structured Outputs (JSON) built with Rust

json regex guidance cfg openai-api tensorrt-llm structured-generation

Updated Apr 25, 2025
Rust

argonne-lcf / LLM-Inference-Bench

LLM-Inference-Bench

benchmark inference deepspeed llm llamacpp vllm tensorrt-llm

Updated Jul 18, 2025
Jupyter Notebook

menloresearch / cortex.tensorrt-llm

Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.

nvidia jan tensorrt llm tensorrt-llm