<div align="center">

# TensorRT Edge-LLM

**High-Performance Large Language Model Inference Framework for NVIDIA Edge Platforms**

Overview | Examples | Documentation | Roadmap

</div>
## Overview
TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. It enables efficient deployment of state-of-the-art language models on resource-constrained devices such as NVIDIA Jetson and NVIDIA DRIVE platforms. TensorRT Edge-LLM provides convenient Python scripts for converting HuggingFace checkpoints to ONNX; engine building and end-to-end inference run entirely on the edge platform.
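The three-stage flow described above (host-side export to ONNX, on-device engine build, on-device inference) can be sketched as follows. All function names and signatures here are illustrative placeholders, not the actual TensorRT Edge-LLM API; see the Quick Start Guide for the real commands.

```python
# Hypothetical sketch of the export -> build -> infer flow. These functions
# only model the file hand-offs between stages; they are NOT the real
# TensorRT Edge-LLM API.

def export_to_onnx(hf_checkpoint: str) -> str:
    """Stage 1, on the host: convert a HuggingFace checkpoint to ONNX."""
    model_name = hf_checkpoint.rsplit("/", 1)[-1]
    return f"{model_name}.onnx"

def build_engine(onnx_path: str, precision: str) -> str:
    """Stage 2, on the edge device: compile the ONNX graph into an engine."""
    return onnx_path.replace(".onnx", f".{precision}.engine")

def run_inference(engine_path: str, prompt: str) -> str:
    """Stage 3, on the edge device: load the engine and generate text."""
    return f"(engine {engine_path}) completion for: {prompt!r}"

# Example hand-off: the checkpoint name is a placeholder, not a claim of support.
onnx_path = export_to_onnx("org/example-1B-instruct")
engine_path = build_engine(onnx_path, precision="fp8")
print(run_inference(engine_path, "Describe the scene ahead."))
```

The point of the split is that only the lightweight ONNX export needs a development host; the compute-heavy engine build and the inference loop both stay on the target device.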
## Getting Started
For supported platforms, models, and precisions, see the Overview. You can get started with TensorRT Edge-LLM in under 15 minutes; for complete installation and usage instructions, see the Quick Start Guide.
## Documentation

### Introduction
- Overview - What is TensorRT Edge-LLM and key features
- Supported Models - Complete model compatibility matrix
### User Guide
- Installation - Set up Python export pipeline and C++ runtime
- Quick Start Guide - Run your first inference in ~15 minutes
- Examples - End-to-end LLM, VLM, EAGLE, and LoRA workflows
- Input Format Guide - Request format and specifications
- Chat Template Format - Chat template configuration
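As a concrete illustration of what a chat template does, the sketch below flattens a list of role-tagged messages into a single prompt string using a generic ChatML-style layout. This layout is an assumption chosen for illustration only; the template configuration TensorRT Edge-LLM actually uses is specified in the Chat Template Format guide and may differ.

```python
# Minimal, generic chat-template illustration (ChatML-style markers).
# Not the TensorRT Edge-LLM template format; see the Chat Template Format guide.

def apply_chat_template(messages):
    """Render role-tagged messages into one prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant")  # cue the model to respond
    return "\n".join(parts)

prompt = apply_chat_template([
    {"role": "system", "content": "You are a helpful in-vehicle assistant."},
    {"role": "user", "content": "Turn on the cabin lights."},
])
print(prompt)
```

Whatever the concrete markers are, the runtime needs this rendering step so that multi-turn requests arrive at the model in the turn structure it was trained on.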
### Developer Guide

#### Software Design
- Python Export Pipeline - Model export and quantization
- Engine Builder - Building TensorRT engines
- C++ Runtime Overview - Runtime system architecture
#### Advanced Topics
- Customization Guide - Customizing TensorRT Edge-LLM for your needs
- TensorRT Plugins - Custom plugin development
- Tests - Comprehensive test suite for contributors
## Use Cases

### 🚗 Automotive
- In-vehicle AI assistants
- Voice-controlled interfaces
- Scene understanding
- Driver assistance systems
### 🤖 Robotics
- Natural language interaction
- Task planning and reasoning
- Visual question answering
- Human-robot collaboration
### 🏭 Industrial IoT
- Equipment monitoring with NLP
- Automated inspection
- Predictive maintenance
- Voice-controlled machinery
### 📱 Edge Devices
- On-device chatbots
- Offline language processing
- Privacy-preserving AI
- Low-latency inference
## Tech Blogs
Coming soon! Stay tuned for technical deep-dives, optimization guides, and deployment best practices.
## Latest News
- [01/05] 🚀 Accelerate AI Inference for Edge and Robotics with NVIDIA Jetson T4000 and NVIDIA JetPack 7.1 ✨ ➡️ link
- [01/05] 🚀 Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM ✨ ➡️ link
Follow our GitHub repository for the latest updates, releases, and announcements.
## Support
- Documentation: Full Documentation
- Examples: Code Examples
- Roadmap: Developer Roadmap
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Forums: NVIDIA Developer Forums
## License
## Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
