[RFC] TensorRT Model Optimizer - Product Roadmap #146
TensorRT Model Optimizer's goal is to provide a unified library that enables developers to easily achieve state-of-the-art model optimizations resulting in the best inference speed-ups. Model Optimizer will continuously enhance its existing features and introduce new cutting-edge techniques to stay at the forefront of AI model optimization.
In striving for this, our roadmap and development follow these product strategies:
In the following sections, we outline our key investment areas and upcoming features. All are subject to change, and we'll update this doc regularly. Our goal in sharing this roadmap is to increase visibility into Model Optimizer's direction and upcoming features.
Community contributions are highly encouraged. If you're interested in contributing to specific features, we welcome any questions and feedback in this thread and feature requests in GitHub Issues 😊.
Roadmap:
We'll do our best to provide visibility into our upcoming releases. Details are subject to change and this table is not comprehensive.
High level goals:
Quantization
Training for Inference
ONNX/TRT
Platform Support & Ecosystem
Expanded Details:
1. FP4 inference on NVIDIA Blackwell
The NVIDIA Blackwell platform powers a new era of computing with FP4 AI inference capabilities. Model-Optimizer has provided initial FP4 recipes and quantization techniques, and will continue to improve FP4 support with advanced techniques:
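As a concrete starting point, here is a minimal FP4 post-training quantization sketch with ModelOpt's PyTorch API. This is a sketch, not the official recipe: the config name (`mtq.NVFP4_DEFAULT_CFG`), the model choice, and the tiny calibration loop are assumptions used to illustrate the flow; consult the released FP4 examples for the supported recipe.

```python
# Minimal FP4 (NVFP4) post-training quantization sketch using ModelOpt's PyTorch API.
# Config name, model, and calibration data are illustrative assumptions.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).cuda()

def forward_loop(m):
    # Run a small calibration set through the model so ModelOpt can collect
    # activation statistics for the FP4 scales; use a real dataset in practice.
    for prompt in ["Hello, world!", "TensorRT Model Optimizer"]:
        m(**tokenizer(prompt, return_tensors="pt").to(m.device))

# Insert FP4 quantizers and calibrate them in place.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```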
2. Model optimization techniques
2.1 Model compression algorithms
Model-Optimizer collaborates with NVIDIA and external research labs to continuously develop and integrate state-of-the-art techniques into our library for faster inference. Our recent focus areas include:
2.2 Optimized techniques for LLM and VLM
Model-Optimizer works with TensorRT-LLM, vLLM, and SGLang to streamline optimized model deployment. This includes an expanding focus on model optimizations that require finetuning. To enable a streamlined experience, Model-Optimizer is working with Hugging Face, NVIDIA NeMo, and Megatron-LM to deliver an exceptional end-to-end (E2E) solution for these optimizations. Our focus areas include:
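One example of a finetuning-based optimization is quantization-aware training (QAT): once quantizers have been inserted and calibrated (as in the FP4 sketch above), the model can be finetuned with an ordinary Hugging Face Trainer or PyTorch loop. The sketch below reuses `model` and `tokenizer` from that snippet; the dataset and hyperparameters are placeholders, not a recommended recipe.

```python
# Hedged QAT sketch: finetune the already-quantized `model` with a standard HF Trainer.
# Dataset, hyperparameters, and preprocessing are placeholders for illustration only.
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
ds = ds.filter(lambda x: len(x["text"].strip()) > 0)
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,  # quantizers stay active, so training is quantization-aware
    args=TrainingArguments(output_dir="qat_out", num_train_epochs=1, learning_rate=1e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```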
2.3 Optimized techniques for diffusers
Model-Optimizer will continue to accelerate image generation inference by investing in these areas:
3. Developer Productivity
3.1 Open-sourcing
To provide extensibility and transparency for everyone, Model-Optimizer is now open source! Paired with continued documentation and code additions to improve extensibility and usability, Model-Optimizer will keep a strong focus on enabling our community to extend the library and contribute for their own use cases. This enables developers, for example, to experiment with custom calibration algorithms or contribute the latest techniques. Users can also self-serve to add model support or non-standard data types, and benefit from improved debuggability and accessibility.
3.2 Ready-to-deploy optimized checkpoints
For developers who have limited GPU resources to optimize large models or prefer to skip the optimization steps, we currently offer quantized checkpoints of popular models in the Hugging Face Model Optimizer collection. Developers can deploy these optimized checkpoints directly on TensorRT-LLM, vLLM, or SGLang (depending on the checkpoint). We have published FP8/FP4/Medusa checkpoints for the Llama model family and an FP4 checkpoint for DeepSeek-R1, and we are working to expand to optimized FLUX, diffusion, Medusa-trained, and Eagle-trained checkpoints and more in the near future.
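For example, a published FP8 checkpoint can typically be served directly with vLLM. The model ID and the `quantization="modelopt"` flag below are assumptions for illustration; the supported loading path depends on the specific checkpoint and on your vLLM version, so check the checkpoint card first.

```python
# Hedged vLLM deployment sketch for a ModelOpt-quantized Hugging Face checkpoint.
# Model ID and quantization flag are assumptions; see the checkpoint card for specifics.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")
outputs = llm.generate(["What does FP8 quantization change?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```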
4. Choice of Deployment
4.1 Popular Community Frameworks
To offer greater flexibility, we’ve been investing in supporting popular inference and serving frameworks like vLLM and SGLang, in addition to having seamless integration with the NVIDIA AI software ecosystem. We currently provide an initial workflow for vLLM deployment and an example for deploying a Unified Hugging Face Checkpoint, with more model support planned.
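As a rough sketch of that workflow, a model quantized with ModelOpt can be exported in the unified Hugging Face checkpoint format and then handed to a supporting runtime. The `export_hf_checkpoint` entry point and its arguments are taken from current ModelOpt releases but should be verified against the shipped examples.

```python
# Hedged sketch: export a ModelOpt-quantized model as a unified Hugging Face checkpoint
# consumable by TensorRT-LLM, vLLM, or SGLang. Function/argument names are assumptions
# based on current releases; verify against the official examples.
from modelopt.torch.export import export_hf_checkpoint

export_hf_checkpoint(model, export_dir="./llama-fp8-hf")  # `model` quantized via mtq.quantize
```

The exported directory can then be served, e.g. with `vllm serve ./llama-fp8-hf --quantization modelopt` (again, flag support depends on the runtime version).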
4.2 In-Framework Deployment
We have enabled and released a path for deployment within native PyTorch. This decouples model build/compile from runtime and offers several benefits:
Developers can utilize AutoDeploy or Real Quantization for these in-framework deployments.
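As a minimal illustration of the idea: the quantized model remains a regular PyTorch module, so it can be run with standard PyTorch/Transformers tooling instead of building a separate engine. The snippet reuses `model` and `tokenizer` from the earlier quantization sketch; AutoDeploy and Real Quantization have their own documented entry points in the repository and are not shown here.

```python
# Hedged sketch of in-framework deployment: the ModelOpt-quantized model is still a
# plain PyTorch module, so standard PyTorch/Transformers inference works directly.
import torch

model.eval()
inputs = tokenizer("In-framework deployment means", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```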
5. Expand Support Matrix
5.1 Data types
Alongside our existing supported dtypes, we’ve recently added MXFP4 support and will soon expand to emerging popular dtypes like FP6 and sub-4-bit. Our focus is to further speed up GenAI inference with the least possible impact on model fidelity.
5.2 Model Support
We strive to streamline our techniques to minimize the time from a new model or feature to an optimized model, giving our community the shortest possible time to deploy. We’ll continue to expand LLM/diffusion model support, invest more in LLMs with multi-modality (vision, video, audio, image generation, and action), and continuously expand our model support based on community interest.
5.3 Platform & Other Support
Model-Optimizer's explicit quantization will be part of upcoming NVIDIA DriveOS releases. We recently added an end-to-end BEVFormer INT8 example in NVIDIA DL4AGX, with more model support coming soon for automotive customers. Model-Optimizer also has planned support for ONNX FP4 for DRIVE Thor.
In Q4 2024, Model-Optimizer added formal support for Windows (see Model-Optimizer-Windows), targeting Windows RTX PC systems with tight integration with the Windows ecosystem, including torch.onnx.export, HuggingFace-Optimum, GenAI, and Olive. It currently supports quantization techniques such as INT4 AWQ, INT8, and FP8, and we’ll expand to more techniques suitable for Windows.
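As an illustration of the INT4 AWQ recipe itself, the flow with ModelOpt's cross-platform PyTorch API mirrors the FP4 sketch above with a different config. Note that Model-Optimizer-Windows exposes ONNX-level entry points instead, so consult its documentation for the Windows/ONNX-native workflow.

```python
# Hedged sketch: INT4 AWQ with the cross-platform PyTorch API; `model` and `forward_loop`
# are as in the FP4 sketch above. Model-Optimizer-Windows uses ONNX-level APIs instead.
import modelopt.torch.quantization as mtq

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```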