The new OpenAI gpt-oss NVIDIA collaboration has set a performance milestone in the AI world. NVIDIA’s GB200 NVL72 system now delivers 1.5 million tokens per second (TPS) on gpt-oss-120b, setting the standard for inference throughput.
Ultra-Fast Inference on Blackwell GPUs
Using its latest Blackwell architecture, NVIDIA tuned the gpt-oss-20b and gpt-oss-120b models to achieve unmatched inference speeds. These open-weight LLMs use a mixture-of-experts (MoE) design with SwigGLU activations and 128k token context windows.
Running at FP4 precision, the models fit efficiently on modern data center GPUs. The smaller 20B model activates 3.6B parameters per token, while the larger 120B model activates 5.1B, spread across 32 to 128 experts.
Optimized Software Stack for Developers
NVIDIA didn’t stop at hardware acceleration. It also collaborated with Hugging Face, vLLM, TensorRT-LLM, and Ollama to deliver low-latency inference with enhanced kernel libraries.
FlashInfer provides optimized attention and routing for large language models. The integration of Triton and CUTLASS kernels ensures smooth operation across Hopper and Blackwell GPUs.
Multi-Platform Deployment Options
Developers have multiple ways to deploy these optimized models:
- vLLM offers an OpenAI-compatible web server for rapid prototyping.
- TensorRT-LLM provides a Docker-ready deployment toolkit with Hugging Face integration.
- NVIDIA Launchable supports one-click deployment in JupyterLab for testing in the cloud.
- Dynamo, an open-source platform, enables large-scale disaggregated inference with autoscaling.
This flexibility helps developers find the right balance between cost, latency, and performance.
Run Locally on RTX AI PCs
For developers working locally, the gpt-oss-20b model runs smoothly on any RTX AI PC with 16GB VRAM. Support is available through tools like Ollama, Llama.cpp, and Microsoft Foundry Local, allowing rapid experimentation with reduced latency and greater data privacy.
The 120B model requires more power, but it runs efficiently on RTX PRO workstations with the right setup.
Enterprise Access Through NVIDIA NIM
For enterprise needs, NVIDIA delivers the models through its NIM microservices. These packaged APIs simplify deployment on any GPU infrastructure with secure, flexible, and privacy-focused features.
NIM supports both gpt-oss-20b and gpt-oss-120b, letting businesses scale AI solutions with minimal setup.
Final Thoughts
The OpenAI gpt-oss NVIDIA partnership has reshaped what’s possible for open-weight LLMs. With 1.5 million TPS, flexible deployment tools, and broad compatibility, NVIDIA’s platform now supports massive-scale applications across the AI ecosystem.
Whether you’re a startup, enterprise, or indie developer, this release opens the door to faster and smarter AI deployments.








