NVIDIA Arena
  • News
  • Tech
  • Generative AI
  • Computers
  • Graphics Card
  • Robotics
  • Cybersecurity
No Result
View All Result
  • News
  • Tech
  • Generative AI
  • Computers
  • Graphics Card
  • Robotics
  • Cybersecurity
No Result
View All Result
NVIDIA Arena
No Result
View All Result

Home » NVIDIA Boosts OpenAI gpt-oss Speed to 1.5M TPS

NVIDIA Boosts OpenAI gpt-oss Speed to 1.5M TPS

Joel Wamono by Joel Wamono
August 7, 2025
in Generative AI
Reading Time: 3 mins read
A A
NVIDIA Boosts OpenAI gpt-oss Speed to 1.5M TPS
Share on FacebookShare on Twitter

The new OpenAI gpt-oss NVIDIA collaboration has set a performance milestone in the AI world. NVIDIA’s GB200 NVL72 system now delivers 1.5 million tokens per second (TPS) on gpt-oss-120b, setting the standard for inference throughput.

Ultra-Fast Inference on Blackwell GPUs

Using its latest Blackwell architecture, NVIDIA tuned the gpt-oss-20b and gpt-oss-120b models to achieve unmatched inference speeds. These open-weight LLMs use a mixture-of-experts (MoE) design with SwigGLU activations and 128k token context windows.

Running at FP4 precision, the models fit efficiently on modern data center GPUs. The smaller 20B model activates 3.6B parameters per token, while the larger 120B model activates 5.1B, spread across 32 to 128 experts.

Optimized Software Stack for Developers

NVIDIA didn’t stop at hardware acceleration. It also collaborated with Hugging Face, vLLM, TensorRT-LLM, and Ollama to deliver low-latency inference with enhanced kernel libraries.

FlashInfer provides optimized attention and routing for large language models. The integration of Triton and CUTLASS kernels ensures smooth operation across Hopper and Blackwell GPUs.

Multi-Platform Deployment Options

Developers have multiple ways to deploy these optimized models:

  • vLLM offers an OpenAI-compatible web server for rapid prototyping.
  • TensorRT-LLM provides a Docker-ready deployment toolkit with Hugging Face integration.
  • NVIDIA Launchable supports one-click deployment in JupyterLab for testing in the cloud.
  • Dynamo, an open-source platform, enables large-scale disaggregated inference with autoscaling.

This flexibility helps developers find the right balance between cost, latency, and performance.

Run Locally on RTX AI PCs

For developers working locally, the gpt-oss-20b model runs smoothly on any RTX AI PC with 16GB VRAM. Support is available through tools like Ollama, Llama.cpp, and Microsoft Foundry Local, allowing rapid experimentation with reduced latency and greater data privacy.

The 120B model requires more power, but it runs efficiently on RTX PRO workstations with the right setup.

Enterprise Access Through NVIDIA NIM

For enterprise needs, NVIDIA delivers the models through its NIM microservices. These packaged APIs simplify deployment on any GPU infrastructure with secure, flexible, and privacy-focused features.

NIM supports both gpt-oss-20b and gpt-oss-120b, letting businesses scale AI solutions with minimal setup.

Final Thoughts

The OpenAI gpt-oss NVIDIA partnership has reshaped what’s possible for open-weight LLMs. With 1.5 million TPS, flexible deployment tools, and broad compatibility, NVIDIA’s platform now supports massive-scale applications across the AI ecosystem.

Whether you’re a startup, enterprise, or indie developer, this release opens the door to faster and smarter AI deployments.

Tags: GB200 NVL72gpt-oss-120bgpt-oss-20bNVIDIA BlackwellOpenAI gpt-oss NVIDIA
Previous Post

Mafia The Old Country Launches on GeForce NOW

Next Post

OpenAI gpt-oss Models Now Run Fast on NVIDIA RTX

Related Posts

Nvidia physical AI
Generative AI

Nvidia Physical AI Push Expands Into South Korea

by Nakayenga Patricia Renee
1 month ago
0

Nvidia physical AI ambitions are gaining momentum as the company explores new partnerships in...

Read moreDetails
Meta $3 Trillion
Tech

Meta $3 Trillion Prediction: Can AI Push META Into the Elite Club?

by Nakayenga Patricia Renee
3 months ago
0

Meta $3 Trillion is quickly becoming a serious talking point among investors watching the...

Read moreDetails
The Rise of AI Inference: Nvidia’s Pivot to ‘AI Factories’
Generative AI

The Rise of AI Inference: Nvidia’s Pivot to ‘AI Factories’

by Dancan Odhiambo
4 months ago
0

The landscape of artificial intelligence (AI) is evolving, and Nvidia, a company long known...

Read moreDetails
Nvidia’s Stock Price Prediction for 2026: Will It Double?
Generative AI

Nvidia’s Stock Price Prediction for 2026: Will It Double?

by Dancan Odhiambo
4 months ago
0

Nvidia has been one of the standout performers in global markets over the past...

Read moreDetails
The Impact of Geopolitical Risks on Nvidia’s Business and Stock Price
Generative AI

The Impact of Geopolitical Risks on Nvidia’s Business and Stock Price

by Dancan Odhiambo
4 months ago
0

As one of the world’s leading technology companies, Nvidia has become synonymous with cutting-edge...

Read moreDetails
Nvidia’s Expansion into Robotics and Quantum Computing: A New Growth Opportunity?
Generative AI

Nvidia’s Expansion into Robotics and Quantum Computing: A New Growth Opportunity?

by Dancan Odhiambo
4 months ago
0

Nvidia, a company best known for its graphics processing units (GPUs), has rapidly evolved...

Read moreDetails
Next Post
OpenAI gpt-oss Models Now Run Fast on NVIDIA RTX

OpenAI gpt-oss Models Now Run Fast on NVIDIA RTX

WUCHANG: Fallen Feathers joins GeForce NOW

WUCHANG: Fallen Feathers joins GeForce NOW

  • About NVIDIArena
  • Advertise With NVIDIArena
  • Contact Us
  • Privacy Policy
  • Terms and Conditions

© 2026 Nvidia Arena

No Result
View All Result
  • News
  • Tech
  • Generative AI
  • Computers
  • Graphics Card
  • Robotics
  • Cybersecurity

© 2026 Nvidia Arena