NVIDIA Run:ai GPU Fractioning Delivers 77% Throughput at Half Allocation

2026/02/19 02:31

Darius Baruo Feb 18, 2026 18:31

NVIDIA and Nebius benchmarks show GPU fractioning achieves 86% user capacity on 0.5 GPU allocation, enabling 3x more concurrent users for mixed AI workloads.

NVIDIA's Run:ai platform can deliver 77% of full GPU throughput using just half the hardware allocation, according to joint benchmarking with cloud provider Nebius released on February 18. The results indicate that enterprises running large language model (LLM) inference can serve substantially more users without a proportional increase in GPU investment.

The tests, conducted on clusters with 64 NVIDIA H100 NVL GPUs and 32 NVIDIA HGX B200 GPUs, showed fractional GPU scheduling achieving near-linear performance scaling across 0.5, 0.25, and 0.125 allocations.

Hard Numbers from Production Testing

At 0.5 GPU allocation, the system supported 8,768 concurrent users while keeping time-to-first-token under one second, or 86% of the 10,200 users supported at full allocation. Token generation reached 152,694 tokens per second, compared with 198,680 at full allocation (about 77%).
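The headline percentages follow directly from these figures. The short Python sketch below recomputes them and also derives throughput per unit of GPU allocated. That last figure is not reported in the benchmark; it is a derived illustration, and the variable names are ours.

```python
# Figures quoted in the article for the 0.5 GPU allocation test.
FULL_USERS = 10_200      # concurrent users at full GPU allocation
HALF_USERS = 8_768       # concurrent users at 0.5 GPU allocation
FULL_TPS = 198_680       # tokens per second at full allocation
HALF_TPS = 152_694       # tokens per second at 0.5 allocation
ALLOCATION = 0.5         # fraction of a GPU given to each replica

# Share of full-allocation capacity retained at half allocation.
user_ratio = HALF_USERS / FULL_USERS
tps_ratio = HALF_TPS / FULL_TPS

# Derived (not from the source): throughput per unit of GPU allocated,
# relative to a full-GPU deployment.
tps_per_allocated_gpu = tps_ratio / ALLOCATION

print(f"users retained:      {user_ratio:.0%}")
print(f"throughput retained: {tps_ratio:.0%}")
print(f"throughput per allocated GPU: {tps_per_allocated_gpu:.2f}x")
```

Read this way, each half-GPU slice delivers roughly one and a half times the throughput per unit of hardware that a full-GPU deployment does, which is where the capacity gain comes from.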

Smaller models pushed these gains further. Phi-4-Mini running on 0.25 GPU fractions handled 72% more concurrent users than full-GPU deployment, achieving approximately 450,000 tokens per second with P95 latency under 300 milliseconds on 32 GPUs.

The mixed-workload scenario proved most striking. Running Llama 3.1 8B, Phi-4-Mini, and Qwen-Embeddings simultaneously on fractional allocations tripled total concurrent system users compared with single-model deployment. Combined throughput exceeded 350,000 tokens per second at full scale with no cross-model interference.

Why This Matters for GPU Economics

Traditional Kubernetes schedulers allocate whole GPUs to individual models, leaving substantial capacity stranded. The benchmarks noted that even Qwen3-14B, the largest model tested at 14 billion parameters, occupies only 35% of an H100 NVL's 80GB capacity.
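The 35% figure is consistent with a simple back-of-the-envelope estimate. The sketch below assumes 16-bit weights (2 bytes per parameter), decimal gigabytes, and the 80GB capacity quoted above; it counts model weights only and ignores KV cache and activation memory. These assumptions are ours, not stated in the benchmark.

```python
def weight_footprint_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in decimal GB.

    1e9 parameters * bytes_per_param bytes / 1e9 bytes per GB simplifies to
    params_billion * bytes_per_param. KV cache and activations are excluded.
    """
    return params_billion * bytes_per_param


GPU_CAPACITY_GB = 80  # capacity figure quoted in the article

footprint = weight_footprint_gb(14)    # Qwen3-14B, assumed 16-bit weights
share = footprint / GPU_CAPACITY_GB    # fraction of GPU memory used by weights

print(f"{footprint:.0f} GB of weights, {share:.0%} of GPU memory")
```

Under those assumptions the remaining memory is what a whole-GPU scheduler leaves stranded, less whatever the serving runtime reserves for KV cache.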

Run:ai's scheduler reduces this waste through dynamic memory allocation. Users specify the GPU fraction each workload needs, and the system distributes resources without preconfigured partitions. Memory isolation is enforced at runtime, while compute cycles are shared fairly among active processes.
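To see why fractional requests save hardware, consider a toy packing model. The Python sketch below is not Run:ai's scheduling algorithm; it is a plain first-fit heuristic with hypothetical request sizes, shown only to contrast whole-GPU allocation with fractional packing.

```python
def gpus_needed_first_fit(requests: list[float], capacity: float = 1.0) -> int:
    """Pack fractional GPU requests with a first-fit heuristic.

    Illustrative only: a real scheduler also accounts for memory isolation,
    fairness, and placement constraints.
    """
    free = []  # remaining free fraction on each GPU opened so far
    for req in requests:
        for i, remaining in enumerate(free):
            if req <= remaining:
                free[i] = remaining - req  # place on the first GPU that fits
                break
        else:
            free.append(capacity - req)    # no GPU fits, open a new one
    return len(free)


# Hypothetical workload mix using the fractions tested (0.5, 0.25, 0.125).
requests = [0.5, 0.25, 0.25, 0.5, 0.125, 0.125, 0.25]

whole_gpu_count = len(requests)                   # one whole GPU per model
fractional_count = gpus_needed_first_fit(requests)

print(whole_gpu_count, fractional_count)
```

In this made-up example, seven workloads that would each claim a whole GPU fit on two GPUs once they are packed by fraction.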

The release coincides with broader industry moves toward GPU partitioning. SoftBank and AMD announced validation testing on February 16 for similar fractioning capabilities on AMD Instinct GPUs, where a single GPU can be split into up to eight logical devices.

Autoscaling Without Latency Spikes

Nebius tested automatic scaling with Llama 3.1 8B configured to add GPUs when concurrent users exceeded 50. Replicas scaled from 1 to 16 with clean ramp-up, stable utilization during pod warm-up, and negligible HTTP errors.
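A threshold rule of this kind can be sketched in a few lines of Python. The article does not describe how the autoscaler was configured beyond the 50-user trigger and the 1 to 16 replica range, so the sketch assumes the threshold applies per replica and uses a hypothetical function name.

```python
import math


def desired_replicas(
    concurrent_users: int,
    users_per_replica: int = 50,   # threshold quoted in the article
    min_replicas: int = 1,         # lower bound of the tested range
    max_replicas: int = 16,        # upper bound of the tested range
) -> int:
    """Return a replica count that keeps each replica at or under the threshold.

    Assumption: the 50-user trigger is a per-replica target, so a new replica
    is added once average load per replica would exceed it.
    """
    wanted = math.ceil(concurrent_users / users_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for users in (0, 50, 51, 400, 800, 5000):
    print(users, desired_replicas(users))
```

The clamp to the 1 to 16 range mirrors the bounds observed in the test; beyond 800 users this simple rule would stop adding replicas.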

The practical implication: enterprises can run multiple inference models on existing GPU inventory, scale dynamically during peak demand, and reclaim idle capacity during off-hours for other workloads. For organizations facing fixed GPU budgets, fractioning transforms capacity planning from hardware procurement into software configuration.

Run:ai v2.24 is available now. NVIDIA plans to discuss the Nebius implementation at GTC 2026.
