Case study · 8× A100 inference cluster

An anonymized example from a production inference environment that relied on 8× A100 GPUs across two nodes. The service was revenue-critical and needed stable performance during a launch period.

Background

The team ran a real-time video and image inference API. Traffic came in bursts, peaking between 1.2k and 3.4k requests per second. The system used a familiar stack: PyTorch for training, exported to ONNX, then to TensorRT for deployment. Two nodes each hosted 4× A100 80GB GPUs behind a Kubernetes-based orchestration layer.
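
For context, the export leg of such a pipeline typically looks like the sketch below. This is a minimal illustration, not the team's actual configuration: the model, input shape, and opset version are placeholders.

```python
import torch
from torchvision.models import resnet50

# Hypothetical stand-in for the production model; the real network and
# input shape are not part of this write-up.
model = resnet50(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    # Keep the batch dimension dynamic so the serving layer can vary batch size.
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)
# The ONNX graph is then compiled into a TensorRT engine, e.g.:
#   trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
```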

On paper, the hardware was more than sufficient. In practice, the cluster often behaved as if only a fraction of the capacity was available.

Symptoms

GPU utilization hovered around 18–35% even at peak traffic, P95 latency swung between roughly 120 and 480 ms, nominally identical GPUs differed by up to 2.1× in observed performance, and traffic bursts during the launch period repeatedly required manual intervention.

Initial assessment

The first step was to align all available signals on a common timeline: GPU utilization, CPU utilization, request rate, and queue depth. Even before enabling deeper profiling, this alignment revealed “empty slots” between sequences of GPU kernels. In other words, the GPUs were waiting.
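
As an illustration of that alignment step, here is a minimal sketch. It assumes each metric has been exported to a two-column CSV (timestamp plus value), which is an assumption about tooling for the example, not a description of the team's monitoring stack; the file names are hypothetical.

```python
import pandas as pd

# Hypothetical export files; each has a "ts" timestamp column and one value column.
SOURCES = {
    "gpu_util": "gpu_util.csv",
    "cpu_util": "cpu_util.csv",
    "request_rate": "request_rate.csv",
    "queue_depth": "queue_depth.csv",
}

frames = []
for name, path in SOURCES.items():
    df = pd.read_csv(path, parse_dates=["ts"]).set_index("ts")
    # Resample every signal onto the same 1-second grid so they line up.
    frames.append(df.resample("1s").mean().rename(columns={df.columns[0]: name}))

aligned = pd.concat(frames, axis=1)

# Crude "what is the GPU waiting for" view: seconds where requests are queued
# but GPU utilization is low.
suspicious = aligned[(aligned["queue_depth"] > 0) & (aligned["gpu_util"] < 30)]
print(suspicious.head())
```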

We then reviewed how requests were batched, how data moved between host and device, and how work was scheduled across nodes.

Investigation path

1. Batching behavior under real traffic

The autoscaler and batching logic had been tuned on synthetic load tests, not on real traffic. Under certain burst patterns, the system generated many micro-batches of size 1–2 instead of the intended larger batches. This hurt GPU saturation and created unnecessary overhead on both CPU and GPU.
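
The usual remedy is a batching window that bounds both batch size and added latency: collect requests until the batch is full or a small wait budget expires, whichever comes first. A simplified sketch of that policy follows, with illustrative parameter values rather than the tuned production settings.

```python
import queue
import time

MAX_BATCH = 32       # illustrative; the real value should come from tests on real traffic
MAX_WAIT_MS = 5.0    # upper bound on latency added by waiting for a fuller batch

def next_batch(request_queue: queue.Queue) -> list:
    """Block for the first request, then keep filling the batch until it is
    full or the wait budget is spent."""
    batch = [request_queue.get()]                     # wait for at least one request
    deadline = time.monotonic() + MAX_WAIT_MS / 1000.0
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```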

2. Data movement and host–device transfers

By tracing host–device transfers, we discovered redundant copies before every inference call. Several intermediate representations were serialized and then deserialized again, even when the data followed the same path. This created an extra 8–12 ms of overhead per batch on some code paths.
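
The copies are straightforward to spot in a profiler trace. The sketch below uses the PyTorch profiler on a placeholder model; the production path runs through TensorRT, but the same "Memcpy HtoD" pattern shows up either way, and the pinned-memory copy shows the shape a single, intentional transfer should take.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input, used only to produce a representative trace.
model = torch.nn.Conv2d(3, 16, kernel_size=3).cuda().eval()
batch = torch.randn(8, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        # Pinned memory plus non_blocking=True: one explicit host-to-device copy.
        x = batch.pin_memory().to("cuda", non_blocking=True)
        _ = model(x)

# Repeated "Memcpy HtoD" rows for the same data path are the smell to look for.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```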

3. Kernel-level fragmentation

Profiling at the kernel level showed many small kernels with significant launch overhead. Individually they were inexpensive, but together they filled the timeline with gaps instead of a steady block of useful work.
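
When input shapes are static enough, one common mitigation is to capture the sequence of small kernels into a CUDA graph and replay it, so launch overhead is paid once per batch instead of once per kernel. A minimal PyTorch sketch of the idea is below; the model and shapes are placeholders, and the team's TensorRT path has its own mechanisms for the same goal.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda().eval()

static_in = torch.zeros(32, 1024, device="cuda")

# Warm up on a side stream before capture, as recommended for CUDA graphs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    with torch.no_grad():
        for _ in range(3):
            model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.no_grad():
    with torch.cuda.graph(g):
        static_out = model(static_in)

def infer(batch: torch.Tensor) -> torch.Tensor:
    # Copy new inputs into the captured buffer, replay the whole kernel
    # sequence with a single launch, then read back the captured output.
    static_in.copy_(batch)
    g.replay()
    return static_out.clone()
```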

4. NUMA and node-level considerations

Three GPUs were bound to CPU NUMA nodes that also hosted heavier application logic and background tasks. That imbalance amplified existing bottlenecks and made those GPUs appear “slower” than the others.
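
On Linux, the NUMA node a GPU hangs off is exposed in sysfs, which makes it possible to pin each GPU worker to CPU cores local to its device. The rough sketch below illustrates the idea; the PCI address is a placeholder (in practice it comes from the GPU inventory, e.g. `nvidia-smi --query-gpu=pci.bus_id --format=csv`), and tools like `nvidia-smi topo -m` or `numactl` do the same job without custom code.

```python
import os
from pathlib import Path

# Placeholder PCI address for one of the A100s.
GPU_PCI_ADDR = "0000:3b:00.0"

def cpus_local_to_gpu(pci_addr: str) -> set[int]:
    """Return the CPU ids on the same NUMA node as the given PCI device."""
    node = int(Path(f"/sys/bus/pci/devices/{pci_addr}/numa_node").read_text())
    if node < 0:  # -1 means the platform exposes no NUMA information
        return set(range(os.cpu_count()))
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text().strip()
    cpus: set[int] = set()
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# Pin the current worker process to GPU-local cores so it does not compete
# with unrelated application logic on a remote NUMA node.
os.sched_setaffinity(0, cpus_local_to_gpu(GPU_PCI_ADDR))
```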

Key findings

Four issues dominated: batching logic tuned on synthetic load produced micro-batches under real bursty traffic; redundant serialization and host–device copies added 8–12 ms per batch on some paths; many small kernels left the GPU timeline fragmented by launch overhead; and three GPUs shared NUMA nodes with heavy application logic and background tasks.

Changes applied

The batching and autoscaling logic was retuned against real production traffic rather than synthetic load, the redundant serialization and copy steps were removed, the hottest inference paths were restructured to reduce kernel launch overhead, and the affected workers were rebound so that each ran on CPU cores local to its GPU's NUMA node.

Before & after

| Metric | Before | After |
| --- | --- | --- |
| GPU utilization | 18–35% | 65–78% |
| P95 latency | 120–480 ms (highly variable) | 180–220 ms (stable under load) |
| Cross-GPU variance | Up to 2.1× difference | Approx. 1.1× |
| Launch-period incidents | Frequent manual intervention | No incident during the launch window |

Lessons learned

In this case, there was nothing “wrong” with the GPUs themselves. The real issues were in batching, data movement, scheduling, and NUMA layout. Once those were addressed, the hardware behaved much closer to expectations.

The main takeaway is that low utilization does not automatically mean the GPUs are undersized or at fault. The most practical question is usually: what is the GPU waiting for? Answering that precisely is the core of our work.