Case study · 8× A100 inference cluster

An anonymized example from a production inference environment that relied on 8× A100 GPUs across two nodes. The service was revenue-critical and needed stable performance during a launch period.

Background

The team ran a real-time video and image inference API. Traffic came in bursts, peaking between 1.2k and 3.4k requests per second. The system used a familiar stack: PyTorch for training, exported to ONNX, then to TensorRT for deployment. Two nodes each hosted 4× A100 80GB GPUs behind a Kubernetes-based orchestration layer.
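
For context, the export leg of such a pipeline typically looks like the sketch below. This is a minimal illustration, not the team's actual configuration: the model, input shape, and opset version are placeholders.

```python
import torch
from torchvision.models import resnet50

# Hypothetical stand-in for the production model; the real network and
# input shape are not part of this write-up.
model = resnet50(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    # Keep the batch dimension dynamic so the serving layer can vary batch size.
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)
# The ONNX graph is then compiled into a TensorRT engine, e.g.:
#   trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
```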

On paper, the hardware was more than sufficient. In practice, the cluster often behaved as if only a fraction of the capacity was available.

Symptoms

GPU utilization hovered around 18–35% even at peak traffic, P95 latency swung between roughly 120 and 480 ms, nominally identical GPUs differed by up to 2.1× in observed performance, and traffic bursts during the launch period repeatedly required manual intervention.

Initial assessment

The first step was to align all available signals on a common timeline: GPU utilization, CPU utilization, request rate, and queue depth. Even before enabling deeper profiling, this alignment revealed “empty slots” between sequences of GPU kernels. In other words, the GPUs were waiting.
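
As an illustration of that alignment step, here is a minimal sketch. It assumes each metric has been exported to a two-column CSV (timestamp plus value), which is an assumption about tooling for the example, not a description of the team's monitoring stack; the file names are hypothetical.

```python
import pandas as pd

# Hypothetical export files; each has a "ts" timestamp column and one value column.
SOURCES = {
    "gpu_util": "gpu_util.csv",
    "cpu_util": "cpu_util.csv",
    "request_rate": "request_rate.csv",
    "queue_depth": "queue_depth.csv",
}

frames = []
for name, path in SOURCES.items():
    df = pd.read_csv(path, parse_dates=["ts"]).set_index("ts")
    # Resample every signal onto the same 1-second grid so they line up.
    frames.append(df.resample("1s").mean().rename(columns={df.columns[0]: name}))

aligned = pd.concat(frames, axis=1)

# Crude "what is the GPU waiting for" view: seconds where requests are queued
# but GPU utilization is low.
suspicious = aligned[(aligned["queue_depth"] > 0) & (aligned["gpu_util"] < 30)]
print(suspicious.head())
```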

We then reviewed how requests were batched, how data moved between host and device, and how work was scheduled across nodes.

Investigation path

1. Batching behavior under real traffic

The autoscaler and batching logic had been tuned on synthetic load tests, not on real traffic. Under certain burst patterns, the system generated many micro-batches of size 1–2 instead of the intended larger batches. This hurt GPU saturation and created unnecessary overhead on both CPU and GPU.
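
The usual remedy is a batching window that bounds both batch size and added latency: collect requests until the batch is full or a small wait budget expires, whichever comes first. A simplified sketch of that policy follows, with illustrative parameter values rather than the tuned production settings.

```python
import queue
import time

MAX_BATCH = 32       # illustrative; the real value should come from tests on real traffic
MAX_WAIT_MS = 5.0    # upper bound on latency added by waiting for a fuller batch

def next_batch(request_queue: queue.Queue) -> list:
    """Block for the first request, then keep filling the batch until it is
    full or the wait budget is spent."""
    batch = [request_queue.get()]                     # wait for at least one request
    deadline = time.monotonic() + MAX_WAIT_MS / 1000.0
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```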

2. Data movement and host–device transfers

By tracing host–device transfers, we discovered redundant copies before every inference call. Several intermediate representations were serialized and then deserialized again, even when the data followed the same path. This created an extra 8–12 ms of overhead per batch on some code paths.
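
The copies are straightforward to spot in a profiler trace. The sketch below uses the PyTorch profiler on a placeholder model; the production path runs through TensorRT, but the same "Memcpy HtoD" pattern shows up either way, and the pinned-memory copy shows the shape a single, intentional transfer should take.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input, used only to produce a representative trace.
model = torch.nn.Conv2d(3, 16, kernel_size=3).cuda().eval()
batch = torch.randn(8, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        # Pinned memory plus non_blocking=True: one explicit host-to-device copy.
        x = batch.pin_memory().to("cuda", non_blocking=True)
        _ = model(x)

# Repeated "Memcpy HtoD" rows for the same data path are the smell to look for.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```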

3. Kernel-level fragmentation

Profiling at the kernel level showed many small kernels with significant launch overhead. Individually they were inexpensive, but together they filled the timeline with gaps instead of a steady block of useful work.
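
When input shapes are static enough, one common mitigation is to capture the sequence of small kernels into a CUDA graph and replay it, so launch overhead is paid once per batch instead of once per kernel. A minimal PyTorch sketch of the idea is below; the model and shapes are placeholders, and the team's TensorRT path has its own mechanisms for the same goal.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda().eval()

static_in = torch.zeros(32, 1024, device="cuda")

# Warm up on a side stream before capture, as recommended for CUDA graphs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    with torch.no_grad():
        for _ in range(3):
            model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.no_grad():
    with torch.cuda.graph(g):
        static_out = model(static_in)

def infer(batch: torch.Tensor) -> torch.Tensor:
    # Copy new inputs into the captured buffer, replay the whole kernel
    # sequence with a single launch, then read back the captured output.
    static_in.copy_(batch)
    g.replay()
    return static_out.clone()
```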

4. NUMA and node-level considerations

Three GPUs were bound to CPU NUMA nodes that also hosted heavier application logic and background tasks. That imbalance amplified existing bottlenecks and made those GPUs appear “slower” than the others.
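
On Linux, the NUMA node a GPU hangs off is exposed in sysfs, which makes it possible to pin each GPU worker to CPU cores local to its device. The rough sketch below illustrates the idea; the PCI address is a placeholder (in practice it comes from the GPU inventory, e.g. `nvidia-smi --query-gpu=pci.bus_id --format=csv`), and tools like `nvidia-smi topo -m` or `numactl` do the same job without custom code.

```python
import os
from pathlib import Path

# Placeholder PCI address for one of the A100s.
GPU_PCI_ADDR = "0000:3b:00.0"

def cpus_local_to_gpu(pci_addr: str) -> set[int]:
    """Return the CPU ids on the same NUMA node as the given PCI device."""
    node = int(Path(f"/sys/bus/pci/devices/{pci_addr}/numa_node").read_text())
    if node < 0:  # -1 means the platform exposes no NUMA information
        return set(range(os.cpu_count()))
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text().strip()
    cpus: set[int] = set()
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# Pin the current worker process to GPU-local cores so it does not compete
# with unrelated application logic on a remote NUMA node.
os.sched_setaffinity(0, cpus_local_to_gpu(GPU_PCI_ADDR))
```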

Key findings

Four issues dominated: batching logic tuned on synthetic load produced micro-batches under real bursty traffic; redundant serialization and host–device copies added 8–12 ms per batch on some paths; many small kernels left the GPU timeline fragmented by launch overhead; and three GPUs shared NUMA nodes with heavy application logic and background tasks.

Changes applied

The batching and autoscaling logic was retuned against real production traffic rather than synthetic load, the redundant serialization and copy steps were removed, the hottest inference paths were restructured to reduce kernel launch overhead, and the affected workers were rebound so that each ran on CPU cores local to its GPU's NUMA node.

Before & after

| Metric | Before | After |
| --- | --- | --- |
| GPU utilization | 18–35% | 65–78% |
| P95 latency | 120–480 ms (highly variable) | 180–220 ms (stable under load) |
| Cross-GPU variance | Up to 2.1× difference | Approx. 1.1× |
| Launch-period incidents | Frequent manual intervention | No incident during the launch window |

Lessons learned

In this case, there was nothing “wrong” with the GPUs themselves. The real issues were in batching, data movement, scheduling, and NUMA layout. Once those were addressed, the hardware behaved much closer to expectations.

The main takeaway is that low utilization does not automatically mean the GPUs are undersized or at fault. The most practical question is usually: what is the GPU waiting for? Answering that precisely is the core of our work.