Case study · mixed GPU rendering farm

This anonymized case covers a rendering farm built over time from a mix of GPU generations. The goal was to make the farm behave more like a single, well-balanced pool of capacity.

Background

The studio operated a small farm of around twenty GPUs: a mix of datacenter cards and high-end consumer cards with different performance profiles. Over the years, new machines had been added as needed, and the software stack had grown organically.

Jobs were submitted through a queueing system running on top of Kubernetes. On paper, the total GPU capacity was sufficient for the studio’s workload. In practice, artists often experienced long queues and unpredictable turnaround times.

Symptoms

Some machines were consistently saturated while others stayed underused.
Render times for similar jobs varied widely depending on where they landed.
Queues grew long during busy periods, even while some GPUs sat idle.
It was difficult to estimate when a batch of jobs would finish.

Investigation path

1. Collecting basic telemetry

We began by collecting GPU utilization, job metadata, and queue statistics across the cluster. This quickly confirmed what the artists already felt: some nodes did far more work than others.
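
One way to put a number on that imbalance is to compare mean utilization per node. The sketch below assumes utilization samples have already been exported as (node, percent) pairs. The node names and values are invented for illustration and are not the studio's data.

```python
from collections import defaultdict
from statistics import mean

def utilization_by_node(samples):
    """Average GPU utilization (percent) per node from (node, percent) samples."""
    per_node = defaultdict(list)
    for node, percent in samples:
        per_node[node].append(percent)
    return {node: mean(values) for node, values in per_node.items()}

def imbalance(per_node_mean):
    """Gap between the busiest and the least busy node, in percentage points."""
    return max(per_node_mean.values()) - min(per_node_mean.values())

# Hypothetical samples; real ones would come from the farm's monitoring.
samples = [
    ("node-a", 95), ("node-a", 97),
    ("node-b", 30), ("node-b", 20),
    ("node-c", 60), ("node-c", 70),
]
means = utilization_by_node(samples)
print(means)
print("spread:", imbalance(means))
```

A large spread that persists over time suggests a placement problem rather than a shortage of capacity.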

2. Understanding the scheduler

The existing scheduler treated all GPUs as if they were roughly equivalent. In reality, there were substantial differences in performance between cards. Heavy jobs could easily land on slower GPUs, while fast GPUs idled or handled lighter tasks.

3. Job and asset characteristics

We looked at which jobs caused the longest queues and the heaviest load. These often involved large textures and complex shading setups that were noticeably more sensitive to disk and network performance.

4. Data loading and I/O

Tracing the path of assets from storage to GPU showed that certain nodes had slower effective access to shared storage, which further amplified differences between machines.
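
A simple way to compare storage paths between nodes is to time a sequential read of the same file from each machine. The following is a minimal sketch, not the tooling used in this case. The function name is hypothetical, and the demo reads a temporary local file only so that the example runs anywhere.

```python
import os
import tempfile
import time

def read_throughput_mb_s(path, chunk_size=1024 * 1024):
    """Read a file sequentially and return effective throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small files or coarse timers.
    return (total / 1e6) / max(elapsed, 1e-9)

# Demo on a temporary file; on the farm this would be a representative
# asset on the shared storage mount, read from each node in turn.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    sample_path = tmp.name
try:
    rate = read_throughput_mb_s(sample_path)
    print(f"{rate:.1f} MB/s")
finally:
    os.remove(sample_path)
```

Operating system caching can inflate repeated reads of the same file, so a fair comparison would use a file that has not been read recently on the node under test.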

Key findings

GPU performance tiers were not reflected in scheduling decisions.
Heavy jobs were not directed toward the strongest nodes.
Some nodes had slower access to shared storage, which increased render times on those nodes.
Queueing was purely FIFO and did not account for job complexity.
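
To make the last finding concrete, the sketch below compares average waiting time under strict FIFO with a size-aware order for a made-up batch on a single GPU. Shortest-first ordering is used here only as a textbook stand-in. The case study does not describe the exact policy that was adopted, and the durations are hypothetical.

```python
def average_wait(durations):
    """Mean time each job waits before starting when jobs run one after another."""
    waited, clock = 0, 0
    for d in durations:
        waited += clock   # this job waited for everything ahead of it
        clock += d
    return waited / len(durations)

# Hypothetical batch: estimated render minutes, in arrival order.
arrival_order = [120, 5, 5, 10]

fifo_wait = average_wait(arrival_order)
size_aware_wait = average_wait(sorted(arrival_order))
print(fifo_wait, size_aware_wait)
```

With one heavy job at the head of the queue, every lighter job behind it inherits the full delay. Pure shortest-first ordering can starve large jobs, so a production policy would normally add aging or per-class limits.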

Changes applied

Grouped GPUs into a small number of performance tiers and exposed this to the scheduler.
Directed the heaviest classes of jobs toward the strongest tier by default.
Adjusted placement for nodes with slower storage paths to avoid overloading them.
Added simple size-based hints to jobs so that large and small tasks mixed more predictably in the queue.
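
The first three changes can be sketched as a small placement function. This is an illustration under stated assumptions, not the studio's scheduler: the `Node` fields, the "heavy" and "light" labels, and the rule that light jobs fill slower tiers first are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    tier: int            # 1 = fastest tier; higher numbers are slower
    slow_storage: bool   # True if the node has a slower path to shared storage
    free_gpus: int

def pick_node(nodes, job_weight):
    """Choose a node for a job; job_weight is 'heavy' or 'light' (hypothetical labels)."""
    candidates = [n for n in nodes if n.free_gpus > 0]
    if not candidates:
        return None  # nothing free: the job stays queued
    if job_weight == "heavy":
        # Heavy jobs: fastest tier first, and avoid slow storage paths.
        key = lambda n: (n.tier, n.slow_storage, -n.free_gpus)
    else:
        # Light jobs (assumed policy): fill slower tiers first so the
        # fast tier stays available for heavy work.
        key = lambda n: (-n.tier, n.slow_storage, -n.free_gpus)
    return min(candidates, key=key)

nodes = [
    Node("dc-1", tier=1, slow_storage=False, free_gpus=1),
    Node("dc-2", tier=1, slow_storage=True, free_gpus=2),
    Node("consumer-1", tier=2, slow_storage=False, free_gpus=1),
]
print(pick_node(nodes, "heavy").name)
print(pick_node(nodes, "light").name)
```

On a Kubernetes-based farm, the same preferences are commonly expressed with node labels and affinity rules rather than custom code. The function above only shows the decision logic.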

Before & after

Metric	Before	After
GPU load distribution	Highly uneven	Much more balanced across the farm
Average render time	Varied significantly per job	Reduced by roughly 20–25%
Queue length during busy periods	Frequently long and unpredictable	Noticeably shorter and more stable
Artist perception	Hard to know when work would finish	Turnaround felt more consistent and manageable

Lessons learned

Heterogeneous GPU fleets are a fact of life for many teams. Treating all cards as equivalent can hide a lot of capacity and create unnecessary frustration. A small amount of awareness in the scheduler goes a long way toward making a mixed farm feel like a unified resource.
