Case study · RTX 4090 video processing pipeline

This anonymized example covers a real-time video processing pipeline built on consumer-grade GPUs. The system handled high-resolution input streams and multiple stages of processing before delivering frames to end users.

Background

The team used several machines equipped with NVIDIA GeForce RTX 4090 GPUs to process live and recorded video. The pipeline included decoding, resizing, color transforms, and GPU-accelerated effects before encoding the output.

Under moderate load the system behaved well, but once additional channels were added, the output became noticeably unstable.

Symptoms

- Frame rates swung between roughly 18 and 60 FPS, with jitter that was clearly visible to viewers.
- VRAM usage fluctuated between about 7 and 20 GB, with sudden spikes.
- The instability only appeared once additional channels were enabled.

Investigation path

1. Frame-by-frame timeline

We first constructed a frame-level timeline: CPU preprocessing, GPU kernels, and encode stages for each frame. This made it clear that CPU-side preprocessing was not keeping up when more channels were enabled.
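
As an illustration of the approach, here is a minimal sketch of instrumenting a per-frame timeline with CUDA events plus a host clock. The kernel and the stage boundaries are placeholders of our own, not the pipeline's actual code.

```cpp
// Minimal sketch: timing the CPU and GPU windows of each frame.
// dummy_stage stands in for the real per-frame kernels.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_stage(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 1.001f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaEvent_t gpu_start, gpu_stop;
    cudaEventCreate(&gpu_start);
    cudaEventCreate(&gpu_stop);

    for (int frame = 0; frame < 5; ++frame) {
        // CPU preprocessing window (decode, packing, ... would go here).
        auto cpu_t0 = std::chrono::steady_clock::now();
        auto cpu_t1 = std::chrono::steady_clock::now();

        // GPU window: bracket the frame's kernels with events.
        cudaEventRecord(gpu_start);
        dummy_stage<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaEventRecord(gpu_stop);
        cudaEventSynchronize(gpu_stop);

        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, gpu_start, gpu_stop);
        double cpu_ms =
            std::chrono::duration<double, std::milli>(cpu_t1 - cpu_t0).count();
        std::printf("frame %d: cpu %.3f ms, gpu %.3f ms\n", frame, cpu_ms, gpu_ms);
    }

    cudaEventDestroy(gpu_start);
    cudaEventDestroy(gpu_stop);
    cudaFree(d_buf);
    return 0;
}
```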

2. Memory behavior and VRAM fragmentation

By tracking allocations over time, we observed a pattern of frequent allocate/free cycles across a wide range of buffer sizes. This led to fragmentation and allocator fallbacks, causing unpredictable delays when new buffers were needed.
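
One common remedy, sketched below, is CUDA's stream-ordered pool allocator (`cudaMallocAsync` / `cudaFreeAsync`, available since CUDA 11.2) with a raised release threshold, so freed blocks stay cached in the pool for reuse instead of going back to the driver. The sizes and the threshold here are illustrative, and this is not necessarily the exact allocator the team adopted.

```cpp
// Sketch: stream-ordered pool allocation in place of raw
// cudaMalloc/cudaFree churn (requires CUDA 11.2+).
#include <cstdint>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaSetDevice(device);

    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, device);

    // Keep up to ~4 GB cached in the pool across frees (tunable guess).
    uint64_t threshold = 4ull << 30;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Per-frame buffers of varying sizes now come from the pool.
    for (int frame = 0; frame < 100; ++frame) {
        size_t bytes = (1 + frame % 8) * (1 << 20);  // illustrative sizes
        void* buf = nullptr;
        cudaMallocAsync(&buf, bytes, stream);
        // ... per-frame kernels would run on `stream` here ...
        cudaFreeAsync(buf, stream);  // returned to the pool, not the OS
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```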

3. Kernel launch patterns

The GPU timeline contained a large number of very small kernels. Each kernel was cheap on its own, but launching hundreds of them per frame added substantial launch overhead and left idle gaps on the GPU.
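
One mitigation, shown in the sketch below, is to capture the per-frame chain of small kernels into a CUDA graph so the entire chain replays with a single launch. The kernel body and counts are placeholders; fusing tiny kernels into fewer, larger ones is the complementary fix and is not shown here.

```cpp
// Sketch: capture a chain of small kernels once, replay it per frame.
#include <cuda_runtime.h>

__global__ void tiny_stage(float* buf, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= scale;  // stand-in for one small effect pass
}

int main() {
    const int n = 1 << 16;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the per-frame kernel chain once.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 100; ++k) {  // the "hundreds of tiny kernels"
        tiny_stage<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, 1.0001f);
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    // CUDA 12.x signature; older toolkits use a five-argument form.
    cudaGraphInstantiate(&exec, graph, 0);

    // Replay: one launch per frame instead of one hundred.
    for (int frame = 0; frame < 60; ++frame) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    return 0;
}
```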

4. Data transfers at the wrong stage

Several expensive device-to-host copies ran mid-frame, before each frame's GPU work had completed, stalling subsequent kernels. These transfers had originally been added for debugging and were never consolidated or removed.
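
The sketch below shows one way to keep such copies off the critical path: asynchronous device-to-host copies into pinned host memory on a side stream, gated by events so each copy starts only after its frame's kernels finish. The buffer names and the double-buffering scheme are illustrative assumptions, not the team's actual design.

```cpp
// Sketch: per-frame D2H copies moved off the critical path. Two device
// buffers ping-pong so frame N+1's kernels can run while frame N copies
// out on a side stream into pinned host memory.
#include <cuda_runtime.h>

__global__ void frame_work(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;  // stand-in for the per-frame kernels
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* d_buf[2];
    float* h_buf;
    cudaMalloc(&d_buf[0], bytes);
    cudaMalloc(&d_buf[1], bytes);
    cudaMallocHost(&h_buf, bytes);  // pinned memory enables true async copies

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cudaEvent_t frame_done[2], copy_done[2];
    for (int s = 0; s < 2; ++s) {
        cudaEventCreate(&frame_done[s]);
        cudaEventCreate(&copy_done[s]);
    }

    for (int frame = 0; frame < 60; ++frame) {
        int slot = frame & 1;  // ping-pong between the two device buffers

        // Don't overwrite this buffer until its previous copy-out finished.
        // (Waiting on a never-recorded event completes immediately.)
        cudaStreamWaitEvent(compute, copy_done[slot], 0);
        frame_work<<<(n + 255) / 256, 256, 0, compute>>>(d_buf[slot], n);
        cudaEventRecord(frame_done[slot], compute);

        // The copy starts only after this frame's kernels finish and runs
        // on a side stream, so it no longer stalls the next frame.
        cudaStreamWaitEvent(copy, frame_done[slot], 0);
        cudaMemcpyAsync(h_buf, d_buf[slot], bytes,
                        cudaMemcpyDeviceToHost, copy);
        cudaEventRecord(copy_done[slot], copy);
        // A real pipeline would also rotate host buffers, one per in-flight frame.
    }

    cudaStreamSynchronize(copy);
    for (int s = 0; s < 2; ++s) {
        cudaEventDestroy(frame_done[s]);
        cudaEventDestroy(copy_done[s]);
    }
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFreeHost(h_buf);
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
    return 0;
}
```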

Key findings

- CPU-side preprocessing became a bottleneck as the channel count grew.
- Frequent allocate/free cycles across many buffer sizes fragmented VRAM and triggered allocator fallbacks.
- Hundreds of tiny kernel launches per frame added launch overhead and left idle gaps on the GPU.
- Leftover debug device-to-host copies ran before per-frame work completed, stalling the pipeline.

Changes applied

- Replaced the per-frame allocate/free churn with pooled, reusable buffers, which stabilized VRAM usage.
- Batched and fused the small kernels, cutting launches per frame by roughly 5×.
- Removed the leftover debug transfers and moved the remaining copies off the per-frame critical path.

Before & after

| Metric | Before | After |
| --- | --- | --- |
| Frame rate stability | 18–60 FPS (high jitter) | 58–60 FPS (stable) |
| VRAM usage | 7–20 GB with spikes | Approx. 10–12 GB, steady |
| Kernel launches per frame | Hundreds of tiny kernels | Reduced by ~5× |
| Viewer-perceived jitter | Frequent and noticeable | Effectively eliminated |

Lessons learned

Real-time video workloads are sensitive to small sources of variability. Fragmented VRAM, many small kernels, and unnecessary transfers can all contribute to the perception of “random” jitter. When those factors are brought under control, the hardware often performs as expected.