Case study · RTX 4090 video processing pipeline

This anonymized example covers a real-time video processing pipeline built on consumer-grade GPUs. The system handled high-resolution input streams and multiple stages of processing before delivering frames to end users.

Background

The team used several machines equipped with NVIDIA GeForce RTX 4090 GPUs to process live and recorded video. The pipeline included decoding, resizing, color transforms, and GPU-accelerated effects before encoding the output.

Under moderate load the system behaved well, but once additional channels were added, the output became noticeably unstable.

Symptoms

- Frame rates swung between roughly 18 and 60 FPS, with jitter that was clearly visible to viewers.
- VRAM usage fluctuated between about 7 and 20 GB, with sudden spikes.
- The instability only appeared once additional channels were enabled.

Investigation path

1. Frame-by-frame timeline

We first constructed a frame-level timeline: CPU preprocessing, GPU kernels, and encode stages for each frame. This made it clear that CPU-side preprocessing was not keeping up when more channels were enabled.
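
As an illustration of the approach, here is a minimal sketch of instrumenting a per-frame timeline with CUDA events plus a host clock. The kernel and the stage boundaries are placeholders of our own, not the pipeline's actual code.

```cpp
// Minimal sketch: timing the CPU and GPU windows of each frame.
// dummy_stage stands in for the real per-frame kernels.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_stage(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 1.001f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaEvent_t gpu_start, gpu_stop;
    cudaEventCreate(&gpu_start);
    cudaEventCreate(&gpu_stop);

    for (int frame = 0; frame < 5; ++frame) {
        // CPU preprocessing window (decode, packing, ... would go here).
        auto cpu_t0 = std::chrono::steady_clock::now();
        auto cpu_t1 = std::chrono::steady_clock::now();

        // GPU window: bracket the frame's kernels with events.
        cudaEventRecord(gpu_start);
        dummy_stage<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaEventRecord(gpu_stop);
        cudaEventSynchronize(gpu_stop);

        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, gpu_start, gpu_stop);
        double cpu_ms =
            std::chrono::duration<double, std::milli>(cpu_t1 - cpu_t0).count();
        std::printf("frame %d: cpu %.3f ms, gpu %.3f ms\n", frame, cpu_ms, gpu_ms);
    }

    cudaEventDestroy(gpu_start);
    cudaEventDestroy(gpu_stop);
    cudaFree(d_buf);
    return 0;
}
```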

2. Memory behavior and VRAM fragmentation

By tracking allocations over time, we observed a pattern of frequent allocate/free cycles across a wide range of buffer sizes. This led to fragmentation and allocator fallbacks, causing unpredictable delays when new buffers were needed.
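
One common remedy, sketched below, is CUDA's stream-ordered pool allocator (`cudaMallocAsync` / `cudaFreeAsync`, available since CUDA 11.2) with a raised release threshold, so freed blocks stay cached in the pool for reuse instead of going back to the driver. The sizes and the threshold here are illustrative, and this is not necessarily the exact allocator the team adopted.

```cpp
// Sketch: stream-ordered pool allocation in place of raw
// cudaMalloc/cudaFree churn (requires CUDA 11.2+).
#include <cstdint>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaSetDevice(device);

    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, device);

    // Keep up to ~4 GB cached in the pool across frees (tunable guess).
    uint64_t threshold = 4ull << 30;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Per-frame buffers of varying sizes now come from the pool.
    for (int frame = 0; frame < 100; ++frame) {
        size_t bytes = (1 + frame % 8) * (1 << 20);  // illustrative sizes
        void* buf = nullptr;
        cudaMallocAsync(&buf, bytes, stream);
        // ... per-frame kernels would run on `stream` here ...
        cudaFreeAsync(buf, stream);  // returned to the pool, not the OS
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```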

3. Kernel launch patterns

The GPU timeline contained a large number of very small kernels. Each kernel was cheap on its own, but launching hundreds of them per frame added substantial launch overhead and left idle gaps on the GPU.
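
One mitigation, shown in the sketch below, is to capture the per-frame chain of small kernels into a CUDA graph so the entire chain replays with a single launch. The kernel body and counts are placeholders; fusing tiny kernels into fewer, larger ones is the complementary fix and is not shown here.

```cpp
// Sketch: capture a chain of small kernels once, replay it per frame.
#include <cuda_runtime.h>

__global__ void tiny_stage(float* buf, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= scale;  // stand-in for one small effect pass
}

int main() {
    const int n = 1 << 16;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the per-frame kernel chain once.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 100; ++k) {  // the "hundreds of tiny kernels"
        tiny_stage<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, 1.0001f);
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    // CUDA 12.x signature; older toolkits use a five-argument form.
    cudaGraphInstantiate(&exec, graph, 0);

    // Replay: one launch per frame instead of one hundred.
    for (int frame = 0; frame < 60; ++frame) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    return 0;
}
```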

4. Data transfers at the wrong stage

Several expensive device-to-host copies ran mid-frame, before each frame's GPU work had completed, stalling subsequent kernels. These transfers had originally been added for debugging and were never consolidated or removed.
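
The sketch below shows one way to keep such copies off the critical path: asynchronous device-to-host copies into pinned host memory on a side stream, gated by events so each copy starts only after its frame's kernels finish. The buffer names and the double-buffering scheme are illustrative assumptions, not the team's actual design.

```cpp
// Sketch: per-frame D2H copies moved off the critical path. Two device
// buffers ping-pong so frame N+1's kernels can run while frame N copies
// out on a side stream into pinned host memory.
#include <cuda_runtime.h>

__global__ void frame_work(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;  // stand-in for the per-frame kernels
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* d_buf[2];
    float* h_buf;
    cudaMalloc(&d_buf[0], bytes);
    cudaMalloc(&d_buf[1], bytes);
    cudaMallocHost(&h_buf, bytes);  // pinned memory enables true async copies

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cudaEvent_t frame_done[2], copy_done[2];
    for (int s = 0; s < 2; ++s) {
        cudaEventCreate(&frame_done[s]);
        cudaEventCreate(&copy_done[s]);
    }

    for (int frame = 0; frame < 60; ++frame) {
        int slot = frame & 1;  // ping-pong between the two device buffers

        // Don't overwrite this buffer until its previous copy-out finished.
        // (Waiting on a never-recorded event completes immediately.)
        cudaStreamWaitEvent(compute, copy_done[slot], 0);
        frame_work<<<(n + 255) / 256, 256, 0, compute>>>(d_buf[slot], n);
        cudaEventRecord(frame_done[slot], compute);

        // The copy starts only after this frame's kernels finish and runs
        // on a side stream, so it no longer stalls the next frame.
        cudaStreamWaitEvent(copy, frame_done[slot], 0);
        cudaMemcpyAsync(h_buf, d_buf[slot], bytes,
                        cudaMemcpyDeviceToHost, copy);
        cudaEventRecord(copy_done[slot], copy);
        // A real pipeline would also rotate host buffers, one per in-flight frame.
    }

    cudaStreamSynchronize(copy);
    for (int s = 0; s < 2; ++s) {
        cudaEventDestroy(frame_done[s]);
        cudaEventDestroy(copy_done[s]);
    }
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFreeHost(h_buf);
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
    return 0;
}
```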

Key findings

- CPU-side preprocessing became a bottleneck as the channel count grew.
- Frequent allocate/free cycles across many buffer sizes fragmented VRAM and triggered allocator fallbacks.
- Hundreds of tiny kernel launches per frame added launch overhead and left idle gaps on the GPU.
- Leftover debug device-to-host copies ran before per-frame work completed, stalling the pipeline.

Changes applied

- Replaced the per-frame allocate/free churn with pooled, reusable buffers, which stabilized VRAM usage.
- Batched and fused the small kernels, cutting launches per frame by roughly 5×.
- Removed the leftover debug transfers and moved the remaining copies off the per-frame critical path.

Before & after

| Metric | Before | After |
| --- | --- | --- |
| Frame rate stability | 18–60 FPS (high jitter) | 58–60 FPS (stable) |
| VRAM usage | 7–20 GB with spikes | Approx. 10–12 GB, steady |
| Kernel launches per frame | Hundreds of tiny kernels | Reduced by ~5× |
| Viewer-perceived jitter | Frequent and noticeable | Effectively eliminated |

Lessons learned

Real-time video workloads are sensitive to small sources of variability. Fragmented VRAM, many small kernels, and unnecessary transfers can all contribute to the perception of “random” jitter. When those factors are brought under control, the hardware often performs as expected.