This anonymized example covers a real-time video processing pipeline built on consumer-grade GPUs. The system ingested high-resolution input streams and ran them through several processing stages before delivering frames to end users.
The team used several machines equipped with NVIDIA GeForce RTX 4090 GPUs to process live and recorded video. The pipeline included decoding, resizing, color transforms, and GPU-accelerated effects before encoding the output.
Under moderate load the system behaved well, but once additional channels were added, the output became noticeably unstable.
We first constructed a frame-level timeline covering CPU preprocessing, GPU kernels, and the encode stage for each frame. This made it clear that CPU-side preprocessing was not keeping up once more channels were enabled.
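One way to build such a timeline is to wrap each stage in NVTX ranges and record the run with Nsight Systems, so CPU preprocessing, GPU work, and encode appear as named spans per frame. The sketch below is purely illustrative; `Frame` and the stage functions are placeholders, not the team's actual code.

```cuda
// Minimal sketch: NVTX ranges around each per-frame stage so that a profiler
// trace shows CPU preprocessing, GPU work, and encode as separate named spans.
#include <nvtx3/nvToolsExt.h>

struct Frame { /* pixel data, timestamps, ... */ };

static void cpu_preprocess(Frame&)     { /* CPU-side resize / color convert */ }
static void launch_gpu_effects(Frame&) { /* enqueue GPU kernels */ }
static void encode(Frame&)             { /* hand the frame to the encoder */ }

void process_frame(Frame& frame) {
    nvtxRangePushA("cpu_preprocess");
    cpu_preprocess(frame);
    nvtxRangePop();

    nvtxRangePushA("gpu_effects");
    launch_gpu_effects(frame);
    nvtxRangePop();

    nvtxRangePushA("encode");
    encode(frame);
    nvtxRangePop();
}

int main() {
    Frame frame;
    process_frame(frame);
    return 0;
}
```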
By tracking allocations over time, we observed a pattern of frequent allocate / free cycles across a wide range of buffer sizes. This led to fragmentation and allocator fallbacks, causing unpredictable delays when new buffers were needed.
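A common remedy for this kind of churn is to route per-frame buffers through CUDA's stream-ordered memory pool (`cudaMallocAsync` / `cudaFreeAsync`) and raise the pool's release threshold so freed blocks stay cached for reuse instead of being handed back to the driver. The sketch below assumes CUDA 11.2 or newer and is a generic illustration, not the project's actual allocator.

```cuda
// Sketch: serve per-frame scratch buffers from the stream-ordered memory pool
// instead of issuing raw cudaMalloc/cudaFree every frame. Assumes CUDA 11.2+.
#include <cuda_runtime.h>
#include <cstdint>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Keep freed blocks cached in the pool rather than returning them to the
    // driver, avoiding repeated expensive allocations and fragmentation.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    uint64_t threshold = UINT64_MAX;  // never trim; tune this to a VRAM budget
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    for (int frame = 0; frame < 100; ++frame) {
        void* scratch = nullptr;
        size_t bytes = 32ull << 20;   // e.g. a 32 MB intermediate buffer
        cudaMallocAsync(&scratch, bytes, stream);

        // ... enqueue the per-frame kernels that use `scratch` on `stream` ...

        cudaFreeAsync(scratch, stream);  // returned to the pool, not the driver
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```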
The GPU timeline contained a large number of very small kernels. Each individual kernel cost little, but launching hundreds of them per frame added up to significant launch overhead and left idle gaps between kernels.
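When the per-frame kernel sequence is fixed, one common way to cut launch overhead is to capture it once into a CUDA graph and replay the whole chain with a single launch per frame. The placeholder kernel below is illustrative, not the pipeline's real effect chain; the snippet assumes CUDA 11.4+ for `cudaGraphInstantiateWithFlags`.

```cuda
// Sketch: capture a fixed chain of small kernels into a CUDA graph once,
// then replay the entire graph with one launch per frame.
#include <cuda_runtime.h>

__global__ void tiny_stage(float* data, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= scale;   // stand-in for one small effect kernel
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the per-frame kernel chain once.
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int stage = 0; stage < 200; ++stage) {            // many tiny kernels
        tiny_stage<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 1.001f);
    }
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&graph_exec, graph, 0);

    // Per frame: a single graph launch instead of hundreds of kernel launches.
    for (int frame = 0; frame < 60; ++frame) {
        cudaGraphLaunch(graph_exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```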
Several expensive device-to-host copies were issued before the per-frame work had completed. These transfers had originally been added for debugging and were never consolidated or removed.
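Where a debug readback genuinely has to stay, it can at least be moved off the critical path: copy into pinned host memory with `cudaMemcpyAsync` on a separate low-priority stream that waits on a per-frame event, and synchronize only when the debug consumer actually needs the data. The sketch below is a generic pattern, not the pipeline's original debug code.

```cuda
// Sketch: take a debug device-to-host readback off the per-frame critical path
// using pinned host memory and an asynchronous copy on a separate stream.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 8ull << 20;          // e.g. one downscaled debug frame
    float* d_frame = nullptr;
    float* h_debug = nullptr;
    cudaMalloc((void**)&d_frame, bytes);
    cudaHostAlloc((void**)&h_debug, bytes, cudaHostAllocDefault);  // pinned

    // Debug copies go on their own low-priority stream.
    int least_priority = 0, greatest_priority = 0;
    cudaDeviceGetStreamPriorityRange(&least_priority, &greatest_priority);
    cudaStream_t render_stream, debug_stream;
    cudaStreamCreate(&render_stream);
    cudaStreamCreateWithPriority(&debug_stream, cudaStreamNonBlocking,
                                 least_priority);

    cudaEvent_t frame_done;
    cudaEventCreateWithFlags(&frame_done, cudaEventDisableTiming);

    // Stand-in for the per-frame GPU work on the render stream.
    cudaMemsetAsync(d_frame, 0, bytes, render_stream);
    cudaEventRecord(frame_done, render_stream);

    // The copy waits for the frame but never stalls the render stream.
    cudaStreamWaitEvent(debug_stream, frame_done, 0);
    cudaMemcpyAsync(h_debug, d_frame, bytes, cudaMemcpyDeviceToHost,
                    debug_stream);

    // Only the debug consumer synchronizes here.
    cudaStreamSynchronize(debug_stream);
    printf("first debug sample: %f\n", h_debug[0]);

    cudaEventDestroy(frame_done);
    cudaStreamDestroy(render_stream);
    cudaStreamDestroy(debug_stream);
    cudaFreeHost(h_debug);
    cudaFree(d_frame);
    return 0;
}
```

Once the allocation churn, the kernel launch overhead, and the leftover transfers were addressed, the key metrics improved as follows: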
| Metric | Before | After |
|---|---|---|
| Frame rate stability | 18–60 FPS (high jitter) | 58–60 FPS (stable) |
| VRAM usage | 7–20 GB with spikes | Approx. 10–12 GB, steady |
| Kernel launches per frame | Hundreds of tiny kernels | Reduced by ~5× |
| Viewer-perceived jitter | Frequent and noticeable | Effectively eliminated |
Real-time video workloads are sensitive to small sources of variability. Fragmented VRAM, many small kernels, and unnecessary transfers can all contribute to the perception of “random” jitter. When those factors are brought under control, the hardware often performs as expected.