ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

Authors: Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide

Published: 2026 (Preprint)

Source: arXiv

Algorithm: ChopGrad

arXiv: 2603.17812

Summary

ChopGrad makes pixel-domain fine-tuning practical for recurrent latent video diffusion decoders by truncating gradient flow to local temporal windows while still propagating the recurrent state forward for global consistency. The paper pairs a locality argument for why gradients from distant frames can be safely ignored with experiments on video super-resolution, inpainting, neural-rendering enhancement, and controlled driving-video generation, turning a memory cost that grows linearly with video length into an approximately constant-memory training procedure.

Abstract

Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
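The core mechanism described above can be sketched in a few lines: decode frames recurrently, but detach the recurrent state at window boundaries so gradients never flow past a local window, while the (gradient-free) state still carries global context forward. This is a minimal illustrative sketch in PyTorch under assumptions of my own; the decoder class, `chopgrad_step`, and the window size are hypothetical names, not the paper's actual implementation.

```python
# Illustrative sketch of windowed gradient truncation for a recurrent
# frame decoder (assumed PyTorch setup; names are hypothetical, not
# taken from the ChopGrad paper).
import torch
import torch.nn as nn


class RecurrentDecoder(nn.Module):
    """Toy stand-in for a recurrent latent-to-pixel video decoder."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)       # recurrent state update
        self.to_pixels = nn.Linear(dim, dim)   # state -> frame readout

    def decode_frame(self, latent: torch.Tensor, state: torch.Tensor):
        state = self.cell(latent, state)
        return self.to_pixels(state), state


def chopgrad_step(decoder, latents, targets, window: int = 4):
    """Accumulate a pixel-wise (MSE) loss over all frames while
    truncating backpropagation to local windows of `window` frames.

    Detaching the state at each window boundary cuts the gradient
    path, so activations from earlier windows can be freed and peak
    training memory stays roughly constant in the video length, while
    the detached state still conditions later frames for consistency.
    """
    num_frames, batch, dim = latents.shape
    state = torch.zeros(batch, dim)
    total_loss = torch.zeros(())
    for t in range(num_frames):
        if t % window == 0:
            # Stop gradients here: global context flows forward,
            # but no gradient flows back past this boundary.
            state = state.detach()
        frame, state = decoder.decode_frame(latents[t], state)
        total_loss = total_loss + nn.functional.mse_loss(frame, targets[t])
    return total_loss / num_frames
```

With full backpropagation the autograd graph would span all `num_frames` steps; here it never spans more than `window` steps, which is the linear-to-constant memory trade the paper describes (at the cost of ignoring gradient contributions from distant frames).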

Tags

  • Video diffusion

  • Latent diffusion models

  • Truncated backpropagation

  • Pixel-wise losses

  • Memory-efficient training

  • Video generation

  • Video super-resolution

  • Video inpainting

  • Neural rendering

  • Computer vision