Revisiting Feature Prediction for Learning Visual Representations from Video

Authors: Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas

Published: 2024 (Preprint)

Source: arXiv

Algorithm: V-JEPA

arXiv: 2404.08471

Summary

V-JEPA tests whether masked feature prediction alone can learn useful video representations, removing common auxiliary signals such as contrastive negatives, text supervision, reconstruction targets, or pretrained image encoders. The result is a video-trained JEPA model whose frozen backbone transfers to both motion-heavy and appearance-heavy tasks, making it a clean example of latent-space prediction as a self-supervised visual objective.
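To make the objective concrete, here is a minimal PyTorch sketch of a masked feature-prediction step in the spirit of V-JEPA. It is an illustration under stated assumptions, not the authors' implementation: `context_encoder`, `predictor`, and the EMA `target_encoder` are stand-in modules, and masking is simplified to boolean index selection over a token sequence. The use of an L1 regression loss, a stop-gradient on the targets, and an exponential-moving-average target encoder follows the paper's described setup.

```python
import torch
import torch.nn.functional as F

def vjepa_step(video_tokens, mask, context_encoder, predictor, target_encoder):
    """One feature-prediction step: predict the target encoder's features
    for masked tokens from the visible (context) tokens, in latent space.
    video_tokens: [B, N, D] spatiotemporal patch tokens; mask: [N] bool."""
    visible = video_tokens[:, ~mask]            # context = unmasked tokens only
    ctx = context_encoder(visible)              # encode the context

    with torch.no_grad():                       # stop-gradient on targets
        targets = target_encoder(video_tokens)  # features for the full clip
        targets = targets[:, mask]              # keep masked positions only

    preds = predictor(ctx, mask)                # predict features at masked positions
    return F.l1_loss(preds, targets)            # L1 regression in feature space

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.998):
    """Target encoder is an exponential moving average of the context encoder;
    the momentum value here is illustrative, not the paper's schedule."""
    for t, c in zip(target_encoder.parameters(), context_encoder.parameters()):
        t.mul_(momentum).add_(c, alpha=1.0 - momentum)
```

Because both the prediction and its target live in representation space rather than pixel space, the model is free to discard unpredictable low-level detail, which is the core difference from reconstruction-based objectives.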

Abstract

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion- and appearance-based tasks, without adaptation of the model's parameters, e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
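The frozen-backbone protocol mentioned above means the pretrained encoder's weights are never updated on the downstream task; only a small head is trained on top of its features. The paper trains an attentive probe for this; the sketch below shows the same protocol with a simpler linear head for brevity, and the names (`encoder`, `probe`, `loader`) are placeholders rather than the paper's code.

```python
import torch
import torch.nn as nn

def evaluate_frozen(encoder, feat_dim, num_classes, loader, device="cuda"):
    """Train a probe on top of a frozen encoder: no backbone updates."""
    encoder.eval()
    for p in encoder.parameters():              # freeze the backbone
        p.requires_grad_(False)

    probe = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

    for clips, labels in loader:
        clips, labels = clips.to(device), labels.to(device)
        with torch.no_grad():
            feats = encoder(clips).mean(dim=1)  # pool token features (illustrative)
        loss = nn.functional.cross_entropy(probe(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```

Reporting results this way isolates the quality of the pretrained representation itself, since end-to-end fine-tuning can mask differences between pretraining objectives.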

Tags

  • Self-supervised learning

  • Video representation learning

  • Joint embedding predictive architecture

  • V-JEPA

  • Feature prediction

  • Computer vision

  • Masked prediction

  • Video understanding

  • Frozen backbone evaluation