V-JEPA / I-JEPA - Joint Embedding Predictive Architecture Demo

What is JEPA?

Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework from Meta AI. Instead of reconstructing pixel values, it learns image and video representations by predicting the latent representations of masked patches from the representations of the visible context.
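
The core idea (a loss in representation space, not pixel space) can be sketched with toy linear encoders. Everything below is illustrative: the weight matrices, mean-pooled predictor, and shapes are stand-ins, not the actual I-JEPA architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 16 patches with 8-dim embeddings (real I-JEPA uses ViT encoders).
NUM_PATCHES, DIM = 16, 8

# Hypothetical stand-ins for the context encoder, target encoder, and predictor.
W_context = rng.normal(size=(DIM, DIM))
W_target = W_context.copy()     # target encoder is an EMA copy of the context encoder
W_predictor = rng.normal(size=(DIM, DIM))

patches = rng.normal(size=(NUM_PATCHES, DIM))
masked = rng.choice(NUM_PATCHES, size=4, replace=False)
visible = np.setdiff1d(np.arange(NUM_PATCHES), masked)

# Context encoder sees only visible patches; target encoder sees the full image.
context_repr = patches[visible] @ W_context
target_repr = patches @ W_target

# Predictor maps the pooled context to a prediction for each masked patch.
pooled = context_repr.mean(axis=0)
predictions = np.tile(pooled @ W_predictor, (len(masked), 1))

# The training loss compares representations, never pixels.
loss = np.mean((predictions - target_repr[masked]) ** 2)
```

Because the target is an abstract representation, the model is free to ignore unpredictable pixel-level detail and focus on semantic structure.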

How this demo works

  1. Upload an image
  2. Adjust the mask ratio (fraction of patches to hide)
  3. The model (I-JEPA ViT-H/14) processes the image:
    • Left: Shows which patches are masked (grey regions with red borders)
    • Right: Attention rollout heatmap showing how the encoder distributes attention across patches. White borders mark the masked regions.
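
Step 2's mask ratio translates directly into a count of hidden patches on the ViT patch grid. A minimal sketch of that selection, assuming a 224-pixel input and the 14-pixel patches of ViT-H/14 (the function name and random selection strategy are illustrative):

```python
import numpy as np

def select_masked_patches(image_size=224, patch_size=14, mask_ratio=0.5, seed=0):
    """Pick a random subset of patch indices to hide."""
    grid = image_size // patch_size            # ViT-H/14 on 224px input -> 16x16 grid
    num_patches = grid * grid                  # 256 patches
    num_masked = int(round(num_patches * mask_ratio))
    rng = np.random.default_rng(seed)
    hidden = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[hidden] = True
    return mask.reshape(grid, grid)            # True = hidden patch

mask = select_masked_patches(mask_ratio=0.5)
print(mask.sum(), "of", mask.size, "patches masked")
```

At mask ratio 0.5 this hides 128 of the 256 patches; the left panel simply renders this boolean grid over the image.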

The attention map reveals the model's learned spatial reasoning: how it relates visible context to predict missing regions.
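
Attention rollout itself is a standard procedure (Abnar & Zuidema, 2020): average each layer's attention over heads, add the identity to account for residual connections, re-normalize rows, and multiply the layers together. A minimal sketch, with random matrices standing in for the encoder's attention weights:

```python
import numpy as np

def attention_rollout(attentions):
    """Compose per-layer attention maps into one token-to-token map.

    attentions: list of (heads, tokens, tokens) arrays, one per layer.
    """
    rollout = np.eye(attentions[0].shape[-1])
    for layer_attn in attentions:
        attn = layer_attn.mean(axis=0)                  # average over heads
        attn = attn + np.eye(attn.shape[0])             # residual connection
        attn = attn / attn.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = attn @ rollout                        # compose with earlier layers
    return rollout

rng = np.random.default_rng(0)
attns = [rng.random((4, 16, 16)) for _ in range(3)]  # 3 layers, 4 heads, 16 tokens
rollout = attention_rollout(attns)
```

Each row of the result remains a probability distribution over input patches, which is what the heatmap in the right panel visualizes per patch.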

Model: facebook/ijepa_vith14_1k (ViT-Huge, 632M params, ImageNet-1K pretrained)
