V-JEPA / I-JEPA - Joint Embedding Predictive Architecture Demo
What is JEPA?
Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework by Meta AI that learns image/video representations by predicting abstract patch representations rather than pixel values.
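The key idea can be sketched in a few lines of NumPy. This is a toy illustration, not Meta's implementation: the linear matrices stand in for the context encoder, the target encoder (an EMA copy in real JEPA), and the predictor, and the pooled-context prediction is a simplification (real I-JEPA conditions the predictor on target patch positions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 16 patch embeddings of dimension 8 stand in for ViT patch tokens.
patches = rng.normal(size=(16, 8))

# Hypothetical linear stand-ins for the context encoder, the target encoder
# (an EMA copy of the context encoder in JEPA), and the predictor.
W_context = rng.normal(size=(8, 8))
W_target = W_context.copy()
W_pred = rng.normal(size=(8, 8))

visible = np.arange(12)        # context patches the model sees
masked = np.arange(12, 16)     # target patches it must predict

context_repr = patches[visible] @ W_context   # encode visible context
target_repr = patches[masked] @ W_target      # encode targets (no gradient flows here)

# Predict each target's representation from the pooled context.
pred = np.tile(context_repr.mean(axis=0) @ W_pred, (len(masked), 1))

# The loss is computed in representation space, not pixel space -- that is
# what distinguishes JEPA from pixel-reconstruction methods like MAE.
loss = np.mean((pred - target_repr) ** 2)
```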
How this demo works
- Upload an image
- Adjust the mask ratio (fraction of patches to hide)
- The model (I-JEPA ViT-H/14) processes the image and displays two views:
  - Left: which patches are masked (grey regions with red borders)
  - Right: an attention rollout heatmap showing how the encoder distributes attention across patches; white borders mark the masked regions
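The masking step above can be sketched as follows. For ViT-H/14 on a 224×224 input, the image splits into a 16×16 grid of 14-pixel patches (256 tokens). This sketch samples patches uniformly at random; the demo's actual strategy may differ (I-JEPA pretraining, for instance, uses block-wise masks).

```python
import numpy as np

def select_masked_patches(grid=16, mask_ratio=0.5, seed=0):
    """Pick a random subset of patches to hide.

    Sketch only: uniform random sampling, not necessarily the
    demo's or I-JEPA's block-wise masking strategy.
    """
    n = grid * grid                          # 16x16 = 256 patches for ViT-H/14 @ 224px
    k = int(round(n * mask_ratio))           # number of patches to hide
    rng = np.random.default_rng(seed)
    masked = rng.choice(n, size=k, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked] = True
    return mask.reshape(grid, grid)          # boolean grid: True = hidden patch

mask = select_masked_patches(mask_ratio=0.5)
```

The boolean grid maps directly onto the left-hand visualization: `True` cells are the grey, red-bordered regions.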
The attention map reveals the model's learned spatial reasoning: how it relates visible context to predict missing regions.
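Attention rollout (Abnar & Zuidema, 2020) can be reproduced in a few lines: average the attention matrices over heads, add the identity to account for residual connections, renormalize rows, and multiply the per-layer matrices together. A minimal NumPy sketch, exercised here on random stand-in attention tensors rather than real encoder outputs:

```python
import numpy as np

def attention_rollout(attn_layers):
    """Attention rollout: head-averaged, residual-corrected, row-normalized
    attention matrices multiplied across layers.

    attn_layers: iterable of arrays with shape (heads, tokens, tokens),
    ordered from the first transformer layer to the last.
    """
    rollout = None
    for attn in attn_layers:
        a = attn.mean(axis=0)                    # average over heads
        a = a + np.eye(a.shape[0])               # add identity for the residual path
        a = a / a.sum(axis=-1, keepdims=True)    # renormalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout                               # (tokens, tokens): output -> input attribution

# Toy check with random row-stochastic attention matrices (3 layers, 4 heads, 10 tokens).
rng = np.random.default_rng(0)
layers = [rng.random((4, 10, 10)) for _ in range(3)]
for a in layers:
    a /= a.sum(axis=-1, keepdims=True)
rollout = attention_rollout(layers)
```

Each row of the result is a distribution over input tokens, which is what gets reshaped into the heatmap on the right.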
Model: facebook/ijepa_vith14_1k (ViT-Huge, 632M params, ImageNet-1K pretrained)
Mask ratio slider range: 0.1 to 0.9