V-JEPA / I-JEPA - Joint Embedding Predictive Architecture Demo

What is JEPA?

Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework from Meta AI. Instead of reconstructing pixel values, it learns image and video representations by predicting the latent representations of masked patches from the representations of the visible context.
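
The core idea (a loss in representation space, not pixel space) can be sketched with toy linear encoders. Everything below is illustrative: the weight matrices, mean-pooled predictor, and shapes are stand-ins, not the actual I-JEPA architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 16 patches with 8-dim embeddings (real I-JEPA uses ViT encoders).
NUM_PATCHES, DIM = 16, 8

# Hypothetical stand-ins for the context encoder, target encoder, and predictor.
W_context = rng.normal(size=(DIM, DIM))
W_target = W_context.copy()     # target encoder is an EMA copy of the context encoder
W_predictor = rng.normal(size=(DIM, DIM))

patches = rng.normal(size=(NUM_PATCHES, DIM))
masked = rng.choice(NUM_PATCHES, size=4, replace=False)
visible = np.setdiff1d(np.arange(NUM_PATCHES), masked)

# Context encoder sees only visible patches; target encoder sees the full image.
context_repr = patches[visible] @ W_context
target_repr = patches @ W_target

# Predictor maps the pooled context to a prediction for each masked patch.
pooled = context_repr.mean(axis=0)
predictions = np.tile(pooled @ W_predictor, (len(masked), 1))

# The training loss compares representations, never pixels.
loss = np.mean((predictions - target_repr[masked]) ** 2)
```

Because the target is an abstract representation, the model is free to ignore unpredictable pixel-level detail and focus on semantic structure.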

How this demo works

  1. Upload an image
  2. Adjust the mask ratio (fraction of patches to hide)
  3. The model (I-JEPA ViT-H/14) processes the image:
    • Left: Shows which patches are masked (grey regions with red borders)
    • Right: Attention rollout heatmap showing how the encoder distributes attention across patches. White borders mark the masked regions.
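
Step 2's mask ratio translates directly into a count of hidden patches on the ViT patch grid. A minimal sketch of that selection, assuming a 224-pixel input and the 14-pixel patches of ViT-H/14 (the function name and random selection strategy are illustrative):

```python
import numpy as np

def select_masked_patches(image_size=224, patch_size=14, mask_ratio=0.5, seed=0):
    """Pick a random subset of patch indices to hide."""
    grid = image_size // patch_size            # ViT-H/14 on 224px input -> 16x16 grid
    num_patches = grid * grid                  # 256 patches
    num_masked = int(round(num_patches * mask_ratio))
    rng = np.random.default_rng(seed)
    hidden = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[hidden] = True
    return mask.reshape(grid, grid)            # True = hidden patch

mask = select_masked_patches(mask_ratio=0.5)
print(mask.sum(), "of", mask.size, "patches masked")
```

At mask ratio 0.5 this hides 128 of the 256 patches; the left panel simply renders this boolean grid over the image.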

The attention map reveals the model's learned spatial reasoning: how it relates visible context to predict missing regions.
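
Attention rollout itself is a standard procedure (Abnar & Zuidema, 2020): average each layer's attention over heads, add the identity to account for residual connections, re-normalize rows, and multiply the layers together. A minimal sketch, with random matrices standing in for the encoder's attention weights:

```python
import numpy as np

def attention_rollout(attentions):
    """Compose per-layer attention maps into one token-to-token map.

    attentions: list of (heads, tokens, tokens) arrays, one per layer.
    """
    rollout = np.eye(attentions[0].shape[-1])
    for layer_attn in attentions:
        attn = layer_attn.mean(axis=0)                  # average over heads
        attn = attn + np.eye(attn.shape[0])             # residual connection
        attn = attn / attn.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = attn @ rollout                        # compose with earlier layers
    return rollout

rng = np.random.default_rng(0)
attns = [rng.random((4, 16, 16)) for _ in range(3)]  # 3 layers, 4 heads, 16 tokens
rollout = attention_rollout(attns)
```

Each row of the result remains a probability distribution over input patches, which is what the heatmap in the right panel visualizes per patch.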

Model: facebook/ijepa_vith14_1k (ViT-Huge, 632M params, ImageNet-1K pretrained)
