V-JEPA 2 trains a vision model on roughly one million hours of internet video by predicting masked video segments in latent space, learning real-world physics without labels or pixel reconstruction.
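A minimal sketch of that masked latent-prediction objective, assuming a PyTorch setup: tiny MLPs stand in for the real billion-parameter video transformer, and the EMA target encoder, mask ratio, and L1 loss are illustrative choices rather than the released training recipe.

```python
# Minimal sketch of JEPA-style masked prediction in latent space (PyTorch).
# Tiny MLPs stand in for the real video encoder so the training step is easy to follow.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64          # latent dimension (illustrative)
N_TOKENS = 32     # video patch tokens per clip (illustrative)

encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
predictor = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
target_encoder = copy.deepcopy(encoder)            # EMA copy, never receives gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def training_step(clip_tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """clip_tokens: (B, N_TOKENS, DIM) patch embeddings; mask: (B, N_TOKENS) bool, True = hidden."""
    # Context path: encode the clip with masked tokens zeroed out (simplification).
    visible = clip_tokens * (~mask).unsqueeze(-1)
    context = encoder(visible)
    # Predict latent representations for the hidden positions from the visible context.
    pred = predictor(context)
    # Target path: full clip through the frozen EMA encoder -- no pixel reconstruction.
    with torch.no_grad():
        target = target_encoder(clip_tokens)
    # Regression loss only on masked positions, entirely in latent space.
    loss = F.l1_loss(pred[mask], target[mask])
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Slowly move the target encoder toward the online encoder (EMA update).
    with torch.no_grad():
        for tp, p in zip(target_encoder.parameters(), encoder.parameters()):
            tp.lerp_(p, 0.01)
    return loss

loss = training_step(torch.randn(8, N_TOKENS, DIM), torch.rand(8, N_TOKENS) < 0.5)
```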
The model’s core components are a roughly one-billion-parameter ViT encoder, a predictor that regresses the latent representations of masked video tokens, and 3D rotary position embeddings that give every patch a spatiotemporal coordinate.
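For intuition, here is a hypothetical sketch of how a 3D position embedding can assign each video patch a (time, height, width) coordinate. It uses a simple factorized sin-cos form rather than the rotary variant, and the dimensions and the split across axes are illustrative assumptions.

```python
# Illustrative 3D position embedding: factorized sinusoidal encodings for
# (time, height, width), concatenated per token.
import torch

def sincos_1d(n_pos: int, dim: int) -> torch.Tensor:
    """Standard 1D sin-cos table of shape (n_pos, dim); dim must be even."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)           # (n_pos, 1)
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                      * (-torch.log(torch.tensor(10000.0)) / dim))        # (dim/2,)
    angles = pos * freqs                                                  # (n_pos, dim/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)                # (n_pos, dim)

def pos_embed_3d(t: int, h: int, w: int, dim: int) -> torch.Tensor:
    """Returns (t*h*w, dim) embeddings, with dim split across time, height, width axes."""
    d_t, d_h, d_w = dim // 2, dim // 4, dim // 4      # illustrative split
    emb_t = sincos_1d(t, d_t)                         # (t, d_t)
    emb_h = sincos_1d(h, d_h)                         # (h, d_h)
    emb_w = sincos_1d(w, d_w)                         # (w, d_w)
    # Broadcast each axis table over the full (t, h, w) grid, then concatenate.
    grid = torch.cat([
        emb_t[:, None, None, :].expand(t, h, w, d_t),
        emb_h[None, :, None, :].expand(t, h, w, d_h),
        emb_w[None, None, :, :].expand(t, h, w, d_w),
    ], dim=-1)
    return grid.reshape(t * h * w, dim)

pe = pos_embed_3d(t=8, h=16, w=16, dim=64)   # one embedding per video patch token
```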
After the initial self-supervised stage, freezing the encoder, attaching a roughly 300M-parameter action-conditioned transformer predictor, and training it on just 62 hours of raw robot video yields V-JEPA 2-AC, which performs zero-shot robotic tasks such as reaching, grasping, and pick-and-place with high success rates.
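A hypothetical sketch, not the released interface, of what such an action-conditioned predictor can look like: the frozen encoder's latent tokens plus a projected action token pass through a small transformer that predicts the next step's latents. The sizes, layer counts, and the 7-DoF action format are assumptions for illustration.

```python
# Hypothetical sketch of an action-conditioned latent dynamics model in the
# spirit of V-JEPA 2-AC: a small transformer predicts the next latent state
# from the current latent state and an action.
import torch
import torch.nn as nn

LATENT_DIM = 64   # illustrative; real encoder latents are much larger
ACTION_DIM = 7    # e.g. end-effector delta pose + gripper (assumed)

class ActionConditionedPredictor(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.action_proj = nn.Linear(ACTION_DIM, LATENT_DIM)
        layer = nn.TransformerEncoderLayer(d_model=LATENT_DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(LATENT_DIM, LATENT_DIM)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """state: (B, N, LATENT_DIM) latent tokens; action: (B, ACTION_DIM)."""
        # Prepend the projected action as an extra token, mix with attention,
        # and read out a prediction for each latent token at the next step.
        act_tok = self.action_proj(action).unsqueeze(1)            # (B, 1, D)
        x = self.backbone(torch.cat([act_tok, state], dim=1))
        return self.head(x[:, 1:, :])                              # next-step latents

def rollout(predictor: nn.Module, state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Autoregressively roll a sequence of actions (B, T, ACTION_DIM) through latent space."""
    for t in range(actions.shape[1]):
        state = predictor(state, actions[:, t])
    return state

pred = ActionConditionedPredictor()
final_latent = rollout(pred, torch.randn(2, 16, LATENT_DIM), torch.randn(2, 5, ACTION_DIM))
```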
Action planning with V-JEPA 2-AC uses model predictive control: candidate action sequences are rolled out in latent space and scored against an encoded image goal, taking seconds per planned action versus minutes for comparable diffusion-based world models.
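Below is a sketch of sampling-based MPC with the cross-entropy method, which matches the described planning loop at a high level: roll out candidate action sequences in latent space, score them by distance to the encoded goal image, refit the sampling distribution, and execute only the first action. The hyperparameters and the stand-in predictor are assumptions, not the paper's settings.

```python
# Sketch of sampling-based model predictive control in latent space using the
# cross-entropy method (CEM).
import torch

def plan_with_cem(predictor, current_latent, goal_latent,
                  horizon=5, action_dim=7, n_samples=256, n_elites=32, n_iters=4):
    """Returns the first action of the best-scoring candidate action sequence."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        # Roll every candidate forward in latent space (no pixels involved).
        state = current_latent.expand(n_samples, -1, -1).clone()
        with torch.no_grad():
            for t in range(horizon):
                state = predictor(state, actions[:, t])
            # Score: L1 distance between the predicted final latent and the goal latent.
            cost = (state - goal_latent).abs().mean(dim=(1, 2))
        # Refit the Gaussian to the lowest-cost (elite) sequences.
        elites = actions[cost.topk(n_elites, largest=False).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-4
    # MPC: execute only the first action, then re-plan at the next step.
    return mean[0]

def dummy_predictor(state, action):
    """Stand-in for the action-conditioned predictor sketched above."""
    return state + 0.01 * torch.randn_like(state)

action = plan_with_cem(dummy_predictor, torch.randn(1, 16, 64), torch.randn(1, 16, 64))
```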
Limitations include sensitivity to camera pose, drift over long planning horizons, and the need for image goals instead of natural language instructions.