V-JEPA 2 trains a vision model on roughly one million hours of internet video by predicting masked video segments in latent space, learning real-world physics without labels or pixel reconstruction.
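A minimal sketch of that masked latent-prediction objective, assuming a PyTorch setup: tiny MLPs stand in for the real billion-parameter video transformer, and the EMA target encoder, mask ratio, and L1 loss are illustrative choices rather than the released training recipe.

```python
# Minimal sketch of JEPA-style masked prediction in latent space (PyTorch).
# Tiny MLPs stand in for the real video encoder so the training step is easy to follow.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64          # latent dimension (illustrative)
N_TOKENS = 32     # video patch tokens per clip (illustrative)

encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
predictor = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
target_encoder = copy.deepcopy(encoder)            # EMA copy, never receives gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def training_step(clip_tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """clip_tokens: (B, N_TOKENS, DIM) patch embeddings; mask: (B, N_TOKENS) bool, True = hidden."""
    # Context path: encode the clip with masked tokens zeroed out (simplification).
    visible = clip_tokens * (~mask).unsqueeze(-1)
    context = encoder(visible)
    # Predict latent representations for the hidden positions from the visible context.
    pred = predictor(context)
    # Target path: full clip through the frozen EMA encoder -- no pixel reconstruction.
    with torch.no_grad():
        target = target_encoder(clip_tokens)
    # Regression loss only on masked positions, entirely in latent space.
    loss = F.l1_loss(pred[mask], target[mask])
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Slowly move the target encoder toward the online encoder (EMA update).
    with torch.no_grad():
        for tp, p in zip(target_encoder.parameters(), encoder.parameters()):
            tp.lerp_(p, 0.01)
    return loss

loss = training_step(torch.randn(8, N_TOKENS, DIM), torch.rand(8, N_TOKENS) < 0.5)
```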
The model’s core components are a roughly one-billion-parameter ViT encoder, a predictor that regresses the latent representations of masked video tokens, and 3D rotary position embeddings that give every patch a spatiotemporal coordinate.
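For intuition, here is a hypothetical sketch of how a 3D position embedding can assign each video patch a (time, height, width) coordinate. It uses a simple factorized sin-cos form rather than the rotary variant, and the dimensions and the split across axes are illustrative assumptions.

```python
# Illustrative 3D position embedding: factorized sinusoidal encodings for
# (time, height, width), concatenated per token.
import torch

def sincos_1d(n_pos: int, dim: int) -> torch.Tensor:
    """Standard 1D sin-cos table of shape (n_pos, dim); dim must be even."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)           # (n_pos, 1)
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                      * (-torch.log(torch.tensor(10000.0)) / dim))        # (dim/2,)
    angles = pos * freqs                                                  # (n_pos, dim/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)                # (n_pos, dim)

def pos_embed_3d(t: int, h: int, w: int, dim: int) -> torch.Tensor:
    """Returns (t*h*w, dim) embeddings, with dim split across time, height, width axes."""
    d_t, d_h, d_w = dim // 2, dim // 4, dim // 4      # illustrative split
    emb_t = sincos_1d(t, d_t)                         # (t, d_t)
    emb_h = sincos_1d(h, d_h)                         # (h, d_h)
    emb_w = sincos_1d(w, d_w)                         # (w, d_w)
    # Broadcast each axis table over the full (t, h, w) grid, then concatenate.
    grid = torch.cat([
        emb_t[:, None, None, :].expand(t, h, w, d_t),
        emb_h[None, :, None, :].expand(t, h, w, d_h),
        emb_w[None, None, :, :].expand(t, h, w, d_w),
    ], dim=-1)
    return grid.reshape(t * h * w, dim)

pe = pos_embed_3d(t=8, h=16, w=16, dim=64)   # one embedding per video patch token
```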
After the initial self-supervised stage, freezing the encoder, attaching a roughly 300M-parameter action-conditioned transformer predictor, and training it on just 62 hours of raw robot video yields V-JEPA 2-AC, which performs zero-shot robotic tasks such as reaching, grasping, and pick-and-place with high success rates.
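A hypothetical sketch, not the released interface, of what such an action-conditioned predictor can look like: the frozen encoder's latent tokens plus a projected action token pass through a small transformer that predicts the next step's latents. The sizes, layer counts, and the 7-DoF action format are assumptions for illustration.

```python
# Hypothetical sketch of an action-conditioned latent dynamics model in the
# spirit of V-JEPA 2-AC: a small transformer predicts the next latent state
# from the current latent state and an action.
import torch
import torch.nn as nn

LATENT_DIM = 64   # illustrative; real encoder latents are much larger
ACTION_DIM = 7    # e.g. end-effector delta pose + gripper (assumed)

class ActionConditionedPredictor(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.action_proj = nn.Linear(ACTION_DIM, LATENT_DIM)
        layer = nn.TransformerEncoderLayer(d_model=LATENT_DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(LATENT_DIM, LATENT_DIM)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """state: (B, N, LATENT_DIM) latent tokens; action: (B, ACTION_DIM)."""
        # Prepend the projected action as an extra token, mix with attention,
        # and read out a prediction for each latent token at the next step.
        act_tok = self.action_proj(action).unsqueeze(1)            # (B, 1, D)
        x = self.backbone(torch.cat([act_tok, state], dim=1))
        return self.head(x[:, 1:, :])                              # next-step latents

def rollout(predictor: nn.Module, state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Autoregressively roll a sequence of actions (B, T, ACTION_DIM) through latent space."""
    for t in range(actions.shape[1]):
        state = predictor(state, actions[:, t])
    return state

pred = ActionConditionedPredictor()
final_latent = rollout(pred, torch.randn(2, 16, LATENT_DIM), torch.randn(2, 5, ACTION_DIM))
```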
Action planning with V-JEPA 2-AC uses model predictive control: candidate action sequences are rolled out in latent space and scored against an encoded image goal, taking seconds per planned action versus minutes for comparable diffusion-based world models.
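Below is a sketch of sampling-based MPC with the cross-entropy method, which matches the described planning loop at a high level: roll out candidate action sequences in latent space, score them by distance to the encoded goal image, refit the sampling distribution, and execute only the first action. The hyperparameters and the stand-in predictor are assumptions, not the paper's settings.

```python
# Sketch of sampling-based model predictive control in latent space using the
# cross-entropy method (CEM).
import torch

def plan_with_cem(predictor, current_latent, goal_latent,
                  horizon=5, action_dim=7, n_samples=256, n_elites=32, n_iters=4):
    """Returns the first action of the best-scoring candidate action sequence."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        # Roll every candidate forward in latent space (no pixels involved).
        state = current_latent.expand(n_samples, -1, -1).clone()
        with torch.no_grad():
            for t in range(horizon):
                state = predictor(state, actions[:, t])
            # Score: L1 distance between the predicted final latent and the goal latent.
            cost = (state - goal_latent).abs().mean(dim=(1, 2))
        # Refit the Gaussian to the lowest-cost (elite) sequences.
        elites = actions[cost.topk(n_elites, largest=False).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-4
    # MPC: execute only the first action, then re-plan at the next step.
    return mean[0]

def dummy_predictor(state, action):
    """Stand-in for the action-conditioned predictor sketched above."""
    return state + 0.01 * torch.randn_like(state)

action = plan_with_cem(dummy_predictor, torch.randn(1, 16, 64), torch.randn(1, 16, 64))
```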
Limitations include sensitivity to camera pose, drift over long planning horizons, and the need for image goals instead of natural language instructions.