We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high-fidelity video. Our results suggest that scaling video generation models is a promising path towards building general-purpose simulators of the physical world.
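
To make the idea of spacetime patches concrete, the following is a minimal sketch of how a latent video tensor could be cut into a sequence of patch tokens. It is an illustration under stated assumptions, not the method used for Sora: the function name `spacetime_patches`, the `(T, H, W, C)` latent layout, and the patch sizes are all hypothetical choices made for this example.

```python
import numpy as np

def spacetime_patches(latent, pt, ph, pw):
    """Split a latent of shape (T, H, W, C) into flattened spacetime patches.

    Hypothetical helper for illustration only. Each patch spans pt frames and
    a ph x pw spatial window. Returns an array of shape
    (num_patches, pt * ph * pw * C).
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Separate each axis into (number of patches, size within a patch).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the patch-index axes to the front, the within-patch axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch: a token sequence a transformer could consume.
    return x.reshape(-1, pt * ph * pw * C)

# Two latents with different durations and resolutions (assumed shapes).
clip_a = np.arange(8 * 16 * 16 * 4, dtype=np.float32).reshape(8, 16, 16, 4)
clip_b = np.zeros((4, 8, 12, 4), dtype=np.float32)

tokens_a = spacetime_patches(clip_a, pt=2, ph=4, pw=4)
tokens_b = spacetime_patches(clip_b, pt=2, ph=4, pw=4)
print(tokens_a.shape, tokens_b.shape)
```

In this sketch, inputs of different duration and resolution yield sequences of different length but the same token width, which is one way a single sequence model could handle variable-size inputs.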

