Sora 是 OpenAI 发布的一个文本到视频的生成模型,可以根据描述性的文本提示生成高质量的视频内容。其能力标志着人工智能在创意领域的重大飞跃,有望将简单的文本描述转变为丰富的动态视频内容。
Sora 模型的发布,在技术界引起了广泛的关注和讨论,但目前 OpenAI 并没有公开发布 Sora 的计划,而是选择仅向少数研究人员和创意人士提供有限的访问权限,以便获取他们的使用反馈并评估技术的安全性。
We explore large-scale training of generative models on video data.Specifically,we train text-conditional diffusion models jointly on videos and images of variable durations,resolutions and aspect ratios.We leverage a Transformer architecture that operates on spacetime patches of video and image latent codes.Our largest model,Sora,is capable of generating a minute of high fidelity video.Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
This technical report focuses on(1)our method for turning visual data of all types into a unified representation that enables large-scale training of generative models,and(2)qualitative evaluation of Sora’s capabilities and limitations.Model and implementation details are not included in this report.
Sora is a diffusion model; given input noisy patches(and conditioning information like text prompts),it’s trained to predict the original “clean” patches.Importantly,Sora is a diffusion transformer.Transformers have demonstrated remarkable scaling properties across a variety of domains,including language modeling,computer vision,and image generation.
In this work,we find that diffusion transformers scale effectively as video models as well.Below,we show a comparison of video samples with fixed seeds and inputs as training progresses.Sample quality improves markedly as training compute increases.Base compute[block_sep]4x compute[block_sep]32x compute