How Good Is Zhipu's Open-Source Take on Sora? Can CogVideoX Compete with Sora?

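Zhipu AI has announced that it is open-sourcing its latest video generation model, CogVideoX, with the aim of accelerating the development and adoption of video generation technology. CogVideoX is built on large-model technology and is positioned as able to meet commercial-grade requirements. The currently open-sourced CogVideoX-2B needs only 18 GB of VRAM for inference at FP16 precision and 40 GB for fine-tuning. This means inference fits on a single RTX 4090 (24 GB), while fine-tuning fits on a single RTX A6000 (48 GB).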
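The hardware claim above is simple arithmetic, and a short sketch makes it explicit. The 18 GB and 40 GB figures come from the article; the 24 GB and 48 GB capacities are the published VRAM sizes of the two cards. Treating "2B" as roughly 2 billion parameters is an assumption, used only to show that the FP16 weights account for a small part of the quoted 18 GB, with the remainder presumably going to activations and the other components of the pipeline.

```python
# Published VRAM capacities of the two cards named above, in GB.
GPU_VRAM_GB = {"RTX 4090": 24, "RTX A6000": 48}

# Requirements quoted in the article for CogVideoX-2B at FP16.
REQUIRED_GB = {"inference": 18, "fine-tuning": 40}


def fp16_weight_gb(num_params: float) -> float:
    """Memory for the weights alone at 2 bytes per parameter, in GB (1e9 bytes)."""
    return num_params * 2 / 1e9


def fits(task: str, gpu: str) -> bool:
    """True if the quoted requirement for `task` is within the card's VRAM."""
    return REQUIRED_GB[task] <= GPU_VRAM_GB[gpu]


# Assumption: "2B" means about 2 billion parameters.
weights_gb = fp16_weight_gb(2e9)
print(f"FP16 weights alone: about {weights_gb:.0f} GB")

for task in REQUIRED_GB:
    for gpu in GPU_VRAM_GB:
        verdict = "fits" if fits(task, gpu) else "does not fit"
        print(f"{task} on {gpu}: {verdict}")
```

The check ignores framework overhead and memory fragmentation, so a real run may need some headroom beyond the quoted figures.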

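At present, CogVideoX accepts prompts of up to 226 tokens and generates 6-second videos at 8 frames per second with a resolution of 720×480. To train the model, Zhipu AI developed a method for selecting high-quality video data, which screens out problem videos such as those that are over-edited or have incoherent motion, safeguarding the quality of the training data. Zhipu AI also built a pipeline that generates video captions from image captions, which addresses the lack of text descriptions for video data. Overall, the generation quality is still fairly poor, but since the model is open source, it is worth studying together. (The GitHub and Hugging Face links are at the bottom of the article.)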
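To put the output specification in perspective, the sketch below computes the number of frames and the uncompressed size of one generated clip. The duration, frame rate, and resolution are taken from the paragraph above; storing frames as 8-bit RGB (3 bytes per pixel) is an assumption made for the estimate.

```python
DURATION_S = 6           # from the article
FPS = 8                  # from the article
WIDTH, HEIGHT = 720, 480  # from the article
BYTES_PER_PIXEL = 3      # assumption: 8-bit RGB, no alpha channel

num_frames = DURATION_S * FPS
pixels_per_frame = WIDTH * HEIGHT
raw_bytes = num_frames * pixels_per_frame * BYTES_PER_PIXEL

print(f"frames per clip:  {num_frames}")
print(f"pixels per frame: {pixels_per_frame:,}")
print(f"raw clip size:    {raw_bytes / 2**20:.1f} MiB")
```

Even a short, low-frame-rate clip is tens of megabytes of raw pixels, which is why the model works on a compressed representation rather than on pixels directly.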

Technical Principles

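Video data carries a large amount of information and therefore demands enormous storage and compute. To address this, a 3D VAE is introduced, consisting of three parts: an encoder that compresses the video into a compact latent code, a decoder that reconstructs the video from that code, and a latent-space regularizer that keeps the information passed between encoder and decoder accurate. To cope with the difficulty of generalizing to high resolutions and larger frame counts, training is split into two stages. After the first stage, the outputs still contain artifacts such as watermarks; the model is then fine-tuned with context parallelism on videos with more frames, which ultimately yields finer, watermark-free images.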
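A rough sketch of how much a 3D VAE encoder shrinks a clip. The article does not state the compression factors, so every number in the function signature is an assumption: 4x temporal and 8x spatial downsampling, 16 latent channels, a causal design that encodes the first frame on its own, and a 49-frame input (the 48 frames of a 6-second clip at 8 fps plus one initial frame).

```python
from math import prod


def latent_shape(frames, height, width, t_factor=4, s_factor=8, channels=16):
    """Return the (C, T, H, W) latent shape under the assumed factors.

    The first frame maps to its own latent frame; each further group of
    t_factor frames maps to one more latent frame.
    """
    if (frames - 1) % t_factor or height % s_factor or width % s_factor:
        raise ValueError("input size is not compatible with the assumed factors")
    return (
        channels,
        (frames - 1) // t_factor + 1,
        height // s_factor,
        width // s_factor,
    )


video_shape = (3, 49, 480, 720)  # RGB, assumed 49 frames, 480 rows, 720 columns
latent = latent_shape(frames=49, height=480, width=720)
ratio = prod(video_shape) / prod(latent)

print("latent shape (C, T, H, W):", latent)
print(f"values in video / values in latent: about {ratio:.0f}x")
```

Under these assumptions the Transformer sees a tensor with tens of times fewer values than the raw clip, and the decoder is responsible for expanding it back to pixels.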

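The Transformer processes text and video jointly to generate a video as follows. First, the text is converted into embeddings by a T5 model, while the video is converted into latent codes by the 3D VAE. The two sequences are then concatenated. Next, the stacked Transformer blocks take on different tasks, such as modeling the spatial and temporal information in the video and controlling the flow of information, and progressively extract more useful features layer by layer. Finally, the processed codes are converted back to their original form to produce the final video.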
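The concatenation step can be illustrated by counting tokens. In the sketch below, the 226-token prompt limit comes from the article, while the latent grid of 13 x 60 x 90 and the 2 x 2 spatial patch size are assumptions chosen for illustration. The function names are hypothetical and do not correspond to any library API.

```python
TEXT_TOKENS = 226  # prompt limit from the article


def video_tokens(latent_t, latent_h, latent_w, patch=2):
    """Number of video tokens when each latent frame is cut into patch x patch tiles."""
    if latent_h % patch or latent_w % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return latent_t * (latent_h // patch) * (latent_w // patch)


def joint_sequence(text_len, latent_t, latent_h, latent_w, patch=2):
    """Build a tagged sequence: all text positions first, then all video patches."""
    text_part = [("text", i) for i in range(text_len)]
    video_part = [
        ("video", t, h, w)
        for t in range(latent_t)
        for h in range(latent_h // patch)
        for w in range(latent_w // patch)
    ]
    return text_part + video_part


# Assumed latent grid: 13 latent frames of 60 x 90.
n_video = video_tokens(13, 60, 90)
sequence = joint_sequence(TEXT_TOKENS, 13, 60, 90)

print("video tokens:", n_video)
print("joint sequence length:", len(sequence))
print("first video token sits at index", TEXT_TOKENS, "->", sequence[TEXT_TOKENS])
```

Because attention cost grows with the square of the sequence length, the video tokens dominate the compute, which is one reason the compression done before this step matters so much.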

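Videos suitable for training should be dynamic, coherent, natural recordings of the real world. Most videos fall short, however, because of editing interference (transitions, splicing, special effects) or shooting disturbances (camera shake, poor equipment). The data is therefore cleaned according to the following negative labels: signs of editing, lack of motion coherence, low quality, lecture-style content, text-dominated frames, and noisy screenshots. In total, 20,000 video samples were annotated, several filters were trained on top of Video-LLaMA using these annotations, and videos were additionally scored on optical flow and aesthetics, so that the generated videos are smooth and visually pleasing.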
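The cleaning logic described above can be sketched as a two-step filter: reject any clip that carries a negative label, then apply thresholds on the optical-flow and aesthetic scores. This is a minimal illustration, not Zhipu AI's pipeline. The `Clip` class, the label spellings, and both threshold values are hypothetical, and the labels and scores are assumed to have been produced upstream by the trained filters.

```python
from dataclasses import dataclass, field

# The six negative labels listed in the article (identifier spellings are ours).
NEGATIVE_LABELS = frozenset({
    "editing",
    "lack_of_motion_coherence",
    "low_quality",
    "lecture",
    "text_dominated",
    "noisy_screenshot",
})


@dataclass
class Clip:
    name: str
    labels: set = field(default_factory=set)  # negative labels predicted upstream
    flow_score: float = 0.0                   # higher means more motion
    aesthetic_score: float = 0.0              # higher means better looking


def keep(clip, min_flow=1.0, min_aesthetic=4.5):
    """Keep a clip only if it has no negative label and clears both thresholds.

    The threshold defaults are placeholders, not values from the article.
    """
    if clip.labels & NEGATIVE_LABELS:
        return False
    return clip.flow_score >= min_flow and clip.aesthetic_score >= min_aesthetic


clips = [
    Clip("mountain_drive", set(), flow_score=2.3, aesthetic_score=5.8),
    Clip("slideshow_talk", {"lecture", "text_dominated"}, 0.2, 4.9),
    Clip("static_landscape", set(), flow_score=0.1, aesthetic_score=6.1),
    Clip("shaky_phone", {"low_quality"}, flow_score=3.0, aesthetic_score=3.2),
]

kept = [c.name for c in clips if keep(c)]
print("kept:", kept)
```

Checking the labels before the scores means a clip with obvious problems is dropped regardless of how it scores, while the thresholds catch clean but static or unattractive footage.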

Sample Results

The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.

A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.

In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.
