
Google DeepMind's new AI tech will generate soundtracks for videos

It works with AI-generated videos and traditional footage.


Google's DeepMind artificial intelligence laboratory is working on a new technology that can generate soundtracks, and even dialogue, to go along with videos. The lab has shared its progress on the video-to-audio (V2A) technology project, which can be paired with Google Veo and other video creation tools like OpenAI's Sora. In its blog post, the DeepMind team explains that the system can understand raw pixels and combine that information with text prompts to create sound effects for what's happening onscreen. Notably, the tool can also be used to make soundtracks for traditional footage, such as silent films and any other video without sound.

DeepMind's researchers trained the technology on video, audio and AI-generated annotations that contain detailed descriptions of sounds and dialogue transcripts. They said that by doing so, the technology learned to associate specific sounds with visual scenes. As TechCrunch notes, DeepMind's team isn't the first to release an AI tool that can generate sound effects — ElevenLabs released one recently, as well — and it won't be the last. "Our research stands out from existing video-to-audio solutions because it can understand raw pixels and adding a text prompt is optional," the team writes.

While the text prompt is optional, it can be used to shape and refine the final product so that it's as accurate and as realistic as possible. You can enter positive prompts to steer the output toward sounds you want, for instance, or negative prompts to steer it away from sounds you don't want. In the sample below, the team used the prompt: "Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete."

The researchers admit that they're still trying to address the V2A technology's existing limitations, such as the drop in audio quality that can happen when there are distortions in the source video. They're also still working on improving lip synchronization for generated dialogue. In addition, they vow to put the technology through "rigorous safety assessments and testing" before releasing it to the world.