OpenAI’s much-anticipated text-to-video model, Sora, is still a work in progress, but recent developments in lip sync technology have captured everyone’s interest. Are we on the brink of a breakthrough?
Lately, lip sync technology has been on everyone’s lips. There is, however, some confusion about what the term actually means.
In the media industry, lip synchronization means matching a character’s mouth movements with sung or spoken vocals. Traditionally, it’s achieved by adjusting the wording, timing, and speed of each phrase uttered by the voice actor. But with AI’s latest developments, there’s a growing need to modify a video to synchronize the speaker’s lips with the dubbed speech.
This “visual” lip sync buzz gained momentum with several recent events. One was the AI video translation of Argentine President Javier Milei’s speech at the World Economic Forum.
Anticipation is mounting as everyone awaits a new showcase that will demonstrate AI’s potential to outperform humans. But how did the industry cope before this technology came along?
Brief history of dubbing
The practice of dubbing, which involves replacing the original language of a film or TV show with a translated version, began in the early days of cinema. The exact date of its inception is difficult to pinpoint, but with the advent of sound in the late 1920s, dubbing spread quickly. Initially, it was used primarily to translate early sound films into languages other than the original. As technology advanced, dubbing techniques improved, and the practice became standard for films and TV shows worldwide.
Since the 1950s, Germany has embraced dubbing so thoroughly that original versions became a rarity. For German audiences, the cinematic world unfolded in German, with dialogue seamlessly adapted to fit the post-war society’s mindset. There was an unspoken agreement within society to downplay and suppress the nation’s Nazi history, war crimes, and genocide, and the adaptations were carried out with meticulous precision: in films where Germans were depicted negatively, their characters were given alternative backgrounds. For instance, the Nazis involved in uranium trading in Hitchcock’s “Notorious” were reimagined as international drug smugglers, reflected in the German title “White Poison.”
Visual lip sync: how does it work?
In visual lip sync, lip movements match the phonemes (distinct units of sound) produced by the character when speaking. Animators or designers create or manipulate the lip movements of characters to align with the spoken dialogue or sound effects. An ideal match is not always required, as viewers seem to focus only on some “anchor” moments.
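To make the idea concrete, here is a minimal sketch, assuming a simplified phoneme-to-viseme mapping (a viseme is the mouth shape a viewer sees). The mapping and names below are illustrative assumptions, not an industry standard.

```python
# Hypothetical, simplified phoneme-to-viseme mapping for illustration only.
# Several phonemes share one mouth shape, which is one reason a perfect
# one-to-one match between audio and lips isn't strictly necessary.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",  # bilabials
    "f": "teeth_on_lip", "v": "teeth_on_lip",                    # labiodentals
    "o": "rounded", "u": "rounded", "w": "rounded",              # rounded vowels
    "a": "wide_open", "e": "mid_open", "i": "spread",            # other vowels
}

def visemes_for(phonemes):
    """Return the sequence of mouth shapes an animator would key for these phonemes."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# For example, the word "movie" (roughly m-u-v-i) maps to the anchor shapes:
print(visemes_for(["m", "u", "v", "i"]))  # ['lips_closed', 'rounded', 'teeth_on_lip', 'spread']
```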
How AI fuels the lip sync hype
For almost a century, broadcasters managed perfectly well without visual lip-syncing, even for high-end productions. Visual manipulation was only feasible for animated content, and even there it was used very rarely. Viewers clearly didn’t see this as a problem.
The AI hype, however, is fueling the obsession with lip sync technology, which is becoming a symbol of innovation. From content creators to major broadcasters, the ability to manipulate lip movements in live-action content opens up a myriad of possibilities for artistic expression.
Non-visual lip sync has also become relevant with the introduction of AI dubbing. When a voice in the target language is generated automatically, it generally doesn’t match the lip movements and therefore requires additional manipulation. I call this ‘conditional generation’.
Why the technology remains a work in progress
For now, visual lip sync remains a work in progress. Advances in computer vision are bringing us closer to the day when everything falls into place, although it’s difficult to give a precise timeline.
Current technology is based on models like Meta’s wav2vec 2.0, which learns a set of discrete speech units that describe the speech audio sequence. These units are, in turn, used to generate matching image fragments of the area around the speaker’s mouth, which are then blended seamlessly into the video.
This approach encourages the model to focus on the factors that matter most for representing the speech audio. That’s why wav2vec and similar models work well for many languages and dialects, even those, like Kyrgyz and Swahili, that don’t have a lot of transcribed speech audio.
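As an illustration of the first stage of such a pipeline, here is a minimal sketch that extracts frame-level speech representations from an audio track using a publicly available wav2vec 2.0 checkpoint via the Hugging Face transformers library. The audio file name and the idea of passing these features to a separate mouth-region generator are assumptions for the sake of the example; the models behind commercial lip sync tools are not public.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Load a public wav2vec 2.0 checkpoint (used here purely as an illustration).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Load the dubbed audio track (hypothetical file), mix down to mono, and
# resample to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("dubbed_speech.wav")
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sample_rate, 16_000)

# Extract frame-level speech representations (roughly one vector per 20 ms).
inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, num_frames, hidden_dim)

# In a visual lip sync pipeline, these features would condition a separate
# generator that synthesizes the mouth region frame by frame and blends it
# back into the video; that generator is not part of wav2vec 2.0 itself.
print(features.shape)
```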
Why is AI-powered visual lip sync weird?
Currently, visual lip-syncing works quite well with a “talking head” scenario where the speaker looks directly into the camera. That’s why the technology finds smooth application in various generative marketing videos.
If the video is more complex, however, some creepy side effects emerge; for example, when the speaker is not looking into the camera and the mouth is captured from a side angle.
Another example is when an object, such as a person’s hand, passes between the face and the camera, obscuring the mouth in a rather disconcerting manner. Also, wav2vec and similar technologies don’t perform well when applied to 4K video.
What’s next?
Achieving exceptional quality in visual lip-syncing for complex content such as films and series is still a long way off. The human touch is still needed to achieve perfection. However, the technology offers exciting prospects for the future, particularly as generative video becomes more widespread.
Although lip sync, both visual and non-visual, has recently garnered significant attention, it’s still a work in progress. Perhaps in the future, we’ll witness a convergence of the two techniques, allowing us to harness the strengths of both approaches for maximum impact.