AI-based text-to-image generators like DALL-E 2 and Midjourney became the talk of the town when they produced astonishingly good images. Last month, an AI-generated artwork even won the top prize in the digital art category at the Colorado State Fair. The advancements in the past six months alone show how quickly AI-based text-to-image generators have improved.
However, the question of whether the same approach can also be used to make videos remained unanswered. Meta, the parent company of Facebook, is now trying to lead the way in generating video from text. The social media giant has introduced Make-A-Video, a new AI system that generates video from text.
Video from text becomes reality
Make-A-Video, according to Meta, builds on the advancements seen with text-to-image generation technology. The company has essentially re-engineered the same tool to produce videos from text. The system works by using images with descriptions to learn what the world looks like and how it is often described.
The system also uses unlabelled videos to learn how the world moves. “With this data, Make-A-Video lets you bring your imagination to life by generating whimsical, one-of-a-kind videos with just a few words or lines of text,” the company wrote in a blog post.
In a research paper, the creators of Make-A-Video state that the tool offers three distinct advantages:
- It accelerates training of the text-to-video model (it does not need to learn visual and multimodal representations from scratch)
- It does not require paired text-video data
- The generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models
How does Make-A-Video work?
In the research paper, Meta researchers explain that they were able to turn text into video by designing a “simple yet efficient” way to build on text-to-image models. They describe their approach as novel, and it relies on effective use of spatial-temporal modules.
The research paper states that they first decompose the full temporal U-Net and attention tensors and approximate them in space and time. The second step is to design a spatial-temporal pipeline that generates high-resolution, high-frame-rate videos with a video decoder. Meta is also using an interpolation model and two super-resolution models to enable various applications.
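To get a feel for why approximating a network "in space and time" can be efficient, here is a rough sketch that counts weights in one dense 3D convolution against a factorised version: a 2D spatial convolution followed by a 1D temporal one. The function names, the 64-channel width, and the 3x3x3 kernel size are illustrative assumptions, not figures from Meta's paper, and this is only one common reading of such a decomposition.

```python
def full_3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in one dense 3D convolution (bias terms ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def factorized_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in a 2D spatial conv followed by a 1D temporal conv (bias ignored)."""
    spatial = c_in * c_out * k_h * k_w   # looks at each frame on its own
    temporal = c_out * c_out * k_t       # mixes information across frames
    return spatial + temporal


# Hypothetical layer size, chosen only for illustration.
channels = 64
full = full_3d_params(channels, channels, 3, 3, 3)
fact = factorized_params(channels, channels, 3, 3, 3)

print(f"dense 3D conv:   {full} weights")
print(f"factorised conv: {fact} weights")
print(f"ratio:           {full / fact:.2f}x")
```

Under these assumed sizes the factorised layer needs fewer than half the weights of the dense one. The spatial part also has the same shape as a layer in an image model, which fits the article's point that the system does not need to learn visual representations from scratch.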
“In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures,” the research paper reads.
What can Make-A-Video do?
On its dedicated page, Meta says that Make-A-Video is designed to bring “imagination to life and create one-of-a-kind videos”. By this, Meta is essentially referring to videos across three distinct genres: surreal, realistic, and stylised.
By surreal, Meta means clips such as a teddy bear painting a portrait, a robot dancing in Times Square, or a cat watching TV with a remote in hand. “A fluffy baby sloth with an orange knitted hat trying to figure out a laptop close up highly detailed studio lighting screen reflecting in its eye” is one of the examples shared by Meta.
For realistic, Meta gives examples such as an artist's brush painting on a canvas in close-up, or clownfish swimming through a coral reef. These videos are designed to look realistic, not just to suit social media. The last genre, stylised, essentially covers stylising existing content or creating sci-fi content.
“Hyper-realistic spaceship landing on mars” is one of the examples, though the result is not yet good enough to make it into a movie like The Martian.
In a nutshell, Meta is using AI to turn text (which is static) into video (which is dynamic). The tool can also start from still images: “Add motion to a single image or fill-in the in-between motion to two images,” Meta says on its website.
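To make "filling in the in-between" concrete, here is the simplest possible baseline: a linear cross-fade between two frames. This is explicitly not how Make-A-Video works, since Meta uses a learned interpolation model that produces actual motion, while a cross-fade only blends pixel values. The `crossfade` function and the tiny 2x2 grayscale frames are hypothetical and exist only for illustration.

```python
def crossfade(frame_a, frame_b, n_between):
    """Return n_between frames linearly blended between two equal-sized
    grayscale frames, each given as a list of rows of floats."""
    if n_between < 1:
        raise ValueError("n_between must be at least 1")
    frames = []
    for i in range(1, n_between + 1):
        w = i / (n_between + 1)  # blend weight grows from frame_a towards frame_b
        frames.append([
            [(1 - w) * a + w * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ])
    return frames


# Two made-up 2x2 frames: all black and all white.
start = [[0.0, 0.0], [0.0, 0.0]]
end = [[1.0, 1.0], [1.0, 1.0]]
middle = crossfade(start, end, 3)

for frame in middle:
    print(frame)
```

A blend like this produces ghosting rather than movement when objects change position between the two frames, which is the gap a learned interpolation model is meant to close.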
Meta is also pitching this new tool as a way to add extra creativity to your videos. Make-A-Video is essentially a sign of what’s to come from Meta and how it could extend to its platforms like Facebook, WhatsApp, Instagram, and even Oculus. With Meta’s focus shifted to the metaverse, it needs to make its AI tools fast and effective. With the UK's competition watchdog forcing Meta to divest Giphy, this tool could even become a short-term alternative for user-generated GIFs.