Although the effect is rather crude, the system offers an early glimpse of what’s coming next for generative artificial intelligence, and it is the next obvious step from the text-to-image AI systems that have caused huge excitement this year.
Meta’s announcement of Make-A-Video, which is not yet being made available to the public, will likely prompt other AI labs to release their own versions. It also raises some big ethical questions.
In the last month alone, AI lab OpenAI has made its latest text-to-image AI system DALL-E available to everyone, and AI startup Stability.AI launched Stable Diffusion, an open-source text-to-image system.
But text-to-video AI comes with some even greater challenges. For one, these models need a vast amount of computing power. They are an even bigger computational lift than large text-to-image AI models, which use millions of images to train, because putting together just one short video requires hundreds of images. That means it’s really only large tech companies that can afford to build these systems for the foreseeable future. They’re also trickier to train, because there aren’t large-scale data sets of high-quality videos paired with text.
To work around this, Meta combined data from three open-source image and video data sets to train its model. Standard text-image data sets of labeled still images helped the AI learn what objects are called and what they look like. And a database of videos helped it learn how those objects are supposed to move in the world. The combination of the two approaches helped Make-A-Video, which is described in a non-peer-reviewed paper published today, generate videos from text at scale.
Tanmay Gupta, a computer vision research scientist at the Allen Institute for Artificial Intelligence, says Meta’s results are promising. The videos it’s shared show that the model can capture 3D shapes as the camera rotates. The model also has some notion of depth and understanding of lighting. Gupta says some details and movements are decently done and convincing.
However, “there’s plenty of room for the research community to improve on, especially if these systems are to be used for video editing and professional content creation,” he adds. In particular, it’s still tough to model complex interactions between objects.
In the video generated by the prompt “An artist’s brush painting on a canvas,” the brush moves over the canvas, but strokes on the canvas aren’t realistic. “I would love to see these models succeed at generating a sequence of interactions, such as ‘The man picks up a book from the shelf, puts on his glasses, and sits down to read it while drinking a cup of coffee,’” Gupta says.