Screenwriters denied the big budgets and formidable resources of the major film studios may soon have another option, thanks to a new algorithm that can generate a video simply by consuming a (very short) script. The new movies are far from Oscar-worthy, but a similar technique could one day find uses outside entertainment, by, say, helping a witness reconstruct a car crash or a crime.
Artificial intelligence (AI) is getting much better at identifying the content of images and providing labels. So-called “generative” algorithms go the other way, producing images from labels (or brain scans). A few can even take a single movie frame and predict the next series of frames. But putting it all together—creating an image from text and making it move realistically in accordance with the text—has not been done before.
“As far as I know, it’s the first text-to-video work that gives such good results. They are not perfect, but at least they start to look like real videos,” says Tinne Tuytelaars, a computer scientist at Katholieke Universiteit Leuven in Belgium, who has done her own video prediction research. “It’s really nice work.”
The new algorithm is a form of machine learning, which means it requires training. Specifically, it’s a neural network, or a series of layers of small computing elements that process data in a way reminiscent of the brain’s neurons. During training, software assesses its performance after each attempt, and feedback circulates through the millions of network connections to refine future computations.
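To make that feedback loop concrete, here is a minimal sketch in which a single artificial neuron stands in for the millions of connections in a real network. The data, the learning rate, and the function names are all illustrative assumptions, not details from the researchers' system.

```python
import math

def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def train_neuron(samples, steps=2000, lr=0.5):
    """Fit one neuron (weight w, bias b) by repeated trial and feedback.

    Hypothetical toy setup: each sample is (input, target) with target 0 or 1.
    """
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, target in samples:
            pred = sigmoid(w * x + b)   # the attempt
            error = pred - target       # how far off the attempt was
            w -= lr * error * x         # feedback nudges the weight...
            b -= lr * error             # ...and the bias
    return w, b

# Toy task: negative inputs should map to 0, positive inputs to 1.
samples = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
w, b = train_neuron(samples)
print(round(sigmoid(w * 2.0 + b), 3), round(sigmoid(w * -2.0 + b), 3))
```

After training, the neuron's output for an input of 2 sits close to 1 and its output for -2 sits close to 0. A full network repeats the same nudge across every connection in every layer.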
This network operates in two stages “designed to mimic how humans create art,” the researchers write. The first stage uses the text to create a “gist” of the video, basically a blurry image of the background with a blurry blob where the main action takes place. The second stage takes both the gist and the text and produces a short video. During training, a second network acts as a “discriminator.” It sees the video generated to illustrate, say, “sailing on the sea,” alongside a real video of sailing on the sea, and it is trained to pick the real one. As it gets better, it becomes a harsher critic, and its feedback sets a higher bar for the generator network.
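The tug-of-war between generator and discriminator can be sketched with the standard adversarial loss from the generative-network literature. This is an assumption for illustration: the article does not spell out the exact objective the researchers used, and the probabilities below are made up.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() away from 0
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

def discriminator_loss(p_real, p_fake):
    """The critic wants to call real video 1 and generated video 0."""
    return bce(p_real, 1.0) + bce(p_fake, 0.0)

def generator_loss(p_fake):
    """The generator wants the critic to call its video real."""
    return bce(p_fake, 1.0)

# p_fake is the critic's (hypothetical) belief that a generated clip is real.
weak_critic = generator_loss(0.5)    # critic is only guessing
harsh_critic = generator_loss(0.05)  # critic is rarely fooled
print(round(weak_critic, 3), round(harsh_critic, 3))
```

The generator's loss climbs as the critic improves, which is the "higher bar" in numerical form: the same video that satisfied a weak critic is penalized heavily by a harsh one.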
The researchers trained the algorithm on 10 types of scenes, including “playing golf on grass” and “kitesurfing on the sea,” which it then roughly reproduced. Picture grainy VHS footage. Nevertheless, a simple classification algorithm correctly guessed the intended action among six choices about half the time. (Sailing and kitesurfing were often mistaken for each other.) What’s more, the network could also generate videos for nonsensical actions, such as “sailing on snow” and “playing golf at swimming pool,” the team reported this month at a meeting of the Association for the Advancement of Artificial Intelligence in New Orleans, Louisiana.
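How good is "about half the time"? A quick back-of-the-envelope calculation compares it with blind guessing. The 0.5 below is the article's rounded figure, not an exact reported accuracy.

```python
n_choices = 6
chance = 1 / n_choices   # accuracy of a random guess among six options
observed = 0.5           # "about half the time" (approximate)
lift = observed / chance

print(f"chance: {chance:.1%}, observed: {observed:.0%}, lift: {lift:.1f}x")
# prints "chance: 16.7%, observed: 50%, lift: 3.0x"
```

In other words, the classifier recognized the intended action roughly three times as often as chance would predict.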
“Their methods are very interesting, combining the two stages,” says Hamed Pirsiavash, a computer scientist at the University of Maryland, Baltimore County, who has done video prediction work. “It’s a super difficult problem. So, I’m glad that these guys have made good progress.”
Currently, the videos are only 32 frames long—lasting about 1 second—and the size of a U.S. postage stamp, 64 by 64 pixels. Anything larger reduces accuracy, says Yitong Li, a computer scientist at Duke University in Durham, North Carolina, and the paper’s first author. Because people often appear as distorted figures, a next step, he says, is using human skeletal models to improve movement.
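A rough count shows how much the network must produce for even a postage-stamp clip, and how fast that grows with resolution. The three color channels are an assumption (standard RGB); the article gives only the frame count and pixel dimensions.

```python
FRAMES, HEIGHT, WIDTH = 32, 64, 64
CHANNELS = 3  # assumption: RGB color, not stated in the article

def clip_values(frames, height, width, channels=CHANNELS):
    """Number of individual values the generator must output for one clip."""
    return frames * height * width * channels

base = clip_values(FRAMES, HEIGHT, WIDTH)   # 32 frames at 64 x 64
doubled = clip_values(FRAMES, 128, 128)     # same clip at 128 x 128
fps = FRAMES / 1.0                          # 32 frames over about 1 second

print(base, doubled, doubled // base, fps)
# prints "393216 1572864 4 32.0"
```

Doubling the width and height quadruples the number of values to get right, which gives a sense of why larger videos are harder to generate accurately.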