Inspired by Steve Reich’s Music for 18 Musicians, I used machine learning to create a visual to go along with it:

It uses videos recorded from train windows, with landscapes that move from right to left, to train a machine learning (ML) algorithm. First, the algorithm learns how to predict the next frame of a video by analyzing examples. Then it produces a frame from an initial picture, another frame from the one it just generated, and so on: the output becomes the input of the next step. So, except for the initial image that I chose, all the other frames were generated by the algorithm. In other words, the process is a feedback loop built around an artificial neural network.
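To make the idea concrete, here is a minimal sketch of that feedback loop in Python. The `model` callable stands in for the trained network and is an assumption for illustration, not the actual Magenta code:

```python
import numpy as np

def generate_sequence(model, first_frame, num_frames):
    """Generate a video by repeatedly feeding the model its own output.

    `model` is assumed to be a callable that maps one frame
    (an H x W x 3 array) to a prediction of the next frame.
    Only `first_frame` comes from a real photograph.
    """
    frames = [first_frame]
    current = first_frame
    for _ in range(num_frames - 1):
        current = model(current)   # predict the next frame
        frames.append(current)     # the output becomes the next input
    return np.stack(frames)
```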

After several requests, I finally found time to publish the code as part of the Magenta project.

Motivation & Limitations

ML is a very exciting field of computer science, so even though I’m not an expert in the field, I decided to give it a try. Creating a video with TensorFlow was a good challenge. Thanks to all the open source projects and courses available online, the barrier to entry for creating something with ML is lower than ever before.

The algorithm is not able to predict the next frames well for all types of videos; it needs to be trained on a specific dataset. Because it uses only one frame to predict the next one, it works best when the optical flow is almost the same from one frame to another. While the results are low resolution, blurry, and not realistic most of the time, for the train videos they can resonate with the feelings we have when traveling by train. Unlike classical computer-generated content, these patterns are not created by a deterministic algorithm (with perhaps some randomization) written by a software engineer. For instance, the foreground should move faster than the background in the generated video: thanks to machine learning and the data used to train the model, the algorithm managed to replicate that effect by itself.

Predicting the future is what we humans do all the time: we are constantly anticipating what may happen next. A perfect system able to predict the next frame of any video would be able to predict that a dropped glass will fall and crash to the ground, unless of course the glass lands on a soft surface or someone catches it during its fall. So a system that predicts the next frame of a video may not only need to, in some sense, “understand” the physical behavior of the objects in the video; it may also need to consider multiple potential paths the video could take, based on other data such as the text or audio from the video, a final frame, etc. This ambiguity and difficulty are what make the work of trying to predict future frames in a video so exciting. It’s important to note that the work presented in this post is more artistic and inspirational in nature, and not focused on scientific advancements in video prediction and generation. While the imperfections of current algorithms and models for video prediction make them difficult to use in some practical applications, these same imperfections are not only tolerable but also valuable for more artistic endeavors.

More details

The algorithm works by extracting frames from the videos and creating pairs of consecutive frames.
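As a rough sketch, the frame pairs could be built with something like this, using OpenCV; the function name is illustrative and not the actual Magenta pipeline:

```python
import cv2

def extract_frame_pairs(video_path):
    """Read a video and return (frame, next frame) training pairs."""
    capture = cv2.VideoCapture(video_path)
    pairs = []
    ok, previous = capture.read()
    while ok:
        ok, current = capture.read()
        if not ok:
            break
        pairs.append((previous, current))  # input frame, target frame
        previous = current
    capture.release()
    return pairs
```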

I trained the algorithm on several videos and made several attempts, but the generated sequence kept diverging from realistic frames and converging to something too abstract and static.

It was not surprising to see this happen as the errors simply kept accumulating. So, to prevent the process from diverging, I added another set of image pairs to the training set. Rather than adding two consecutive real frames, I added pairs made of a predicted frame and the corresponding real frame. This way, the algorithm learns to generate real frames from predicted ones and is able to compensate for the errors that make the model diverge. The new pairs look like this:
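Here is a minimal sketch of how such correction pairs could be built, assuming the same hypothetical `model` callable as above; the real training code differs in its details:

```python
def make_correction_pairs(model, real_frames):
    """Pair the model's own drifting predictions with the real frames.

    Starting from the first real frame, roll the model forward on its
    own output and pair each prediction with the real frame it should
    have matched, so the model learns to pull its outputs back toward
    realistic frames.
    """
    pairs = []
    predicted = real_frames[0]
    for real in real_frames[1:]:
        predicted = model(predicted)      # the model's prediction for this step
        pairs.append((predicted, real))   # input: predicted frame, target: real frame
    return pairs
```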

This technique is very similar to Scheduled Sampling. In practice, the main script trains a model, generates sequences that diverge, pairs them with real sequences, and retrains until the output is stable. This way, the network learns to stabilize its own feedback loop during video generation.
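In outline, that train/generate/retrain loop might look like the sketch below, reusing the `make_correction_pairs` helper sketched above. The `model.fit` interface over (input, target) pairs and the fixed number of rounds are assumptions for illustration, not the actual script:

```python
def train_with_feedback(model, real_frames, num_rounds=5):
    """Alternate training on real pairs and on correction pairs.

    Train on consecutive real frames, let the model generate (and drift),
    pair its outputs with the real frames, and retrain so that the
    generation loop becomes stable.
    """
    # Ordinary next-frame pairs: (real frame t, real frame t + 1).
    real_pairs = list(zip(real_frames[:-1], real_frames[1:]))
    model.fit(real_pairs)

    for _ in range(num_rounds):
        correction_pairs = make_correction_pairs(model, real_frames)
        model.fit(real_pairs + correction_pairs)
    return model
```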

Source code

The source code is available as part of the Magenta project and can be found in the Magenta GitHub repository. It is based on this implementation of Pix2Pix. We look forward to seeing what you’ll create from there!

More examples

You can see more videos made with this technique on this Magenta YouTube playlist. If you want to see yours on the list, please tag them with #MadeWithMagenta.

Related work

  • Mario Klingemann is, as far as I know, the first to publish a video generated using Pix2Pix for next-frame prediction.
  • To keep the generation realistic, the technique used here shares some similarity with Scheduled Sampling techniques.

About me

My name is Damien Henry. I’m one of the two engineers who started the Google Cardboard project, and I lead the experiments team at the Google Arts & Culture Lab, where we started Art Selfie.