Previously, we introduced MusicVAE, a hierarchical variational autoencoder over musical sequences. In this post, we demonstrate the use of MusicVAE to model a particular type of sequence: individual measures of General MIDI music with optional underlying chords.

General MIDI is a symbolic music representation that uses a standard set of 128 instrument sounds; this restriction to predefined instruments like “Honky-Tonk Piano” and “SynthStrings 1” often results in a cheesy sound reminiscent of old video game music. We use General MIDI here as a basic representation for exploring polyphonic music generation with multiple instruments, not because we expect it to make a comeback.

With that out of the way, here is a CodePen that demonstrates a few of the things you can do with such a model:

What does the model do?

In the above CodePen, the model generates individual measures with up to 8 different instruments, conditioned on an underlying chord and a latent vector. Holding the latent vector fixed while changing the underlying chord lets the model generate an arrangement over a chord progression with a consistent style. In the CodePen, you can also interpolate the latent vector between two random points, morphing between the two styles over the same chord progression.

To generate a measure, the model first samples a latent vector (or obtains one in some other way). A “conductor” LSTM decodes this latent vector into 8 track embeddings. A track LSTM then independently decodes each track, conditioned on its embedding and the underlying chord.
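The two-level data flow can be sketched in a few lines of Python. This is a toy illustration only: the real conductor and track decoders are LSTMs, and all names, sizes, and the stand-in functions below are ours, not taken from the actual model code.

```python
import random

NUM_TRACKS = 8
EMBEDDING_SIZE = 3  # per-track embedding size (illustrative)

def conductor(z):
    """Stand-in for the conductor LSTM: latent vector -> 8 track embeddings."""
    random.seed(sum(z))  # deterministic stand-in for LSTM computation
    return [[random.random() for _ in range(EMBEDDING_SIZE)]
            for _ in range(NUM_TRACKS)]

def track_decoder(embedding, chord):
    """Stand-in for the track LSTM: one embedding + chord -> one track."""
    # The real decoder emits an instrument choice followed by note
    # events; here we return a placeholder record to show the inputs.
    return {"chord": chord, "embedding": embedding, "events": []}

def decode_measure(z, chord):
    embeddings = conductor(z)
    # Each track is decoded independently, conditioned only on its
    # own embedding and the underlying chord.
    return [track_decoder(e, chord) for e in embeddings]

measure = decode_measure([0.1, -0.2, 0.3, 0.0], "C")
print(len(measure))  # 8 tracks
```

The key structural point is the independence in the last step: once the conductor has produced the embeddings, no track sees any other track.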

The track decoder first a) chooses an instrument for the track, represented as one of the 128 General MIDI program numbers or drums, then b) chooses which notes to play and when, using a Performance RNN representation with meter-relative rather than absolute time steps.
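To make this concrete, here is a minimal sketch of what such an event sequence looks like: an instrument choice followed by note-on, note-off, and time-shift events on a meter-relative grid of 24 steps per quarter note. The event names and the encoding function are illustrative; the real vocabulary is defined in the Magenta source.

```python
STEPS_PER_QUARTER = 24  # meter-relative grid used by the model

def notes_to_events(program, notes):
    """Encode one track as a Performance-RNN-style event list.

    `notes` are (pitch, start, end) tuples with times measured in
    quarter notes from the start of the measure.
    """
    # Collect note-on / note-off moments on the step grid; at equal
    # times, note-offs (sort key 0) come before note-ons (sort key 1).
    moments = []
    for pitch, start, end in notes:
        moments.append((round(start * STEPS_PER_QUARTER), 1, ("NOTE_ON", pitch)))
        moments.append((round(end * STEPS_PER_QUARTER), 0, ("NOTE_OFF", pitch)))
    moments.sort()

    events = [("PROGRAM", program)]  # a) choose the instrument
    step = 0
    for t, _, ev in moments:         # b) notes and timing
        if t > step:
            events.append(("TIME_SHIFT", t - step))
            step = t
        events.append(ev)
    return events

# A C major arpeggio: three quarter notes on beats 1 through 3.
events = notes_to_events(0, [(60, 0, 1), (64, 1, 2), (67, 2, 3)])
print(events)
```

Because the time shifts count grid steps rather than seconds, the same event sequence is valid at any tempo.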

Despite our use of the word “style” above, we should point out that the model uses no explicit style labels and is taught nothing of style nor of any musical concepts other than those in the MIDI representation itself.

Chord Progression Examples

Here are two examples of the model decoding a single point in the latent space over a chord progression:

C G Bb F Ab Eb D G
C C+ Am E F Fm C G

And here’s an example of the model performing an interpolation in the latent space over a repeating chord progression (Dm F Am G):

Encoding Existing Measures

Because the model is a VAE, it can also encode existing measures. This makes it possible to perform latent space manipulations not just on generated samples, but on existing General MIDI music.

Here’s a CodePen for a non-chord-conditioned version of the model where you can perform latent space interpolation on your own encoded measures as well as random samples:

In case you don’t have any single-measure MIDI files just lying around, here are a few you can try importing:

Note that the model cannot perfectly reconstruct these measures, but it usually captures something of their musical style. As with other variational autoencoders, there is an inherent tradeoff between sample quality and reconstruction fidelity: a model that reconstructs existing measures more accurately tends to generate less realistic samples, and vice versa.

Training Data

The model is trained on the Lakh MIDI Dataset, which contains over 170,000 MIDI sequences. After splitting the sequences into individual measures, deduplicating, and removing measures unsupported by our representation (see the limitations below), our training dataset contains about 4 million unique measures. MIDI files do not contain the chord symbols we use for conditioning, so we estimate chords across the entire dataset using a custom procedure described in our arXiv paper (Python source code here).


Limitations

While the representation used by the model is intended to be quite general, a few restrictions remain. Each measure must contain 8 or fewer tracks and must be in 4/4 time; it is then quantized to 96 steps (24 per quarter note). The model thus has no notion of tempo: all timing is relative to the quarter note, and BPM is discarded. None of these restrictions poses a serious obstacle to extending the model to a broader class of MIDI sequences.
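The quantization step can be sketched as a simple snap-to-grid, assuming times are measured in quarter notes from the start of the measure (the function below is our illustration, not the preprocessing code itself):

```python
STEPS_PER_QUARTER = 24
STEPS_PER_MEASURE = 4 * STEPS_PER_QUARTER  # 4/4 only, so 96 steps

def quantize(time_qn):
    """Snap a time (in quarter notes from the measure start) to the
    96-step grid. Times are meter-relative, so tempo never enters."""
    step = round(time_qn * STEPS_PER_QUARTER)
    if not 0 <= step <= STEPS_PER_MEASURE:
        raise ValueError("note falls outside the measure")
    return step

# An eighth-note triplet on beat 1: offsets 0, 1/3, 2/3 of a quarter.
print([quantize(t) for t in (0, 1/3, 2/3)])  # [0, 8, 16]
```

Note that 24 steps per quarter is fine enough to represent both sixteenth notes (6 steps) and triplets (8 steps) exactly, which is why this grid loses little for most quantized MIDI.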

The model’s most fundamental restriction is that imposed by General MIDI itself: the limitation to 128 instrument presets + drums. Real music contains instrument sounds selected from an essentially infinite set, and individual pieces of music often contain custom sounds not used anywhere else. We at Magenta are of course very interested in modeling music with such a diversity of sounds; the NSynth project is one effort to model and also expand the set of accessible instrument sounds, and we continue to work on other projects with the explicit goal of modeling musical audio more generally.

More Details

If you want to learn more, read our arXiv paper:

Or check out our full page of examples:

How can I try it?

There are a few ways:

  1. Use the above CodePens! This is probably the easiest way to get started. If you like an arrangement but would prefer to use your own sounds, you can export it to MIDI and open it in a DAW.
  2. Make your own interactive experience using the model via magenta.js! You can take a look at the CodePens (or fork them) to get started.
  3. Train one yourself. This is more involved, but all of the code necessary to do so is available in our open-source Python repo.

We look forward to hearing what you create!