One of the difficult problems in using machine learning to generate sequences, such as melodies, is creating long-term structure. Long-term structure comes very naturally to people, but it’s very hard for machines. Basic machine learning systems can generate a short melody that stays in key, but they have trouble generating a longer melody that follows a chord progression, or follows a multi-bar song structure of verses and choruses. Likewise, they can produce a screenplay with grammatically correct sentences, but not one with a compelling plot line. Without long-term structure, the content produced by recurrent neural networks (RNNs) often seems wandering and random.

But what if these RNN models could recognize and reproduce longer-term structure? Could they produce content that feels more meaningful – more human? Today we’re open-sourcing two new Magenta models, Lookback RNN and Attention RNN, both of which aim to improve RNNs’ ability to learn longer-term structures. We hope you’ll join us in exploring how they might produce better songs and stories.

# Lookback RNN

Lookback RNN introduces custom inputs and labels. The custom inputs allow the model to more easily recognize patterns that occur across 1 and 2 bars. They also help the model recognize patterns related to where in the measure an event occurs. The custom labels make it easier for the model to repeat sequences of notes without having to store them in the RNN’s cell state. The type of RNN cell used in this model is an LSTM.

In our introductory model, Basic RNN, the input to the model was a one-hot vector of the previous event, and the label was the target next event. The possible events were note-off (turn off any currently playing note), no event (if a note is playing, continue sustaining it, otherwise continue silence), and a note-on event for each pitch (which also turns off any other note that might be playing). In Lookback RNN, we add the following additional information to the input vector:

• In addition to inputting the previous event, we also input the events from 1 and 2 bars ago. This allows the model to more easily recognize patterns that occur across 1 and 2 bars, such as mirrored or contrasting melodies.

• We also input whether the last event was repeating the event from 1 or 2 bars before it. This signals whether the last event was creating something new or just repeating an already established melody, which allows the model to more easily recognize patterns associated with being in a repetitive or non-repetitive state.

• We also input the current position within the measure (as done previously by Daniel Johnson), allowing the model to more easily learn patterns associated with 4/4 time music. These inputs are 5 values that can be thought of as a binary step clock.
Step 1: $[0, 0, 0, 0, 1]$
Step 2: $[0, 0, 0, 1, 0]$
Step 3: $[0, 0, 0, 1, 1]$
Step 4: $[0, 0, 1, 0, 0]$
The only difference in the actual model input is that the values are -1 and 1 instead of 0 and 1.
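
The step clock above can be sketched in a few lines of Python. This is only an illustration of the encoding as described here, not the code Magenta uses: the function name `step_clock`, the 1-indexed step argument, and the most-significant-bit-first ordering (chosen to match the listing above) are all assumptions.

```python
def step_clock(step, bits=5):
    """Encode a 1-indexed step as a binary counter, most significant bit first.

    Set bits become 1.0 and cleared bits become -1.0, matching the
    "-1 and 1 instead of 0 and 1" convention described in the text.
    """
    return [1.0 if (step >> i) & 1 else -1.0 for i in reversed(range(bits))]


# Print the clock for the first four steps of a measure.
for step in range(1, 5):
    print(step, step_clock(step))
```

Step 1 yields `[-1.0, -1.0, -1.0, -1.0, 1.0]`, which is the first row of the listing with 0 replaced by -1.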

In addition to feeding the model more input information, we also add two new custom labels: one to repeat the event from 1 bar ago and one to repeat the event from 2 bars ago. This is where the Lookback RNN gets its name. When creating labels for the training data, if the current event in the melody repeats the event from 2 bars ago, we set the label for that step to repeat-2-bars-ago. If it doesn’t, we check whether it repeats the event from 1 bar ago, and if so, we set the label for that step to repeat-1-bar-ago. Only when the melody isn’t repeating the event from 1 or 2 bars ago do we set the label for that step to a specific melody event. For example, if the third bar of the melody completely repeats the first bar, every label for that third bar will be the repeat-2-bars-ago label. This allows the model to more easily repeat 1 or 2 bar phrases without having to store those sequences in its memory cell. Since many melodies in popular music repeat events from 1 and 2 bars ago, these extra labels reduce the complexity of the information the model has to learn to represent.
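
The labeling rule can be sketched as follows. This is a simplified illustration under stated assumptions rather than the Magenta implementation: events are plain Python values (MIDI pitch numbers and the strings `"no-event"` and `"note-off"`), the function name `lookback_labels` is hypothetical, and the toy melody uses 4 steps per bar to keep the example short.

```python
REPEAT_1_BAR = "repeat-1-bar-ago"
REPEAT_2_BARS = "repeat-2-bars-ago"


def lookback_labels(events, steps_per_bar):
    """Label each step, preferring the 2-bar repeat over the 1-bar repeat."""
    labels = []
    for i, event in enumerate(events):
        if i >= 2 * steps_per_bar and event == events[i - 2 * steps_per_bar]:
            labels.append(REPEAT_2_BARS)
        elif i >= steps_per_bar and event == events[i - steps_per_bar]:
            labels.append(REPEAT_1_BAR)
        else:
            # Not a repeat: the label is the specific melody event itself.
            labels.append(event)
    return labels


# Toy melody in which the third bar repeats the first bar exactly.
bar_1 = [60, "no-event", 62, "no-event"]
bar_2 = [64, "no-event", 62, "note-off"]
melody = bar_1 + bar_2 + bar_1

labels = lookback_labels(melody, steps_per_bar=4)
print(labels[8:])  # every label in the third bar is repeat-2-bars-ago
```

In the second bar only the steps that happen to match the first bar get the repeat-1-bar-ago label. The rest keep their specific events.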

Here are some sample melodies generated by the Lookback RNN model when trained on a collection of popular music. The intro notes (played on the glockenspiel) were given to the model as a priming melody. The rest of the notes were generated.

To train the Lookback RNN on your own MIDI collection and generate your own melodies from it, follow the steps in the README.

# Attention RNN

To learn even longer-term structure we can use attention. Attention is one of the ways that models can access previous information without having to store it in the RNN cell’s state. The RNN cell used in this model is an LSTM. The attention method used comes from the paper Neural Machine Translation by Jointly Learning to Align and Translate (D Bahdanau, K Cho, Y Bengio, 2014). In that paper, the model is an encoder-decoder RNN, and the model uses attention to look at all the encoder outputs during each decoder step. In our version, where we don’t have an encoder-decoder, we just always look at the outputs from the last $n$ steps when generating the output for the current step. The way we “look at” these steps is with an attention mechanism. Specifically:

$$u_i^t = v^\top \tanh(W_1^\prime h_i + W_2^\prime c_t)$$

$$a_i^t = \mathrm{softmax}(u_i^t)$$

$$h_t^\prime = \sum_{i=t-n}^{t-1} a_i^t h_i$$

The vector $v$ and matrices $W_1^\prime$, $W_2^\prime$ are learnable parameters of the model. $h_i$ are the RNN outputs from the previous $n$ steps $(h_{t-n},...,h_{t-1})$, and vector $c_t$ is the current step’s RNN cell state. These values are used to calculate $u_i^t$ $(u_{t-n}^t,...,u_{t-1}^t)$, a vector of length $n$ with one value for each of the previous $n$ steps. The values represent how much attention each step should receive. A softmax is used to normalize these values and create a mask-like vector $a_i^t$, called the attention mask. The RNN outputs from the previous $n$ steps are then multiplied by these attention mask values and summed together to get $h_t^\prime$. For example, let’s assume we are on the 4th step of our sequence and $n$ = 3, which means our attention mechanism is only looking at the last 3 steps. For this example, the RNN output vectors will be short vectors of length 4. If the RNN outputs from the first 3 steps are:

Step 1: $[1.0, 0.0, 0.0, 1.0]$
Step 2: $[0.0, 1.0, 0.0, 1.0]$
Step 3: $[0.0, 0.0, 0.5, 0.0]$

And our calculated attention mask is:

$a_{i}^{t}$ $= [0.7, 0.1, 0.2]$

Then the previous step would get 20% attention, 2 steps ago would get 10% attention, and 3 steps ago would get 70% attention. So their masked values would be:

Step 1 (70%): $[0.7, 0.0, 0.0, 0.7]$
Step 2 (10%): $[0.0, 0.1, 0.0, 0.1]$
Step 3 (20%): $[0.0, 0.0, 0.1, 0.0]$

And then they’d be summed together to get $h_t^\prime$:

$h_t^\prime$ $= [0.7, 0.1, 0.1, 0.8]$

The $h_t^\prime$ vector is essentially all $n$ previous outputs combined together, with each output contributing in proportion to how much attention that step received.
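
The arithmetic in this example can be checked with a short script. It only reproduces the masking and summing steps using the numbers given above. The helper name `weighted_sum` is illustrative, and the attention mask is taken as given rather than computed.

```python
# RNN outputs from the previous n = 3 steps, oldest first.
outputs = [
    [1.0, 0.0, 0.0, 1.0],  # step 1 (3 steps ago)
    [0.0, 1.0, 0.0, 1.0],  # step 2 (2 steps ago)
    [0.0, 0.0, 0.5, 0.0],  # step 3 (previous step)
]
attention_mask = [0.7, 0.1, 0.2]


def weighted_sum(vectors, weights):
    """Scale each vector by its weight, then add the results element-wise."""
    size = len(vectors[0])
    return [
        sum(w * vec[k] for vec, w in zip(vectors, weights))
        for k in range(size)
    ]


h_prime = weighted_sum(outputs, attention_mask)
print([round(x, 3) for x in h_prime])  # [0.7, 0.1, 0.1, 0.8]
```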

This $h_t^\prime$ vector is then concatenated with the RNN output from the current step and a linear layer is applied to that concatenated vector to create the new output for the current step. Some attention models only apply this $h_t^\prime$ vector to the RNN output, but in our model, as is also sometimes done, this $h_t^\prime$ vector is also applied to the input of the next step. The $h_t^\prime$ vector is concatenated with the next step’s input vector and a linear layer is applied to that concatenated vector to create the new input to the RNN cell. This helps attention not only affect the data coming out of the RNN cell, but also the data being fed into the RNN cell.

This $h_t^\prime$ vector, which is a combination of the outputs from the previous $n$ steps, is how attention directly injects information from those steps into the current step’s calculations. This makes it easier for the model to learn longer-term dependencies without having to store all of that information in the RNN cell’s state. If you’d like an even deeper understanding of the whole attention process, you can walk through the code to see exactly what’s happening.
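
For readers who prefer code to notation, here is a minimal pure-Python sketch of one attention step followed by the output projection. It is not the Magenta implementation. It assumes the additive scoring form $u_i^t = v^\top \tanh(W_1^\prime h_i + W_2^\prime c_t)$ from the cited paper, and all function names, the random toy weights, and the choice of equal sizes for every vector are assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(prev_outputs, cell_state, v, w1, w2):
    """Return (attention mask, h_prime) for the previous n RNN outputs."""
    projected_state = matvec(w2, cell_state)
    scores = []
    for h_i in prev_outputs:
        hidden = [math.tanh(a + b)
                  for a, b in zip(matvec(w1, h_i), projected_state)]
        scores.append(sum(v_k * x for v_k, x in zip(v, hidden)))
    mask = softmax(scores)
    size = len(prev_outputs[0])
    h_prime = [sum(a * h[k] for a, h in zip(mask, prev_outputs))
               for k in range(size)]
    return mask, h_prime


def linear(weights, bias, vector):
    """Apply a linear layer: weights * vector + bias."""
    return [y + b for y, b in zip(matvec(weights, vector), bias)]


# Toy setup: random weights stand in for learned parameters (assumption).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


size = 4
prev_outputs = [[1.0, 0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0, 1.0],
                [0.0, 0.0, 0.5, 0.0]]
cell_state = [0.2, -0.1, 0.4, 0.0]
current_output = [0.3, 0.3, -0.2, 0.1]

v = [rng.uniform(-1, 1) for _ in range(size)]
w1, w2 = rand_matrix(size, size), rand_matrix(size, size)
mask, h_prime = attention_step(prev_outputs, cell_state, v, w1, w2)

# Concatenate h_prime with the current RNN output and project back down.
w_out, b_out = rand_matrix(size, 2 * size), [0.0] * size
new_output = linear(w_out, b_out, h_prime + current_output)
print(mask, new_output)
```

The same concatenate-then-project pattern would apply to the next step’s input vector. It is omitted here to keep the sketch short.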

Here are some sample melodies generated by the Attention RNN model when trained on a collection of popular music. These melodies were all primed with the first four notes of Twinkle Twinkle Little Star, then the rest of the notes were generated by the model:

Melody 1 and 2 were combined in a standard song format, AABA, and backed up by drums to create the following song sample:

Jason Nguyen (@SoulGook), on the đàn bầu, and Alex Koman (@meloscribe), on guitar, added to that song to create this man and machine collaboration:

The following song layers together the three Attention RNN melodies listed above. They complement each other surprisingly well. The drums and bass line were added by a human. This demonstrates how musicians could use these generated melodies to build out larger, more elaborate songs.

To train the Attention RNN on your own MIDI collection and generate your own melodies from it, follow the steps in the README on GitHub.

These models improve on the initial Magenta Basic RNN by adding two forms of memory manipulation: simple lookback and learned attention. Nevertheless, a lot of work remains before Magenta models are writing complete pieces of music or telling long stories. Stay tuned for more improvements.

Edit (@elliotwaite Aug 8, 2016): Updated the reference for the attention method used to the paper that originally introduced the idea, Neural Machine Translation by Jointly Learning to Align and Translate (D Bahdanau, K Cho, Y Bengio, 2014).