(or Learning Music Learned From Music)

A few days ago, DeepMind posted audio synthesis results from WaveNet that included .wav files generated from a training dataset of hours of solo piano music. Each .wav file (near the bottom of their post) is 10 seconds long and sounds very much like piano music. I took a closer look at these samples.

Musically, I found the clips fascinating to listen to, and in particular, before even knowing anything about the training data itself, they made me curious about it, both as a machine learning researcher and as a musician. For example, many moments in the synthesized examples reminded me of Russian composers such as Skryabin… why was this? One way that scientists learn is by trying to replicate what other scientists have done; similarly, one way that musicians learn is by trying to replicate what other musicians have created. So, in order to better understand what was going on in these sounds, I started learning to play some of the clips by ear on piano (this is sometimes called lifting). I made the above video while working on WaveNet’s piano sample #3. (Disclaimer: this was done quickly and roughly; transcribing well is usually very time intensive, and there are various inaccuracies, at the note level and at other levels as well.) It is just a musical outline of the corresponding clip. The last couple of seconds are a bit of improvising I did on the spot to continue the otherwise cut-off phrase.

I later recorded another variation of the same clip, this time on my electric keyboard, so that I could include a MIDI file for you to download as well.

This is the score generated by MuseScore when fed the raw MIDI file. The rhythm is wrong and pedaling is absent, but you get to see the notes.
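
If you want to inspect or re-render the raw MIDI yourself rather than going through MuseScore, here is a minimal sketch of one way to do it programmatically, assuming the music21 library and that the downloaded file is saved as sample3.mid (a placeholder filename, not one from the post):

```python
# A minimal sketch (not how the score above was made): load the raw MIDI
# and export it as MusicXML, which MuseScore or any notation program can open.
# "sample3.mid" is a placeholder filename for the downloaded MIDI file.
from music21 import converter

score = converter.parse('sample3.mid')        # parse the raw MIDI into a music21 stream
score.show('text')                            # print a plain-text listing of the notes
score.write('musicxml', 'sample3.musicxml')   # export for notation software
```

As with the MuseScore import, the rhythms will not look clean without quantization, since the MIDI records exactly when I pressed the keys.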

Learning this clip led me to a few observations:

  • the contours of the dynamics sound extremely smooth: unnatural and, at the same time, musically extremely effective (it reminds me of the incredible control of a pianist like Horowitz)
  • the notes are generally locally “playable” (I just mean this coarsely, i.e. it is physically possible to press those notes all at the same time). Why? The data is from humans playing piano, so in any brief time interval (e.g. the input window size), the sample is itself playable, and so an accurate generative model of this data would likely also be playable within any similar-sized time window (a coarse check of this idea is sketched just after this list)
  • at least one exception to playability is in “textures”: there are moments where (to my ears) there is a “sound” of piano but without clear beginnings of notes; this could be happening for a variety of reasons; for now, I “musically outlined” this effect by playing something suggestive of the texture that I was hearing (i.e. a perceptual approximation)
  • musically speaking, those non-piano sounds are some of the most interesting sounds for me
  • it’s quite hard to play at the speed and with the smoothness of the recording. one reason is that while the notes themselves are generally locally playable, the transitions between sections are fast, frequent and furious, even within each 10-second clip: that’s not easy to play!
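
For readers who want to poke at the “locally playable” idea directly on the MIDI file above, here is a minimal sketch of one coarse, automated check, assuming the pretty_midi library and that the file is saved as sample3.mid (a placeholder name). The window size and ten-note threshold are my own rough assumptions; a real playability check would also need to account for hand span, fingering and jumps.

```python
# A rough sketch of a coarse "local playability" check: slide a short window
# over the clip and flag moments where more notes sound at once than two hands
# could press. This ignores hand span, fingering and voicing, so it is only a
# first-pass filter. The filename, window size and threshold are assumptions.
import pretty_midi

WINDOW = 0.1      # seconds; roughly "all at the same time"
MAX_NOTES = 10    # two hands, ten fingers

pm = pretty_midi.PrettyMIDI('sample3.mid')
notes = [n for inst in pm.instruments for n in inst.notes]

t = 0.0
while t < pm.get_end_time():
    sounding = [n.pitch for n in notes if n.start < t + WINDOW and n.end > t]
    if len(sounding) > MAX_NOTES:
        print(f'{t:.2f}s: {len(sounding)} simultaneous notes -- probably not playable')
    t += WINDOW
```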

I hope these short notes provide some clues and ideas for how to further play with this fun generative model. Please feel free to add your own observations/insights/questions/transcriptions/etc. We are interested in building new tools for creative people and collaborating with the artistic community; if you have ideas of how such a collaboration might look, at least from your end, please feel free to comment on the Magenta discussion list or contact me directly.

–Sageev

Sageev Oore, a visiting machine learning researcher on the Magenta team at Google Brain, is a professor in the Dept of Math & Computer Science at Saint Mary’s University (Halifax, Canada), and records and performs as a musician.