Onsets and Frames is our new model for automatic polyphonic piano music transcription. Using this model, we can convert raw recordings of solo piano performances into MIDI.
For example, have you ever made a recording of yourself improvising at the piano and later wanted to know exactly what you played? This model can automatically transcribe that piano recording into a MIDI pianoroll that could be used to play the same music on a synthesizer or as a starting point for sheet music. Automatic transcription opens up many new possibilities for analyzing music that isn’t readily available in notated form and for creating much larger training datasets for generative models.
We’re able to achieve a new state of the art by using CNNs and LSTMs to predict pitch onset events and then using those predictions to condition framewise pitch predictions.
|Model|Transcription F1 score (0–100)|
|---|---|
|Previous State of the Art|23.14|
|Onsets and Frames|50.22|
More metrics and details are available in our paper.
The examples above are a good illustration of the performance of our system. There are definitely some mistakes, but it does a good job of capturing harmony, melody, and even rhythm.
Our model works as well as it does because we split the task of note detection across two stacks of neural networks: one stack is trained to detect only onset frames (the first few frames of every note) and one stack is trained to detect every frame where a note is active. Previous models used only a single stack, but we found that by separating out the onset detection task we were able to achieve much higher accuracy.
We use the output from the onset detector in two ways: we feed the raw output of that detector into the frame detector as an additional input, and we also restrict the final output of the model to start new notes only when the onset detector is confident that a note onset is in that frame.
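To make the second use concrete, here is a minimal NumPy sketch of onset-restricted decoding. It is an illustration under stated assumptions, not the released implementation: the function name `decode_with_onsets`, the `[num_frames, num_pitches]` array layout, and the 0.5 threshold are all hypothetical choices made for this example.

```python
import numpy as np


def decode_with_onsets(frame_probs, onset_probs, threshold=0.5):
    """Restrict frame predictions so notes can only start at detected onsets.

    Both inputs are assumed to be float arrays shaped [num_frames, num_pitches].
    The threshold value is an assumption made for this sketch.
    """
    frames = frame_probs > threshold
    onsets = onset_probs > threshold

    active = np.zeros_like(frames, dtype=bool)
    previous = np.zeros(frames.shape[1], dtype=bool)
    for t in range(frames.shape[0]):
        # A pitch stays on only while the frame detector keeps firing, and it
        # can switch on only in a frame where the onset detector also fires.
        current = frames[t] & (previous | onsets[t])
        active[t] = current
        previous = current
    return active
```

With this rule, a pitch that the frame detector switches on without a matching onset is dropped, and a note that ends cannot restart until a new onset is detected for that pitch.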
Our loss function is the sum of two cross-entropy losses: one from the onset side and one from the frame side. Within the frame-based loss term, we apply a weighting to encourage accuracy at the start of the note. Because the weight vector assigns higher weights to the early frames of notes, the model is incentivized to predict the beginnings of notes accurately, thus preserving the most important musical events of the piece.
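The loss described above can be sketched in a few lines of NumPy. This is a simplified illustration, not our training code: the helper names, the use of a mean over frames and pitches, and the weighting scheme (a hypothetical weight of 5 on onset frames and 1 everywhere else) are assumptions made for the example. See the paper for the actual weighting.

```python
import numpy as np


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Elementwise cross-entropy between binary labels and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    return -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))


def onset_emphasis_weights(onset_labels, onset_weight=5.0):
    """Hypothetical weighting: heavier weight on onset frames, 1.0 elsewhere."""
    return np.where(onset_labels > 0, onset_weight, 1.0)


def total_loss(onset_labels, onset_probs, frame_labels, frame_probs, frame_weights):
    """Sum of the onset loss and the weighted frame loss.

    All arrays are assumed to share the shape [num_frames, num_pitches].
    """
    onset_loss = binary_cross_entropy(onset_labels, onset_probs).mean()
    frame_loss = (frame_weights * binary_cross_entropy(frame_labels, frame_probs)).mean()
    return onset_loss + frame_loss
```

Under this weighting, missing the first frame of a note costs more than missing a frame in the middle of it, which is the incentive the weight vector is meant to create.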
The figure below illustrates the importance of restricting model output based on the onset detector. The first image shows the results from the frame and onset detectors. There are several examples of notes that either last for only a few frames or that reactivate briefly after being active for a while. The second image shows the frame results after being restricted by the onset detector. Most of the notes that were active for only a few frames did not have a corresponding onset detection and were removed. Cases where a note briefly reactivated after being active for a while were also removed because a second onset for that note was not detected.
The current quality of the model’s output is on the cusp of enabling downstream applications such as music information retrieval and automatic music generation. We’re working on improvements that we hope will make the model even more accurate and useful.
We’d love to hear about your experience with this model on the magenta-discuss list. Was there a transcription that produced particularly interesting results (good or bad)? Try it out by visiting the Onsets and Frames Colab Notebook or using the code on GitHub, and let us know!