MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) is a dataset composed of about 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.

Dataset

We partnered with the organizers of the International Piano-e-Competition for the raw data used in this dataset. During each installment of the competition, virtuoso pianists perform on Yamaha Disklaviers, which are concert-quality acoustic grand pianos with an integrated high-precision MIDI capture and playback system. The recorded MIDI data is of sufficient fidelity to allow the audition stage of the competition to be judged remotely by listening to contestant performances reproduced over the wire on another Disklavier instrument.

The dataset contains about 200 hours of paired audio and MIDI recordings from ten years of the International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions. Audio and MIDI files are aligned with ~3 ms accuracy and sliced into individual musical pieces, which are annotated with composer, title, and year of performance. Uncompressed audio is of CD quality or higher (44.1–48 kHz 16-bit PCM stereo).
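As a rough illustration of the audio format described above, the following sketch reads the header of a WAV file with Python's standard `wave` module and converts the ~3 ms alignment figure into a sample count. The function names and the file path you pass in are assumptions for illustration; they are not part of the dataset or its tooling.

```python
import wave


def describe_wav(path):
    """Return basic PCM parameters read from a WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def alignment_in_samples(sample_rate_hz, alignment_ms=3.0):
    """Convert an alignment tolerance in milliseconds to a number of audio samples."""
    return sample_rate_hz * alignment_ms / 1000.0
```

Under these assumptions, a 3 ms tolerance corresponds to about 132 samples at 44.1 kHz and 144 samples at 48 kHz, which gives a sense of how tightly note labels track the waveform.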

We also propose a train/validation/test split in which the same composition, even if performed by multiple contestants, does not appear in more than one subset. Repertoire is mostly classical, including composers from the 17th to early 20th century.
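One way to sanity-check this property is to group the metadata rows by composer and title and look for pieces assigned to more than one split. The sketch below assumes the metadata CSV has a header row using the field names listed in the Download section (`canonical_composer`, `canonical_title`, `split`); the function name is hypothetical. Because titles are not guaranteed to be standardized, an empty result is suggestive rather than conclusive.

```python
import csv
import io
from collections import defaultdict


def compositions_in_multiple_splits(csv_text):
    """Map (composer, title) to its sorted splits, for pieces seen in more than one split."""
    splits_by_piece = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["canonical_composer"], row["canonical_title"])
        splits_by_piece[key].add(row["split"])
    # Keep only the pieces whose performances land in two or more splits.
    return {k: sorted(v) for k, v in splits_by_piece.items() if len(v) > 1}
```

Passing the full text of the metadata CSV should return an empty dictionary if exact composer and title matches never cross splits.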

For more information about how the dataset was created and several applications of it, please see the paper where it was introduced: Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset.

For an example application of the dataset, see our blog post on Wave2Midi2Wave.

Download

MAESTRO is provided as a zip file containing the MIDI and WAV files as well as metadata in CSV and JSON formats. A MIDI-only archive of the dataset is also available.

The metadata files have the following fields for every MIDI/WAV pair:

Field	Description
canonical_composer	Composer of the piece. We have attempted to standardize on a single spelling for a given name.
canonical_title	Title of the piece. Not guaranteed to be standardized to a single representation.
split	Suggested train/validation/test split.
year	Year of performance.
midi_filename	MIDI filename.
audio_filename	WAV filename.
duration	Duration in seconds, based on the MIDI file.
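A minimal sketch of reading these fields with Python's standard `csv` module follows. It assumes the CSV has a header row whose column names match the fields above and that split values are lowercase strings such as `train`; both are assumptions to check against the actual file, and the function names are hypothetical.

```python
import csv
import io


def load_metadata(csv_text):
    """Parse metadata CSV text into a list of dicts with typed year and duration."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["year"] = int(row["year"])
        row["duration"] = float(row["duration"])  # seconds, based on the MIDI file
        rows.append(row)
    return rows


def select_split(rows, split):
    """Return (midi_filename, audio_filename) pairs for one suggested split."""
    return [
        (r["midi_filename"], r["audio_filename"])
        for r in rows
        if r["split"] == split
    ]
```

For example, `select_split(load_metadata(text), "validation")` would list the MIDI/WAV pairs suggested for validation, where `text` is the content of the metadata CSV.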

V3.0.0

In this update, we removed six erroneously included recordings that had string quartet accompaniment in addition to piano. Because these files occurred in both the train and test splits, this version is not compatible with V2.0.0.

The following recordings were removed:

2018/MIDI-Unprocessed_Chamber1_MID--AUDIO_07_R3_2018_wav--2
2018/MIDI-Unprocessed_Chamber2_MID--AUDIO_09_R3_2018_wav--3
2018/MIDI-Unprocessed_Chamber3_MID--AUDIO_10_R3_2018_wav--3
2018/MIDI-Unprocessed_Chamber4_MID--AUDIO_11_R3_2018_wav--3
2018/MIDI-Unprocessed_Chamber5_MID--AUDIO_18_R3_2018_wav--2
2018/MIDI-Unprocessed_Chamber6_MID--AUDIO_20_R3_2018_wav--3

maestro-v3.0.0.zip

Size: 101GB (120GB uncompressed)
SHA256: 6680fea5be2339ea15091a249fbd70e49551246ddbd5ca50f1b2352c08c95291

maestro-v3.0.0-midi.zip

Size: 56MB (81MB uncompressed)
SHA256: 70470ee253295c8d2c71e6d9d4a815189e35c89624b76d22fce5a019d5dde12c
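The SHA256 values listed for each archive can be used to confirm that a download completed intact. The sketch below uses Python's standard `hashlib` and reads the file in chunks so that the roughly 100 GB archive does not need to fit in memory. The function names and the local path are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `matches_checksum("maestro-v3.0.0-midi.zip", "70470ee2...")`, with the full published digest in place of the shortened string, would return `True` for an uncorrupted download in the current directory.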

Metadata files as separate downloads:

Statistics of the dataset:

Split	Performances	Duration (hours)	Size (GB)	Notes (millions)
Train	962	159.2	96.3	5.66
Validation	137	19.4	11.8	0.64
Test	177	20.0	12.1	0.74
Total	1276	198.7	120.2	7.04
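The performance counts and durations in this table can, in principle, be recomputed from the metadata. The sketch below aggregates `(split, duration_in_seconds)` pairs into per-split totals; how those pairs are obtained from the metadata file is left to the reader, and the function name is hypothetical. Small differences from the published figures would be expected from rounding.

```python
from collections import defaultdict


def split_statistics(rows):
    """Aggregate (split, duration_seconds) pairs into per-split counts and hours."""
    stats = defaultdict(lambda: {"performances": 0, "hours": 0.0})
    for split, duration_s in rows:
        stats[split]["performances"] += 1
        stats[split]["hours"] += duration_s / 3600.0  # seconds to hours
    return dict(stats)
```

Summing the per-split values gives the totals row; note that the published total duration (198.7) differs slightly from the sum of the rounded per-split durations (198.6), which is consistent with rounding.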

V2.0.0

In this update, we added another year of competition performances and preserved sostenuto (CC 66) and una corda (CC 67) messages in MIDI files, in addition to the sustain pedal (CC 64) messages present since V1.0.0.
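To illustrate how these three controllers might be separated once a MIDI file has been parsed, the sketch below groups control change events by pedal. It assumes the events are already available as `(time_in_seconds, controller_number, value)` tuples from a MIDI parser of your choice; that input format and the function name are assumptions, not something the dataset prescribes.

```python
# Controller numbers as stated above: sustain (64), sostenuto (66), una corda (67).
PEDAL_CONTROLLERS = {64: "sustain", 66: "sostenuto", 67: "una_corda"}


def pedal_events(control_changes):
    """Group (time_s, controller, value) tuples by pedal name, ignoring other controllers."""
    events = {name: [] for name in PEDAL_CONTROLLERS.values()}
    for time_s, controller, value in control_changes:
        name = PEDAL_CONTROLLERS.get(controller)
        if name is not None:
            events[name].append((time_s, value))
    return events
```

With V1.0.0 files, only the `sustain` list would be expected to contain events, since the other two controllers were preserved starting with V2.0.0.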

Crucially, this version has a new train/validation/test split, which is not compatible with V1.0.0.

maestro-v2.0.0.zip

Size: 103GB (122GB uncompressed)
SHA256: 572c6054e8d2c7219aa4df9a29357da0f9789524c11fa38cef7d4bd8542c93f0

maestro-v2.0.0-midi.zip

Size: 57MB (85MB uncompressed)
SHA256: ec2cc9d94886c6b376db1eaa2b8ad1ce62ff9f0a28b3744782b13163295dadf3

Metadata files as separate downloads:

Statistics of the dataset:

Split	Performances	Duration (hours)	Size (GB)	Notes (millions)
Train	967	161.3	97.7	5.73
Validation	137	19.4	11.8	0.64
Test	178	20.5	12.4	0.76
Total	1282	201.2	121.8	7.13

V1.0.0

This is the original release of the dataset, which was used to produce all results in the MAESTRO paper.

maestro-v1.0.0.zip

Size: 87GB (103GB uncompressed)
SHA256: 97471232457147d5bffa72db8c4897166ba52afd4a64197004b806c2ec85ad27

maestro-v1.0.0-midi.zip

Size: 45MB (67MB uncompressed)
SHA256: f620f9e1eceaab8beea10617599add2e9c83234199b550382a2f603098ae7135

Metadata files:

Statistics:

Split	Performances	Compositions (approx.)	Duration (hours)	Size (GB)	Notes (millions)
Train	954	295	140.1	83.6	5.06
Validation	105	60	15.3	9.1	0.54
Test	125	75	16.9	10.1	0.57
Total	1184	430	172.3	102.8	6.18

License

The dataset is made available by Google LLC under a Creative Commons Attribution Non-Commercial Share-Alike 4.0 (CC BY-NC-SA 4.0) license.

How to Cite

If you use the MAESTRO dataset in your work, please cite the paper where it was introduced:

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang,
  Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. "Enabling
  Factorized Piano Music Modeling and Generation with the MAESTRO Dataset."
  In International Conference on Learning Representations, 2019.

You can also use the following BibTeX entry:

@inproceedings{
  hawthorne2018enabling,
  title={Enabling Factorized Piano Music Modeling and Generation with the {MAESTRO} Dataset},
  author={Curtis Hawthorne and Andriy Stasyuk and Adam Roberts and Ian Simon and Cheng-Zhi Anna Huang and Sander Dieleman and Erich Elsen and Jesse Engel and Douglas Eck},
  booktitle={International Conference on Learning Representations},
  year={2019},
  url={https://openreview.net/forum?id=r1lYRjC9F7},
}

Please also make sure to specify which version of the dataset you are using.
