
This page contains some simple visual and audio demonstrations that explore the possibilities of MusicNet.


We can construct an aural representation of an aligned score-performance pair by mixing a short sine tone into the performance for each note, at the frequency indicated by the score and at the time indicated by the alignment. If the alignment is correct, the sine tones will exactly overlay the original performance; if the alignment is incorrect, the mix will sound dissonant. Here are some sample excerpts of recordings, with corresponding scores and MusicNet alignments.

Original recording     Score          MusicNet alignment
Bach Piano Prelude     Bach Score     Bach Alignment
Mozart Wind Sextet     Mozart Score   Mozart Alignment
Ravel String Quartet   Ravel Score    Ravel Alignment
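
The sine-tone overlay described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used to produce the excerpts: the function name `overlay_alignment`, the `(onset_seconds, midi_pitch)` note layout, the tone length, and the gain are all assumptions made for the example.

```python
import numpy as np

def overlay_alignment(audio, notes, sample_rate=44100, tone_seconds=0.1, gain=0.2):
    """Mix a short sine tone into `audio` at each aligned note onset.

    `notes` is an iterable of (onset_seconds, midi_pitch) pairs. This layout
    is a hypothetical one chosen for illustration.
    """
    mix = np.asarray(audio, dtype=np.float64).copy()
    tone_len = int(tone_seconds * sample_rate)
    t = np.arange(tone_len) / sample_rate
    envelope = np.hanning(tone_len)  # fade in and out to avoid clicks
    for onset, pitch in notes:
        freq = 440.0 * 2.0 ** ((pitch - 69) / 12.0)  # MIDI pitch number to Hz
        start = int(round(onset * sample_rate))
        if start < 0 or start >= len(mix):
            continue  # note falls outside the recording
        stop = min(start + tone_len, len(mix))
        n = stop - start
        mix[start:stop] += gain * envelope[:n] * np.sin(2 * np.pi * freq * t[:n])
    return mix
```

For example, `overlay_alignment(audio, [(0.5, 69)])` would add a brief A4 tone half a second into the recording. Listening to the mix is then a quick check of whether the labeled onsets and pitches agree with the performance.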


We can synthesize a recording using features learned from the MusicNet labels. This demo is created by splitting the original recording into 16,384-sample frames at a constant stride of 16 samples. We then compute features of each frame, using a representation learned by a neural network trained for multi-label note classification. We rewrite each frame as a linear combination of the network's sets of bottom-level weights, with each set scaled by its activation. Finally, we reconstruct a signal by summing the overlapping, rewritten frames and normalizing.

Here is the original recording used to create the demo above.
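
The framing and overlap-add steps of the resynthesis described above can be sketched as follows. This is an illustration under stated assumptions rather than the demo's actual code: `resynthesize` and `rewrite_frame` are hypothetical names, the ReLU activation is assumed, and "normalizing" is interpreted here as dividing each sample by the number of frames that cover it.

```python
import numpy as np

def rewrite_frame(frame, weights):
    """Rewrite a frame as a combination of weight vectors scaled by activations.

    `weights` has shape (n_features, frame_len): one row per set of
    bottom-level weights. The ReLU nonlinearity is an assumption.
    """
    activations = np.maximum(weights @ frame, 0.0)
    return activations @ weights

def resynthesize(signal, transform, frame_len=16384, stride=16):
    """Overlap-add reconstruction from transformed, strided frames."""
    signal = np.asarray(signal, dtype=np.float64)
    out = np.zeros_like(signal)
    coverage = np.zeros_like(signal)  # how many frames touch each sample
    for start in range(0, len(signal) - frame_len + 1, stride):
        frame = signal[start:start + frame_len]
        out[start:start + frame_len] += transform(frame)
        coverage[start:start + frame_len] += 1.0
    covered = coverage > 0
    out[covered] /= coverage[covered]  # assumed normalization: average the overlaps
    return out
```

Passing the identity function as `transform` returns the input unchanged wherever frames cover it, which is a useful sanity check before substituting `lambda f: rewrite_frame(f, weights)` with learned weights.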

Composition and Transcription

We can learn to compose by fitting a conditional distribution to the probability of a note in the score at a particular time, given the other notes played at the same time and the surrounding (past and future) notes in the score. These ideas are explored more deeply in Hadjeres and Pachet (2016). For these demos, we fit a simple linear model to the conditional note distribution of the Bach chorales (a high-quality version of this collection of scores can be found in the Music21 package). We can compose by generating a random score and progressively refining it by Gibbs sampling using the learned conditionals.
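
The sampling loop described above can be sketched as below. This is a minimal sketch, not the model behind the demos: it assumes the score is a binary (time step, pitch) matrix, that the "simple linear model" is passed through a logistic link, and that each pitch has its own weights over an odd-length window of neighboring time steps. The name `gibbs_compose` and the parameter shapes are hypothetical.

```python
import numpy as np

def gibbs_compose(weights, bias, n_steps, n_sweeps=10, rng=None):
    """Refine a random binary score by Gibbs sampling.

    weights: (n_pitches, window, n_pitches), linear weights over the notes
             in a window of time steps centered on the note being resampled
             (window is assumed odd).
    bias:    (n_pitches,) per-pitch offsets.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_pitches, window, _ = weights.shape
    half = window // 2
    # Zero-pad in time so every step has a full context window.
    padded = np.zeros((n_steps + 2 * half, n_pitches))
    padded[half:half + n_steps] = rng.random((n_steps, n_pitches)) < 0.5
    for _ in range(n_sweeps):
        for t in range(n_steps):
            for p in range(n_pitches):
                context = padded[t:t + window].copy()
                context[half, p] = 0.0  # exclude the note being resampled
                logit = np.sum(weights[p] * context) + bias[p]
                prob = 1.0 / (1.0 + np.exp(-logit))  # assumed logistic link
                padded[t + half, p] = float(rng.random() < prob)
    return padded[half:half + n_steps]
```

Each sweep visits every (time, pitch) cell once and resamples it from its conditional given the current state of its neighbors, so early sweeps look random and later sweeps increasingly reflect whatever structure the fitted weights encode.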

We can extend these ideas to create a transcription model if we condition on acoustic data in addition to contextual notes in the score. We initialize our transcription with the output of a frame-based acoustic model (like the ones described in our paper). We then correct flaws in the acoustic predictions with the same Gibbs sampling procedure we used to compose above.

The results here are modest; our frame-based classifier isn't particularly accurate yet, and the linear conditional model is too impoverished to represent sophisticated music-theoretic relationships in the score (compare our composition results above with Hadjeres and Pachet's LSTM-based models). But these results suggest that the sampling framework for transcription could be fruitful.
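
The transcription variant described above, which conditions on acoustic predictions as well as score context, might look like the following sketch. How the two sources of evidence are combined is not specified here, so adding the acoustic log-odds to the contextual logit is an assumption, as are the name `gibbs_transcribe`, the 0.5 initialization threshold, and the parameter shapes.

```python
import numpy as np

def gibbs_transcribe(acoustic_probs, weights, bias, n_sweeps=5, rng=None):
    """Refine frame-wise acoustic note predictions by Gibbs sampling.

    acoustic_probs: (n_steps, n_pitches) note probabilities from a
                    frame-based acoustic model.
    weights:        (n_pitches, window, n_pitches) context weights
                    (window is assumed odd).
    bias:           (n_pitches,) per-pitch offsets.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_steps, n_pitches = acoustic_probs.shape
    window = weights.shape[1]
    half = window // 2
    clipped = np.clip(acoustic_probs, 1e-6, 1.0 - 1e-6)
    acoustic_logits = np.log(clipped) - np.log1p(-clipped)
    # Initialize the transcription from the thresholded acoustic output.
    padded = np.zeros((n_steps + 2 * half, n_pitches))
    padded[half:half + n_steps] = acoustic_probs > 0.5
    for _ in range(n_sweeps):
        for t in range(n_steps):
            for p in range(n_pitches):
                context = padded[t:t + window].copy()
                context[half, p] = 0.0  # exclude the note being resampled
                # Assumed combination: acoustic evidence plus score context.
                logit = acoustic_logits[t, p] + np.sum(weights[p] * context) + bias[p]
                prob = 1.0 / (1.0 + np.exp(-logit))
                padded[t + half, p] = float(rng.random() < prob)
    return padded[half:half + n_steps]
```

With `n_sweeps=0` the function simply returns the thresholded acoustic predictions, which gives a baseline against which the effect of the sampling can be compared.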


We can learn features from the MusicNet labels using a multilayer perceptron optimized for multi-label note classification with square loss. Here are some examples of the weights learned by this network with a receptive field of 16,384 samples. The left column is the full set of bottom-level weights; the middle column is a magnified view of the middle 2048 weights of each set; the right column is the spectrogram of each set of weights on the left. Each set of weights is sensitive to distinct harmonic structure. The weights decay at the boundaries because distant information is less relevant to local note prediction.
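
As a rough illustration of the training objective described above, here is a one-hidden-layer perceptron trained with square loss on binary note labels. It is a sketch under assumptions, not the network used for the figures: the layer sizes, the ReLU nonlinearity, the initialization, plain gradient descent, and the names `init_mlp`, `forward`, and `square_loss_step` are all hypothetical. The rows of `W1` play the role of the sets of bottom-level weights.

```python
import numpy as np

def init_mlp(n_inputs, n_hidden, n_notes, rng):
    """Random initial weights; each row of W1 is one set of bottom-level weights."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(n_inputs), (n_hidden, n_inputs)),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(n_hidden), (n_notes, n_hidden)),
    }

def forward(params, frames):
    """frames: (batch, n_inputs) audio frames. Returns hidden features and note scores."""
    hidden = np.maximum(frames @ params["W1"].T, 0.0)  # assumed ReLU
    return hidden, hidden @ params["W2"].T

def square_loss_step(params, frames, labels, lr=1e-2):
    """One gradient-descent step on the mean square loss; returns the loss before the step."""
    hidden, pred = forward(params, frames)
    err = pred - labels  # labels: (batch, n_notes) binary note indicators
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    batch = len(frames)
    grad_w2 = err.T @ hidden / batch
    grad_hidden = (err @ params["W2"]) * (hidden > 0)  # backpropagate through ReLU
    grad_w1 = grad_hidden.T @ frames / batch
    params["W1"] -= lr * grad_w1
    params["W2"] -= lr * grad_w2
    return loss
```

In the setting described above, `n_inputs` would be 16,384 to match the receptive field, and inspecting individual rows of `params["W1"]` after training corresponds to viewing the learned weights.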