This page contains some simple visual and audio demonstrations that explore the possibilities of MusicNet.
We can construct an aural representation of an aligned score-performance pair by mixing a short sine wave into the performance for each note, with the frequency indicated by the score at the time indicated by the alignment. If the alignment is correct, the sine tones will exactly overlay the original performance; if the alignment is incorrect, the mix will sound dissonant. Here are some sample excerpts of recordings, with corresponding scores and MusicNet alignments.
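This sonification can be sketched in a few lines, assuming the aligned labels are available as `(start_sample, end_sample, midi_pitch)` triples at the recording's 44.1 kHz sample rate; the function and label format here are our own illustrative names, not the MusicNet file layout:

```python
import numpy as np

SAMPLE_RATE = 44100  # MusicNet recordings are sampled at 44.1 kHz

def midi_to_hz(pitch):
    """Equal-temperament frequency of a MIDI pitch number."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)

def sonify_alignment(performance, labels, tone_gain=0.2):
    """Mix a sine tone into the performance for each aligned note.

    performance : 1-D float array of audio samples.
    labels      : iterable of (start_sample, end_sample, midi_pitch)
                  triples (an assumed label format for this sketch).
    """
    mix = performance.astype(np.float64).copy()
    for start, end, pitch in labels:
        t = np.arange(end - start) / SAMPLE_RATE
        tone = np.sin(2 * np.pi * midi_to_hz(pitch) * t)
        mix[start:end] += tone_gain * tone
    return mix / np.max(np.abs(mix))  # normalize to avoid clipping
```

If the alignment is off, each tone lands at the wrong moment relative to the performed note, and the mix sounds audibly dissonant.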
We can synthesize a recording using features learned from the MusicNet labels. This demo is created by splitting the original recording into 16384-sample frames at a constant stride of 16 samples. We then compute features of each frame, using a representation learned by a neural network trained for multi-label note classification. We rewrite each frame as a linear combination of the network's bottom-level weight vectors, weighted by their activations on that frame. Finally, we reconstruct a signal by summing the overlapping rewritten frames and normalizing.
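The overlap-add reconstruction described above can be sketched as follows. The simple dot-product features and the `(n_features, frame)` weight layout are illustrative assumptions standing in for the trained network's actual representation:

```python
import numpy as np

def resynthesize(signal, weights, frame=16384, stride=16):
    """Rewrite each frame as a linear combination of learned weight
    vectors, then reconstruct the signal by overlap-add.

    signal  : 1-D float array of audio samples.
    weights : (n_features, frame) array of bottom-level weight vectors
              (an assumed layout; the demo uses weights learned by the
              note-classification network).
    """
    out = np.zeros(len(signal), dtype=np.float64)
    counts = np.zeros(len(signal), dtype=np.float64)
    for start in range(0, len(signal) - frame + 1, stride):
        x = signal[start:start + frame]
        activations = weights @ x         # features of this frame
        recon = weights.T @ activations   # linear recombination of the weights
        out[start:start + frame] += recon
        counts[start:start + frame] += 1.0
    out /= np.maximum(counts, 1.0)        # average the overlapping frames
    return out / np.max(np.abs(out))      # normalize
```

With a stride of 16 against a frame length of 16384, each output sample is an average over roughly a thousand overlapping reconstructions, which smooths artifacts of any single frame.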
Here is the original recording used to create the demo above.
Composition and Transcription
We can learn to compose by fitting a conditional distribution to the probability of a note appearing in the score at a particular time, given the other notes played at the same time and at surrounding times (past and future) in the score. These ideas are explored more deeply in Hadjeres and Pachet (2016). For these demos, we fit a simple linear model to the conditional note distribution of the Bach chorales (a high-quality version of this collection of scores can be found in the Music21 package). We can compose by generating a random score and progressively refining it by Gibbs sampling using the learned conditionals:
We can extend these ideas to create a transcription model if we condition on acoustic data in addition to contextual notes in the score. We initialize our transcription with the output of a frame-based acoustic model (like the ones described in our paper). We then fix up flaws in the acoustic predictions with the same Gibbs sampling procedure we used to compose above:
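The transcription variant can be sketched by adding the acoustic model's per-note evidence into the same per-cell conditional before sampling. The `(T, P)` array of acoustic logits is a hypothetical interface to a frame-based acoustic model, and the weight layout matches the simplified composition sketch:

```python
import numpy as np

def gibbs_transcribe(acoustic_logits, weights, bias, steps=20, rng=None):
    """Refine frame-based acoustic predictions by Gibbs sampling.

    acoustic_logits : (T, P) per-note logits from an acoustic model
                      (an assumed interface to such a model).
    weights, bias   : linear score-model parameters; weights has shape
                      (2K+1, P) over surrounding time steps.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, P = acoustic_logits.shape
    K = (weights.shape[0] - 1) // 2
    roll = (acoustic_logits > 0).astype(np.float64)  # acoustic model's hard predictions
    for _ in range(steps):
        for t in range(T):
            for p in range(P):
                roll[t, p] = 0.0  # exclude the cell itself from its context
                lo, hi = max(0, t - K), min(T, t + K + 1)
                w = weights[lo - (t - K): lo - (t - K) + (hi - lo)]
                # acoustic evidence plus the score model's contextual score
                logit = acoustic_logits[t, p] + bias + np.sum(w * roll[lo:hi])
                roll[t, p] = float(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
    return roll
```

Cells where the acoustic model is confident are rarely flipped, while ambiguous cells are resolved by the score model's notion of which note combinations are musically likely.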