Do Music Generation Models Encode Music Theory?

Brown University, Carnegie Mellon University
ISMIR 2024


Abstract

Music foundation models possess impressive music generation capabilities. When people compose music, they may infuse their understanding of music into their work by using notes and intervals to craft melodies, chords to build progressions, and tempo to create a rhythmic feel. To what extent is this true of music generation models? More specifically, are fundamental Western music theory concepts observable within the "inner workings" of these models? Recent work proposed leveraging latent audio representations from music generation models for music information retrieval tasks (e.g. genre classification, emotion recognition), which suggests that high-level musical characteristics are encoded within these models. However, probing individual music theory concepts (e.g. tempo, pitch class, chord quality) remains under-explored. Thus, we introduce SynTheory, a synthetic MIDI and audio dataset of music theory concepts, covering tempos, time signatures, notes, intervals, scales, chords, and chord progressions. We then propose a framework to probe for these music theory concepts in music foundation models (Jukebox and MusicGen) and assess how strongly they encode these concepts within their internal representations. Our findings suggest that music theory concepts are discernible within foundation models and that the degree to which they are detectable varies by model size and layer.

Overview

SynTheory Benchmark Overview

Overview of our SynTheory benchmark and our Jukebox and MusicGen probing setup. SynTheory consists of Rhythmic (tempos and time signatures) and Tonal (notes, intervals, scales, chords, and chord progressions) concepts, and we assess whether music foundation models (Jukebox and MusicGen) encode these concepts within their internal representations. For each SynTheory task, we pass an audio input embodying the concept (e.g. a Perfect 4th) into a pretrained foundation model: its audio codec tokenizes the audio into discrete audio tokens, the decoder language model processes these tokens, and we extract representations from its layers. We then train a probe classifier (linear or two-layer MLP) on these representations to predict the corresponding class (e.g. pitch class, interval, or chord) for each SynTheory concept.
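
As a concrete illustration of the probing step, the sketch below (not the authors' released code) trains a linear probe on precomputed, time-averaged embeddings with scikit-learn. The file names and the 80/20 split are assumptions made only for this example.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one time-averaged embedding per audio clip, extracted
# beforehand from a Jukebox or MusicGen layer, plus the concept label per clip.
X = np.load("embeddings.npy")   # shape: (num_clips, hidden_dim)
y = np.load("labels.npy")       # e.g. pitch-class index 0-11 for the Notes task

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Linear probe; the setup above also allows a two-layer MLP probe instead.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))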

SynTheory: Synthetic Dataset of Music Theory Concepts

Tempo

The Tempo dataset consists of 161 integer tempos ranging from 50 BPM to 210 BPM (inclusive), 5 percussive instrument types, and 5 random start-time offsets.
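
A minimal sketch of how this configuration grid could be enumerated is given below; the drum-kit names and offset values are placeholders, not the dataset's actual settings.

from itertools import product

tempos = range(50, 211)                                     # 161 integer BPM values, 50-210 inclusive
drum_kits = ["kit_0", "kit_1", "kit_2", "kit_3", "kit_4"]   # 5 percussive instrument types (placeholder names)
offsets = [0.0, 0.1, 0.2, 0.3, 0.4]                         # 5 start-time offsets in seconds (placeholder values)

configs = list(product(tempos, drum_kits, offsets))
print(len(configs))  # 161 * 5 * 5 = 4025 tempo examples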



Time Signatures

The Time Signatures dataset consists of 8 time signatures, 5 percussive instrument types, 10 random start-time offsets, and 3 reverb levels.

The 8 time signatures are 2/2, 2/4, 3/4, 3/8, 4/4, 6/8, 9/8, and 12/8.
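
For illustration, the sketch below converts each of these time signatures into a one-bar click pattern with an accented downbeat. It uses a simplified beat-by-beat convention (compound meters are not grouped in threes) and is not the dataset's actual generator.

def click_pattern(time_signature: str):
    """Return (onset in quarter notes, is_downbeat) pairs for one bar."""
    numerator, denominator = (int(x) for x in time_signature.split("/"))
    beat_length = 4 / denominator              # one notated beat, in quarter notes
    return [(i * beat_length, i == 0) for i in range(numerator)]

for ts in ["2/2", "2/4", "3/4", "3/8", "4/4", "6/8", "9/8", "12/8"]:
    print(ts, click_pattern(ts))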



Notes

The Notes dataset consists of 12 pitch classes, 9 octaves, and 92 instrument types.

The 12 pitch classes are C, C#, D, D#, E, F, F#, G, G#, A, A#, and B.
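
Each example pairs a pitch class with an octave; the sketch below shows the standard mapping from such a pair to a MIDI note number. The octave values used here are illustrative.

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_note(pitch_class: str, octave: int) -> int:
    # Standard MIDI convention: C-1 is note 0, so C4 (middle C) is 60.
    return 12 * (octave + 1) + PITCH_CLASSES.index(pitch_class)

print(midi_note("C", 4))   # 60, middle C
print(midi_note("A", 4))   # 69, concert A (440 Hz)

In the probing setup, the predicted class is the 12-way pitch class, while octave and instrument vary across examples.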



Intervals

The Intervals dataset consists of 12 interval sizes, 12 root notes, 92 instrument types, and 3 play styles.

The 12 intervals are minor 2nd, Major 2nd, minor 3rd, Major 3rd, Perfect 4th, Tritone, Perfect 5th, minor 6th, Major 6th, minor 7th, Major 7th, and Perfect Octave.
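
The sketch below lists each interval's size in semitones and builds an interval example as a root note plus that offset; the three play styles mentioned above are not modeled here.

INTERVALS = {
    "minor 2nd": 1, "Major 2nd": 2, "minor 3rd": 3, "Major 3rd": 4,
    "Perfect 4th": 5, "Tritone": 6, "Perfect 5th": 7, "minor 6th": 8,
    "Major 6th": 9, "minor 7th": 10, "Major 7th": 11, "Perfect Octave": 12,
}

def interval_notes(root_midi: int, interval_name: str):
    # An interval example is the root plus the note that many semitones above it.
    return root_midi, root_midi + INTERVALS[interval_name]

print(interval_notes(60, "Perfect 4th"))  # (60, 65): C4 and F4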



Scales

The Scales dataset consists of 7 modes, 12 root notes, 92 instrument types, and 2 play styles.

The 7 modes are Ionian, Dorian, Phrygian, Lydian, Mixolydian, Aeolian, and Locrian.
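
These seven modes are rotations of the major-scale step pattern; a minimal sketch of that construction follows.

MODES = ["Ionian", "Dorian", "Phrygian", "Lydian", "Mixolydian", "Aeolian", "Locrian"]
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]   # Ionian (major) whole/half-step pattern in semitones

def mode_scale(root_midi: int, mode: str):
    shift = MODES.index(mode)
    steps = MAJOR_STEPS[shift:] + MAJOR_STEPS[:shift]   # rotate the step pattern
    notes = [root_midi]
    for step in steps:
        notes.append(notes[-1] + step)
    return notes

print(mode_scale(60, "Dorian"))  # [60, 62, 63, 65, 67, 69, 70, 72] = C Dorian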



Chords

The Chords dataset consists of 4 chord types, 3 inversions, 12 root notes, and 92 instrument types.

The 4 chord types are major, minor, augmented, and diminished.

The 3 inversions are root position, first inversion, and second inversion.
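
A minimal sketch of how a chord example can be assembled from these three factors (quality, inversion, and root) is shown below; instrument choice is omitted.

CHORD_INTERVALS = {
    "major":      [0, 4, 7],
    "minor":      [0, 3, 7],
    "augmented":  [0, 4, 8],
    "diminished": [0, 3, 6],
}

def triad(root_midi: int, quality: str, inversion: int):
    notes = [root_midi + i for i in CHORD_INTERVALS[quality]]
    for _ in range(inversion):                # 0 = root position, 1 = first, 2 = second inversion
        notes = notes[1:] + [notes[0] + 12]   # move the lowest note up an octave
    return notes

print(triad(60, "minor", 1))  # [63, 67, 72]: C minor in first inversion (Eb4, G4, C5)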



Chord Progressions

The Chord Progressions dataset consists of 19 chord progressions (10 in major mode and 9 in natural minor mode), 12 root notes, and 92 instrument types.

The major mode chord progressions are (I–IV–V–I), (I–IV–vi–V), (I–V–vi–IV), (I–vi–IV–V), (ii–V–I–vi), (IV–I–V–vi), (IV–V–iii–vi), (V–IV–I–V), (V–vi–IV–I), and (vi–IV–I–V).

The natural minor mode chord progressions are (i–ii◦–v–i), (i–III–iv–i), (i–iv–v–i), (i–VI–III–VII), (i–VI–VII–i), (i–VI–VII–III), (i–VII–VI–IV), (iv–VII–i–i), and (VII–vi–VII–i).
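
To illustrate how such Roman-numeral progressions can be realized as note content, the sketch below resolves each numeral to a diatonic triad over a major or natural-minor scale. The numeral-to-quality lookup covers only the numerals listed above and is an assumption of this example, not the dataset's exact voicing.

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]
MINOR_SCALE = [0, 2, 3, 5, 7, 8, 10]
QUALITIES = {"major": [0, 4, 7], "minor": [0, 3, 7], "diminished": [0, 3, 6]}

# 0-based scale degree and triad quality for each Roman numeral used above.
NUMERALS = {
    "I": (0, "major"), "ii": (1, "minor"), "iii": (2, "minor"), "IV": (3, "major"),
    "V": (4, "major"), "vi": (5, "minor"),
    "i": (0, "minor"), "ii°": (1, "diminished"), "III": (2, "major"),
    "iv": (3, "minor"), "v": (4, "minor"), "VI": (5, "major"), "VII": (6, "major"),
}

def realize(progression, root_midi: int, mode: str):
    scale = MAJOR_SCALE if mode == "major" else MINOR_SCALE
    chords = []
    for numeral in progression:
        degree, quality = NUMERALS[numeral]
        chord_root = root_midi + scale[degree]
        chords.append([chord_root + i for i in QUALITIES[quality]])
    return chords

print(realize(["I", "V", "vi", "IV"], 60, "major"))
# [[60, 64, 67], [67, 71, 74], [69, 72, 76], [65, 69, 72]]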

BibTeX

@inproceedings{Wei2024-music,
  title={Do Music Generation Models Encode Music Theory?},
  author={Wei, Megan and Freeman, Michael and Donahue, Chris and Sun, Chen},
  booktitle={International Society for Music Information Retrieval},
  year={2024}
}