Audio and Speech Generation

Audio generation has its own constraints: very long sequences, high temporal resolution, perceptual quality, speaker identity, prosody, music structure, and synchronization with text or video.

Raw Audio, Codecs, and Music

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2016 | WaveNet | Raw audio autoregression | Autoregressive waveform model for speech and music. |
| 2020 | Jukebox (OpenAI) | Music generation | Multi-scale VQ-VAE plus autoregressive Transformers for raw-audio music. |
| 2022 | AudioLM | Audio language modeling | Discrete audio tokens for coherent speech and music continuation. |
| 2023 | MusicLM | Text-to-music | Hierarchical sequence modeling for long-form text-conditioned music. |
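
To make "raw audio autoregression" concrete: WaveNet predicts each sample as one of 256 classes after mu-law companding, which turns waveform modeling into next-token classification. The sketch below shows only that quantization step in plain Python. The function names are illustrative and not taken from any released codebase, and the rounding convention is an assumption.

```python
import math

MU = 255  # 256 quantization levels, as in 8-bit mu-law companding


def mu_law_encode(sample: float, mu: int = MU) -> int:
    """Map a float sample in [-1, 1] to an integer class in [0, mu]."""
    s = max(-1.0, min(1.0, sample))  # clip out-of-range input
    # Logarithmic compression: more resolution near zero amplitude.
    compressed = math.copysign(math.log1p(mu * abs(s)) / math.log1p(mu), s)
    # Shift [-1, 1] to [0, mu] and round to the nearest integer class.
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(token: int, mu: int = MU) -> float:
    """Approximate inverse of mu_law_encode (lossy because of rounding)."""
    compressed = 2.0 * token / mu - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)


# Usage: tokenize 10 ms of a 440 Hz sine sampled at 16 kHz.
sample_rate = 16_000
wave = [0.8 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(160)]
tokens = [mu_law_encode(x) for x in wave]
reconstructed = [mu_law_decode(q) for q in tokens]
```

Even this 10 ms clip yields 160 tokens, which illustrates why sample-level models face very long sequences and why later entries in the table move to compressed discrete codes.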

Text-to-Audio and Speech Systems

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2022 | AudioGen | Text-to-audio | Autoregressive generation over learned discrete audio representations. |
| 2023 | AudioLDM | Audio latent diffusion | Text-to-audio generation using latent diffusion and CLAP conditioning. |
| 2023 | Voicebox | Speech flow matching | Non-autoregressive flow-matching model for speech generation, editing, and style transfer. |
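
As a rough illustration of the flow-matching objective behind the Voicebox entry, the sketch below builds one training pair along a straight-line (optimal-transport) path from noise `x0` to data `x1`. It is a minimal sketch under stated assumptions: the function names and the `sigma_min` default are hypothetical, the feature vector is a toy example, and the neural network, text conditioning, and masking used by the real system are omitted.

```python
import random


def ot_path(x0, x1, t, sigma_min=1e-5):
    """Point on the straight-line path from noise x0 (t=0) toward data x1 (t=1)."""
    scale = 1.0 - (1.0 - sigma_min) * t
    return [scale * n + t * d for n, d in zip(x0, x1)]


def target_velocity(x0, x1, sigma_min=1e-5):
    """Time derivative of ot_path; the regression target for the model."""
    return [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]


def squared_error(predicted, target):
    """Flow-matching loss for one example: squared error between velocities."""
    return sum((p - u) ** 2 for p, u in zip(predicted, target))


# Usage: one training pair for a toy 4-dimensional "audio feature" vector.
rng = random.Random(0)
x1 = [0.2, -0.5, 0.9, 0.0]               # stand-in for a data frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # time drawn uniformly from [0, 1)
x_t = ot_path(x0, x1, t)                 # model input (together with t)
u_t = target_velocity(x0, x1)            # what the model should predict
```

Because the target does not depend on previously generated samples, all frames can be trained and sampled in parallel, which is what "non-autoregressive" refers to in the table.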

Reading Path

| Step | Read |
|------|------|
| 1 | WaveNet for raw audio autoregression. |
| 2 | Jukebox for long-range music generation with discrete audio codes. |
| 3 | AudioLM and AudioGen for tokenized audio language modeling. |
| 4 | AudioLDM for latent diffusion in audio. |
| 5 | MusicLM and Voicebox for modern text-conditioned music and speech generation. |