Audio and Speech Generation
Audio generation has its own constraints: very long sequences, high temporal resolution, perceptual quality, speaker identity, prosody, music structure, and synchronization with text or video.
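To make the "very long sequences" constraint concrete, the sketch below counts how many generation steps a model would need for a 30-second clip at different temporal resolutions. The sample rates are common audio rates; the codec setting (50 frames per second with 4 codebooks) is a hypothetical illustration and is not taken from any paper listed here.

```python
def sequence_length(duration_s: float, rate_hz: float, streams: int = 1) -> int:
    """Number of discrete steps needed to cover `duration_s` seconds of audio.

    `streams` counts parallel token streams per frame (for example, codebooks
    in a quantized codec); for raw waveforms it is 1.
    """
    return int(round(duration_s * rate_hz * streams))


# Illustrative settings only (assumptions, not values from a specific paper).
duration_s = 30.0
raw_16k = sequence_length(duration_s, 16_000)        # one step per waveform sample
raw_44k = sequence_length(duration_s, 44_100)        # CD-quality sample rate
codec = sequence_length(duration_s, 50, streams=4)   # hypothetical tokenized audio

print(f"raw 16 kHz:   {raw_16k:>9,} steps")
print(f"raw 44.1 kHz: {raw_44k:>9,} steps")
print(f"codec tokens: {codec:>9,} steps")
print(f"16 kHz raw audio is {raw_16k // codec}x longer than the token sequence")
```

Under these assumed settings, raw-sample modeling needs hundreds of thousands of steps for a clip that a tokenized representation covers in a few thousand, which is one reason the later entries in this section work with discrete codes or latents rather than waveform samples.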
Raw Audio, Codecs, and Music
| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2016 | WaveNet | Raw audio autoregression | Autoregressive waveform model for speech and music. |
| 2020 | Jukebox (OpenAI) | Music generation | Multi-scale VQ-VAE plus autoregressive Transformers for raw-audio music. |
| 2022 | AudioLM | Audio language modeling | Discrete audio tokens for coherent speech and music continuation. |
| 2023 | MusicLM | Text-to-music | Hierarchical sequence modeling for long-form text-conditioned music. |
Text-to-Audio and Speech Systems
| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2022 | AudioGen | Text-to-audio | Autoregressive generation over learned discrete audio representations. |
| 2023 | AudioLDM | Audio latent diffusion | Text-to-audio generation using latent diffusion and CLAP conditioning. |
| 2023 | Voicebox | Speech flow matching | Non-autoregressive flow-matching model for speech generation, editing, and style transfer. |
Reading Path
| Step | Read |
| --- | --- |
| 1 | WaveNet for raw audio autoregression. |
| 2 | Jukebox for long-range music generation with discrete audio codes. |
| 3 | AudioLM and AudioGen for tokenized audio language modeling. |
| 4 | AudioLDM for latent diffusion in audio. |
| 5 | MusicLM and Voicebox for modern text-conditioned music and speech generation. |