Overview
WaveNet is a neural autoregressive generative model that operates directly on raw audio, one sample at a time. It uses dilated causal convolutions to predict a probability distribution over the next sample, yielding highly natural speech and expressive prosody, although sample-by-sample generation is expensive without acceleration.
Description
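WaveNet treats a waveform as a sequence and factorizes its joint probability into a product of conditional distributions, p(x) = prod_t p(x_t | x_1, ..., x_{t-1}), with one factor per audio sample. A stack of dilated causal convolutional layers with gated activations gives a receptive field that grows exponentially with depth, without recurrence, so the model captures long-range dependencies in pitch, formants, and rhythm. Training uses teacher forcing: the ground-truth samples are known, so all time steps are predicted in parallel. The output is either a softmax over mu-law quantized amplitudes (256 classes in the original model) or, in later variants, a continuous likelihood such as a mixture of logistics. For TTS, conditioning vectors supply linguistic features, F0, and speaker identity. At inference the network samples one value at a time and feeds it back as input, which preserves quality but is slow. This cost motivated faster successors: Parallel WaveNet, which distills the autoregressive model into a parallel feed-forward network, and WaveRNN, a compact recurrent model that remains autoregressive but is efficient enough for real-time use. Compared with HMM-based TTS, WaveNet produces far more natural timbre and prosody, handles coarticulation gracefully, and adapts well to different speakers and styles. It also generalizes to music and sound effects because the architecture models raw audio directly rather than vocoder parameters.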
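The mechanics described above can be illustrated with a small NumPy sketch. It is not DeepMind's implementation. The function names, the kernel size of 2, and the dilation schedule (1, 2, 4, ..., 512, repeated) are assumptions chosen for illustration. It shows how the receptive field of a dilated causal stack adds up, that such a stack never looks at future samples, and how 8-bit mu-law quantization maps a waveform in [-1, 1] to 256 classes.

```python
import numpy as np


def receptive_field(dilations, kernel_size=2):
    """Samples (current one included) visible to a stack of dilated causal convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """Causal 1-D convolution: y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero, so y[t]
    depends only on x[t] and earlier values.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift == 0:
            y += w * x
        elif shift < len(x):
            y[shift:] += w * x[:-shift]
    return y


def mu_law_encode(x, mu=255):
    """Compand a waveform in [-1, 1] and quantize it to integer classes 0..mu."""
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((companded + 1.0) / 2.0 * mu + 0.5).astype(np.int64)


def mu_law_decode(q, mu=255):
    """Map integer classes 0..mu back to approximate amplitudes in [-1, 1]."""
    companded = 2.0 * np.asarray(q, dtype=float) / mu - 1.0
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu


if __name__ == "__main__":
    # Assumed schedule: dilations double from 1 to 512, then the block repeats.
    one_block = [2 ** i for i in range(10)]
    print(receptive_field(one_block))      # 1024
    print(receptive_field(one_block * 3))  # 3070

    # An impulse pushed through a small stack only affects later positions.
    signal = np.zeros(16)
    signal[5] = 1.0
    out = signal
    for d in [1, 2, 4]:
        out = causal_dilated_conv(out, [1.0, 1.0], d)
    print(np.nonzero(out)[0])  # indices 5 through 12

    # Round trip through 8-bit mu-law quantization.
    wave = np.sin(np.linspace(0.0, 2.0 * np.pi, 8))
    print(mu_law_decode(mu_law_encode(wave)))
```

Doubling the dilation at each layer is what makes the receptive field grow exponentially with depth while the parameter count grows only linearly. The mu-law step spends more of the 256 classes on small amplitudes, which is where most speech energy sits.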
About DeepMind
DeepMind is a technology company that specializes in artificial intelligence and machine learning.
Industry:
Research Services
Company Size:
501-1000
Location:
London, GB