The vocoder is one of those mysterious technologies that’s far more widely used than understood. Here I explain what it is, how it works, and why you should care. A casual music listener knows the vocoder best as a way to make that robot voice effect that Daft Punk uses all the time.
Here’s Huston Singletary demonstrating the vocoder in Ableton Live.
This is a nifty effect, but why should you care? For one thing, you use this technology every time you talk on your cell phone. For another, this effect gave rise to Auto-Tune, which, love it or hate it, is the defining sound of contemporary popular music. Let’s dive in!
To understand the vocoder, first you have to know a bit about how sound works. Jack Schaedler made this delightful interactive that will get you started. I wrote my own basic explanation of what sound is and how you digitize it. The takeaway is this: as things in the world vibrate, they make the air pressure fluctuate. Your ear is a very precise tool for measuring momentary changes in air pressure, and it’s able to decode these changes as sound. You can also use microphones to convert air pressure fluctuations into electrical current fluctuations, which you can then transmit, amplify, record, and so on.
In the 1930s, Bell Labs began researching ways to digitize voice transmissions. In theory, this is not all that difficult to do. You just take regular readings of the voltage coming off the microphone and store them as numbers. The problem is that you need a whole lot of numbers to do it accurately. The standard for compact discs calls for 44,100 readings per second, with each reading taking up two bytes of memory. That works out to about five megabytes of data per minute of audio: more than you'd want to push over a phone connection even today, and hopelessly beyond 1930s technology.
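You can check that five-megabyte figure with some quick arithmetic (for mono audio; stereo doubles it):

```python
# Back-of-the-envelope arithmetic for uncompressed CD-quality audio:
# 44,100 two-byte readings per second, for sixty seconds.
sample_rate = 44_100        # readings per second
bytes_per_sample = 2        # 16 bits each
seconds = 60

bytes_per_minute = sample_rate * bytes_per_sample * seconds
print(bytes_per_minute)                     # 5292000 bytes
print(round(bytes_per_minute / 1_000_000))  # ~5 megabytes per minute
```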
The French mathematician Joseph Fourier realized that you can express any periodic waveform, no matter how complicated, as the sum of a bunch of simple sine waves. That's super helpful, because sine waves are easy to express and manipulate mathematically. Breaking a waveform down into its component sine waves is called a Fourier transform. Here's an example:
You can also represent sine waves as the path swept out by a clock hand going around and around. The sci-fi-sounding term for one of these clock hands is a phasor.
If you connect a bunch of phasors together, they can draw any crazy waveform you want. This is a difficult idea to express verbally, but it makes more sense if you play with Jack Schaedler's cool interactive. Click the image below.
“Phasor magnitudes” is the daunting math term for the size of the clocks. If you make a list of the phasor magnitudes, you get a very nice and compact numerical expression of your super complicated waveform. This is way more efficient and technologically tractable than trying to store a billion individual voltage readings.
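You can see this compression in action with a few lines of Python and NumPy (my own sketch, not from any particular vocoder implementation). A waveform made of two sines gets boiled down to just two meaningful phasor magnitudes:

```python
import numpy as np

# One second of a "complicated" waveform: a 440 Hz sine plus a
# quieter 880 Hz sine, sampled 4096 times.
n = 4096
t = np.arange(n) / n
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# The Fourier transform recovers the phasor magnitudes: one number per
# frequency, telling you how big that frequency's "clock hand" is.
magnitudes = np.abs(np.fft.rfft(signal)) / (n / 2)

# Only two of the 2049 magnitudes are significant -- the whole waveform
# compresses to "440 Hz at amplitude 1.0, 880 Hz at amplitude 0.5".
peaks = np.nonzero(magnitudes > 0.1)[0]
print(peaks, magnitudes[peaks].round(2))  # [440 880] [1.  0.5]
```

Instead of 4,096 voltage readings, two numbers describe the whole signal.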
This plot of the Fourier transform is called a spectrogram. Time goes from left to right. The vertical axis represents frequency, what musicians call pitch. Think of the lower frequencies as phasors spinning around more slowly, and the higher frequencies as phasors spinning around faster. The colors and height both show amplitude, also known as loudness. Warmer colors mean the phasors are bigger, and cooler colors mean the phasors are smaller.
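A spectrogram is just many Fourier transforms stacked side by side. Here's a bare-bones version in NumPy (a toy sketch; real tools use overlapping windows and fancier plotting) applied to a tone that jumps in pitch halfway through:

```python
import numpy as np

# A tone that jumps from 300 Hz to 1200 Hz halfway through one second.
fs = 8000
t = np.arange(fs) / fs
tone = np.where(t < 0.5,
                np.sin(2 * np.pi * 300 * t),
                np.sin(2 * np.pi * 1200 * t))

# Slice the signal into short windows and Fourier-transform each one.
# Each row of `spec` is one moment in time; each column is a frequency.
win = 256
frames = tone[:len(tone) // win * win].reshape(-1, win)
spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
freqs = np.fft.rfftfreq(win, d=1 / fs)

# The loudest frequency early on (~300 Hz) versus at the end (~1200 Hz):
print(freqs[spec[0].argmax()], freqs[spec[-1].argmax()])
```

In the spectrogram image, those argmax values are exactly where the warm colors sit.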
Now, at last, you’re ready to understand what the vocoder is and how it works. The earliest version was developed by Homer Dudley, a research physicist at Bell Laboratories in New Jersey. The name is short for “voice encoder.” Here’s a vocoder built for Kraftwerk in the 1970s:
To encode speech, the vocoder measures how much energy there is in each of a set of frequency bands, and stores the readings as a list of numbers. The more bands you measure and the narrower they are, the more accurate your encoding is going to be. This is intriguingly similar to the way your ear detects sound: your inner ear contains rows of little hair cells, each of which vibrates most sensitively at a particular frequency.
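The analysis step can be sketched in a few lines of NumPy (my own toy version; the band count and edges are arbitrary choices, and real vocoders use analog filters or perceptually spaced bands):

```python
import numpy as np

def band_energies(frame, fs, n_bands=16, fmax=4000):
    """The vocoder's analysis step: measure how much energy a short
    audio frame has in each of n_bands equal-width frequency bands."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1 / fs)
    edges = np.linspace(0, fmax, n_bands + 1)
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

# A 450 Hz tone should put nearly all its energy in band 1 (250-500 Hz).
fs = 8000
t = np.arange(512) / fs
energies = band_energies(np.sin(2 * np.pi * 450 * t), fs)
print(energies.argmax())  # 1
```

Sixteen numbers per frame instead of 512 raw samples: that's the compression.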
Before you can play speech back on the vocoder, you need a synthesizer that can play a sound with a lot of different frequencies in it. (This is called the “carrier.”) White noise works well for this purpose, since it includes all the frequencies. The vocoder filters the carrier’s frequencies according to the readings it took, and you get an intelligible facsimile of the original speech. The key thing to understand here is that the vocoder is not recording speech and playing it back; it’s synthesizing speech from scratch. Cell phone codecs are direct descendants of this idea: rather than transmitting your actual voice, they transmit a compact description of it and resynthesize the speech on the other end.
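Putting analysis and synthesis together gives you a minimal channel vocoder. This is a toy sketch of the idea (frame size, band edges, and the frame-by-frame FFT approach are my simplifications; real vocoders run banks of filters continuously):

```python
import numpy as np

def channel_vocoder(modulator, carrier, fs, n_bands=16, frame=256):
    """Toy channel vocoder: for each short frame, scale each frequency
    band of the carrier by how much energy the modulator (the speech)
    has there, then resynthesize with an inverse Fourier transform."""
    n = min(len(modulator), len(carrier)) // frame * frame
    n_bins = frame // 2 + 1
    edges = np.linspace(0, n_bins, n_bands + 1).astype(int)
    out = np.zeros(n)
    for start in range(0, n, frame):
        m = np.abs(np.fft.rfft(modulator[start:start + frame]))
        c = np.fft.rfft(carrier[start:start + frame])
        for lo, hi in zip(edges[:-1], edges[1:]):
            c[lo:hi] *= m[lo:hi].mean()   # impose the speech's envelope
        out[start:start + frame] = np.fft.irfft(c)
    return out

# Carrier: white noise (all frequencies). Modulator: a stand-in "voice".
fs = 8000
t = np.arange(8192) / fs
rng = np.random.default_rng(0)
voiced = channel_vocoder(np.sin(2 * np.pi * 200 * t),
                         rng.standard_normal(len(t)), fs)
```

The output is noise shaped like the modulator: its energy ends up concentrated in the bands where the "voice" had energy.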
Musicians very quickly realized that if you used musical sounds instead of noise as the basis for this synthesis, you could make a lot of weird and interesting stuff happen.
Here’s Herbie Hancock demonstrating the vocoder. The sound is being produced by the synth he’s playing. The synth’s sound is filtered based on readings of his voice’s frequency content.
Herbie isn’t much of a singer, but he’s one of history’s great piano and synth players. You can see why he liked the idea of being able to “sing” using his keyboard chops.
A lot of people think that Auto-Tune is a vocoder. That’s sort of true. Auto-Tune is based on the phase vocoder, which isn’t so much a “thing” as it is a computer algorithm. The phase vocoder breaks up a signal into short bursts called windows. It then does a Fourier transform on each window. Think of it as a vocoder that can change its settings every couple of milliseconds.
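The windowing step is simple to sketch. Here's a bare-bones short-time Fourier transform in NumPy (my own sketch; the window and hop sizes are arbitrary choices, and this shows only the analysis half of a phase vocoder):

```python
import numpy as np

def stft(signal, window=1024, hop=256):
    """The phase vocoder's first step: chop the signal into short,
    overlapping windows and Fourier-transform each one. The complex
    results keep both magnitude and phase, which is what lets the
    phase vocoder resynthesize (and re-pitch) the signal later."""
    w = np.hanning(window)
    frames = [signal[i:i + window] * w
              for i in range(0, len(signal) - window, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

fs = 8000
t = np.arange(fs) / fs
spectra = stft(np.sin(2 * np.pi * 440 * t))
# One fresh spectrum every 256 samples (32 ms at 8 kHz) -- the
# "settings" can change many times per second.
print(spectra.shape)
```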
Enter Andy Hildebrand, a former oil industry engineer who used the phase vocoder to inadvertently transform the sound of popular music.
Hildebrand used the Fourier transform to help Exxon figure out where oil might be, via a technique called reflection seismology. You create a big sound wave in the ground, often by blowing up a bunch of dynamite. Then you measure the sound waves that get reflected back to the surface. By analyzing the sound waves, you can deduce what kinds of rock they passed through and bounced off of.
After leaving the oil industry, Hildebrand began thinking about different ways to use his signal processing expertise for musical purposes. The pop music industry had long wanted a way to correct a singer’s pitch automatically, since doing it by hand in the studio was an extremely labor-intensive and expensive process. Hildebrand figured out how to do very fast and computationally efficient phase vocoding, enabling a computer to measure the pitch of a note and resynthesize it sharper or flatter in real time. Thus was born Auto-Tune.
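The two halves of the job are measuring the pitch and deciding where it should go. Here's a crude sketch of both (this is not Hildebrand's algorithm — Auto-Tune's actual method is proprietary and far more sophisticated — just a simple autocorrelation pitch detector plus a snap-to-semitone step, with the resynthesis omitted):

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=80, fmax=1000):
    """Estimate a note's pitch by autocorrelation: the lag at which a
    frame best matches a shifted copy of itself is one period."""
    corr = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + corr[lo:hi].argmax())

def nearest_semitone(freq):
    """Snap a frequency to the closest equal-tempered pitch -- the
    'correction' half of what Auto-Tune does."""
    midi = round(12 * np.log2(freq / 440) + 69)
    return 440 * 2 ** ((midi - 69) / 12)

fs = 44100
t = np.arange(2048) / fs
flat_note = np.sin(2 * np.pi * 430 * t)    # a slightly flat A
print(estimate_pitch(flat_note, fs))       # ~430 Hz
print(nearest_semitone(430))               # 440.0 (A4)
```

The hard part Hildebrand solved was doing the measurement and the resynthesis fast enough to run in real time.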
Auto-Tune was designed to be an invisible, behind-the-scenes tool. It has a bunch of parameters you can use to adjust the amount and speed of the pitch correction, so that you can fix bum notes without changing the timbre of the singer’s voice too much. But in 1998, while working on a Cher album, two producers named Mark Taylor and Brian Rawling discovered something. If they turned Auto-Tune all the way up, it tuned Cher’s notes way too instantly and perfectly, making her voice sound blocky and robotic. Listen to the words “But I can’t break through” to hear the Cher Effect in action.
Eventually, other producers figured out how to do the Cher Effect, and that led to the sound you hear every time you turn on the radio. If you want to try Auto-Tune yourself, it’s available in simplified form in a browser-based music app called Soundtrap.
Once you can change the pitch of your voice, there’s no reason why you can’t make copies of it and change their pitch as well, thus creating effortless artificial harmony. Hear Kanye West as a robotic choir in this song by Chance The Rapper:
Like the robotic voice effects that preceded it, Auto-Tune expresses the alienation and disembodiment of technology. This makes a lot of my fellow musicians angry. But clearly, it’s speaking to the mainstream pop audience, and why not? Our lived reality is so technologically mediated and alienated, why shouldn’t our music be too?