Speech to song with iZotope and Ableton

A while back I saw this viral video of Amber Wagner giving a motivational speech in her car. As you can tell from the video’s title, she uses extremely NSFW language.

Beyond its inspirational value, Amber’s speech is appealingly musical. I grabbed the audio and filed it away. Then during my morning commute this week, I was making a beat using samples of my kids splashing around in the bath. I tried out Amber’s speech on top and it fit well, so I pulled a non-sweary excerpt and looped it up. Here’s the result:

https://soundcloud.com/ethanhein/you-can-do-it

I processed Amber’s voice with iZotope Nectar and Ableton’s vocoder. I also filled out the harmony with bass sampled from “Haitian Fight Song” by Charles Mingus and piano from “Thelonious” by Thelonious Monk.

Here’s Amber’s speech as visualized with Melodyne.

Annotated Melodyne screencap

In order to turn this into a set of discrete pitches, I first used Nectar to auto-tune it for maximum pitch quantization. Nectar has a key detection function, and out of the various options it suggested, F-sharp Mixolydian sounded the best. Next, I used Ableton’s audio-to-MIDI function on the pitch-quantized audio. After much manual cleanup, I had a musical-sounding melody. When I went to notate it, I found that I had to simplify the rhythms to make them readable.
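
For readers who want to see the idea behind the pitch-quantization step, here is a minimal sketch. It is not Nectar's or Ableton's actual algorithm, just an illustration that assumes each detected pitch is available as a (possibly fractional) MIDI note number. The names `hz_to_midi` and `snap_to_scale` are hypothetical helpers made up for this example.

```python
import math

# Pitch classes (C = 0) of F-sharp Mixolydian: F#, G#, A#, B, C#, D#, E
F_SHARP_MIXOLYDIAN = {6, 8, 10, 11, 1, 3, 4}


def hz_to_midi(freq_hz):
    """Convert a frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def snap_to_scale(midi_pitch, scale=F_SHARP_MIXOLYDIAN):
    """Return the nearest integer MIDI note whose pitch class is in the scale.

    Searching two semitones either side of the rounded pitch is enough for a
    seven-note scale like this one, whose largest step is a whole tone.
    On an exact tie, the lower note wins.
    """
    center = round(midi_pitch)
    candidates = [n for n in range(center - 2, center + 3) if n % 12 in scale]
    return min(candidates, key=lambda n: abs(n - midi_pitch))


if __name__ == "__main__":
    # Hypothetical detected pitches from a spoken phrase (fractional MIDI notes)
    spoken = [66.3, 67.4, 70.8, 63.6]
    print([snap_to_scale(p) for p in spoken])
```

Full-strength auto-tune does something like this continuously over the audio signal; the sketch only shows the note-level decision of which scale degree each pitch lands on.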

I love this approach to creating tunes, but am not sure what to call it. I’m not composing, exactly, since Amber is the one who created the underlying tune. But I’m not just transcribing, either, because I made a number of editorial choices along the way. I guess the right word would be adapting? Whatever it is, I find it intensely satisfying, both as a process and a product.

I’m generally fascinated by the continuum from ordinary speech to heightened/poetic speech to rapping to singing. (I would put Amber’s video somewhere between heightened speech and rap.) Maybe there’s a definition here: speech is almost totally unquantized in its rhythms and pitches, while singing is almost totally quantized. Rap is mostly quantized in its rhythms and mostly not quantized in its pitches. Heightened speech is like rap with freer time.
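
One rough way to make "quantized" measurable is to ask how far each note sits from the nearest semitone and how far each onset sits from the nearest grid line. The sketch below is only an illustration of that idea, not an established metric. It assumes pitches are given as fractional MIDI note numbers and onsets in beats, and the function names and the sixteenth-note grid are assumptions made up for this example.

```python
def pitch_deviation_cents(midi_pitches):
    """Mean distance from the nearest semitone, in cents (0 = fully quantized, 50 = maximum)."""
    total = sum(abs(p - round(p)) for p in midi_pitches)
    return total / len(midi_pitches) * 100


def rhythm_deviation(onsets_beats, grid=0.25):
    """Mean distance from the nearest grid line, as a fraction of one grid step (0 to 0.5).

    The default grid of 0.25 beats is a sixteenth note when the beat is a quarter note.
    """
    total = sum(abs(t / grid - round(t / grid)) for t in onsets_beats)
    return total / len(onsets_beats)


if __name__ == "__main__":
    # Hypothetical data: a "sung" phrase sits on the semitones, a "spoken" one drifts
    sung = [66.0, 68.0, 70.0]
    spoken = [66.3, 67.6, 70.45]
    print(pitch_deviation_cents(sung), pitch_deviation_cents(spoken))
```

By this measure, singing would score near zero on both functions, rap near zero on rhythm but not on pitch, and ordinary speech high on both.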