How do you get sound in and out of a computer? There are two steps. You have to turn the sound into electricity, and then you have to turn the electricity into numbers.
Turning sound into electricity
At the physical level, a sound is a rhythmic vibration of air molecules. Your ears can detect subtle changes in the air pressure, and can make good guesses about what might be agitating the air to produce those changes. When the air pressure fluctuates in a steady sine-wave pattern, you hear a musical pitch. The faster the fluctuation, the higher the pitch. Microphones work a lot like your ears. They contain a thin diaphragm that vibrates along with the changes in air pressure, and the microphone converts that motion into a fluctuating electrical voltage.
Analog recording stores those electrical fluctuations as a continuous physical pattern that mirrors them. Vinyl records store the fluctuations in the undulating sides of the spiral groove. Magnetic tape stores them in the alignment of tiny magnetic particles in a coating on the plastic.
Turning electricity into numbers
The computer takes in fluctuating current and turns it into numbers.
The analog-to-digital converter in the computer’s sound card has a clock, much like the one that synchronizes the computer’s other activities. At each clock pulse, the converter takes a reading of the voltage on the wire and finds the closest numerical value out of a finite set of choices. The quality of digital audio depends on two factors: how many different values the converter can assign to each reading (the bit depth), and how many readings it takes per second (the sample rate). When you see a reference to sixteen-bit or twenty-four-bit audio, it means that each sample is a sixteen- or twenty-four-digit binary number, which can take 2^16 (65,536) or 2^24 (16,777,216) different values respectively. The more bits in each sample, the more accurately it captures the voltage: every extra bit doubles the number of available levels, so 16-bit samples can distinguish 256 times as many levels as 8-bit samples. More frequent sampling also produces a closer representation of the original waveform. Standard CD-quality audio is 44,100 samples per second. That sounds like an incredible speed, but CPU clocks routinely run at billions of cycles per second, tens of thousands of times faster.
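To make the bit-depth arithmetic concrete, here is a minimal Python sketch. It assumes a hypothetical converter whose input range runs from -1 to +1 volt with levels spaced evenly across it (real converters differ in range and detail); `levels`, `step_size`, `quantize`, and `to_voltage` are illustrative names, not any real driver's API. It snaps a reading to the nearest level and shows that each extra bit doubles the number of levels, so the rounding error shrinks as bits are added.

```python
# Hypothetical converter: input range -1.0 V to +1.0 V (an assumption for illustration).
V_MIN, V_MAX = -1.0, 1.0


def levels(bits):
    """Number of distinct values a sample with this many bits can take."""
    return 2 ** bits


def step_size(bits):
    """Voltage gap between adjacent levels spread evenly across the input range."""
    return (V_MAX - V_MIN) / (levels(bits) - 1)


def quantize(voltage, bits):
    """Snap a voltage to the code (0 .. 2**bits - 1) of the nearest level."""
    v = min(max(voltage, V_MIN), V_MAX)  # readings outside the range clip to the ends
    return round((v - V_MIN) / step_size(bits))


def to_voltage(code, bits):
    """The voltage that a given code stands for."""
    return V_MIN + code * step_size(bits)


for bits in (4, 8, 16, 24):
    v = 0.3
    err = abs(to_voltage(quantize(v, bits), bits) - v)
    print(f"{bits:2d}-bit: {levels(bits):>10,} levels, error reading {v} V: {err:.1e} V")

# Each extra bit doubles the number of levels: 16-bit vs 8-bit is 2**8 = 256 times as many.
print(levels(16) // levels(8))  # 256
```

The rounding error can never exceed half a step, and the step shrinks by roughly half with every added bit.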
The image shows a sine wave being digitized by a four-bit analog-to-digital converter.

The red line shows the wire’s voltage over time. The sixteen horizontal grey lines are the different voltage levels the converter can detect; it takes four bits of data to specify sixteen different values. (Four-bit audio sounds terrible but is easier to draw.) The tick marks on the horizontal axis are clock pulses. To produce sound on speakers or headphones, the process runs in reverse: a digital-to-analog converter drives the wires with the stairstep pattern, which gets smoothed out (by filtering in the output circuitry, and by your ear) into a pretty good reconstruction of the original sine wave.
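Here is a rough Python sketch of what the figure depicts, assuming a four-bit converter and a made-up clock of 16 pulses per cycle of the sine wave. The function names are hypothetical. `sample_sine` snaps the sine's value at each pulse to one of 16 codes, and `stairstep` holds each code steady until the next pulse, which produces the stairstep shape; the smoothing a real output stage applies afterward is not modeled.

```python
import math

BITS = 4                # four-bit converter, as in the image: 2**4 = 16 levels
LEVELS = 2 ** BITS
PULSES_PER_CYCLE = 16   # hypothetical: 16 clock pulses per cycle of the sine wave


def sample_sine(n_pulses):
    """At each clock pulse, read the sine's voltage (-1..1) and snap it to the nearest code 0..15."""
    codes = []
    for n in range(n_pulses):
        v = math.sin(2 * math.pi * n / PULSES_PER_CYCLE)
        codes.append(round((v + 1) / 2 * (LEVELS - 1)))
    return codes


def stairstep(codes, points_per_pulse=4):
    """Hold each code's voltage steady until the next clock pulse (a 'zero-order hold')."""
    out = []
    for c in codes:
        v = c / (LEVELS - 1) * 2 - 1  # map code 0..15 back to a voltage -1..1
        out.extend([v] * points_per_pulse)
    return out


codes = sample_sine(PULSES_PER_CYCLE)  # one full cycle
print(codes)
steps = stairstep(codes)
```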
Once you have your current stored as numbers, you can do a lot of cool stuff. Any stereo sound in a digital medium is basically a spreadsheet with two extremely long columns, one for each channel. In 16-bit audio, each number can take one of 65,536 (2^16) values; formats like CD audio and WAV store them as signed integers from -32,768 to 32,767, with zero as silence. One second of stereo CD-quality audio is two lists of 44,100 numbers each. If the values rise and fall smoothly along a sine wave that cycles four hundred forty times per second, you hear a computery beep playing concert A. If the numbers follow the pattern you get from superimposing that sine wave with another one that cycles six hundred sixty times per second, you hear two computery beeps a perfect fifth apart (660/440 = 3/2). Add in another sine wave doing eight hundred eighty cycles per second, an octave above the first, and you get the root-fifth-octave power chord beloved by rock and roll.
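The beeps and the power chord can be built in a few lines of Python. This sketch assumes signed 16-bit samples at 44,100 per second and gives each sine wave an equal share of the available range so the sum never exceeds the 16-bit limits; `tone` is a hypothetical helper name.

```python
import math

SAMPLE_RATE = 44_100  # CD quality: samples per second, per channel
MAX_16BIT = 32_767    # largest signed 16-bit sample value


def tone(freqs, seconds=1.0, rate=SAMPLE_RATE):
    """Superimpose equal-strength sine waves and scale them into signed 16-bit integers."""
    n_samples = int(seconds * rate)
    scale = MAX_16BIT / len(freqs)  # split the headroom so the sum can't clip
    samples = []
    for n in range(n_samples):
        t = n / rate
        s = sum(math.sin(2 * math.pi * f * t) for f in freqs)
        samples.append(round(scale * s))
    return samples


concert_a = tone([440])
fifth = tone([440, 660])
power_chord = tone([440, 660, 880])

# The "spreadsheet": one column per channel, here the same signal on both sides.
left, right = power_chord, list(power_chord)
print(len(left), len(right))  # 44100 44100
```

Keeping `left` and `right` as separate lists mirrors the two-column picture; most file formats interleave them instead, one left sample and then one right sample.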

All of the audio editing and processing that happens in Pro Tools and programs like it boils down to systematic mathematical operations on your lists of numbers. Auto-tune detects the pitch of the incoming sound and shifts it so that it snaps to the closest piano-key frequency. At the transistor level, Auto-tune is no different from Microsoft Excel, except that it acts a lot faster on bigger lists of numbers. Copying and pasting a repeated sound is the same procedure for the computer as copying and pasting a list of numbers or a string of text.
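Two of those operations can be sketched in Python. Auto-tune's actual algorithm is proprietary and is not shown here; this only illustrates the final snapping arithmetic, assuming the pitch has already been detected and that "piano keys" means twelve-tone equal temperament tuned to A = 440 Hz. Detecting the pitch and resampling the audio to shift it are the hard parts and are left out. The copy-paste and volume lines treat a sound as a plain Python list; `snap_to_piano_key` and `change_volume` are hypothetical names.

```python
import math

A4 = 440.0  # reference pitch: concert A


def snap_to_piano_key(freq_hz):
    """Return the equal-tempered (piano-key) frequency nearest to freq_hz."""
    semitones = 12 * math.log2(freq_hz / A4)  # distance from A4, possibly fractional
    return A4 * 2 ** (round(semitones) / 12)  # round to the nearest whole key


def change_volume(samples, factor):
    """Multiply every sample by factor, clamping to the signed 16-bit range."""
    return [max(-32_768, min(32_767, round(s * factor))) for s in samples]


print(snap_to_piano_key(450.0))            # 440.0  (a slightly sharp A snaps down)
print(round(snap_to_piano_key(460.0), 2))  # 466.16 (closer to A-sharp, so it snaps up)

clip = [0, 1000, 2000, 1000, 0, -1000]     # a tiny made-up "sound"
repeated = clip * 3                        # copy and paste it three times in a row
quieter = change_volume(clip, 0.5)         # [0, 500, 1000, 500, 0, -500]
```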
You need a fast computer with plenty of storage to do serious audio work, but we’re lucky enough to live in an era when even a garden-variety laptop can handle good-sized Pro Tools sessions. We can Auto-tune Babsy’s vocals live through her laptop at Revival Revival shows with processor power to spare.
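A little arithmetic shows why storage matters. This Python sketch counts raw uncompressed (PCM) bytes, the way a WAV file stores samples, ignoring file headers and assuming 24-bit samples are packed into three bytes each; the 24-track session is a made-up example, not a real project.

```python
def pcm_bytes(seconds, sample_rate=44_100, bits=16, channels=2):
    """Size of uncompressed PCM audio in bytes (file headers not counted)."""
    return seconds * sample_rate * (bits // 8) * channels


print(44_100 * 16 * 2)         # 1411200 bits per second for CD-quality stereo
print(pcm_bytes(60))           # 10584000 bytes: about 10.6 MB per minute
print(pcm_bytes(60, bits=24))  # 15876000 bytes at 24-bit

# A hypothetical session: 24 mono tracks, each 4 minutes long, recorded at 24-bit.
session = 24 * pcm_bytes(4 * 60, bits=24, channels=1)
print(session / 1e6, "MB")     # 762.048 MB
```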
To turn the list of numbers back into sound, you need a digital-to-analog converter, which produces an analog voltage from the stream of numbers. Any computer’s sound card has both an analog-to-digital converter and a digital-to-analog converter built in. They probably perform adequately for most purposes: cell phone calls, video game music and sound effects, playing mp3s in noisy environments. However, if you want pro-quality digital sound, the computer’s built-in hardware is probably not going to do it for you. The inside of a computer is electrically noisy, with a lot of electromagnetic activity packed into a small space, which is not the ideal setting for accurate voltage readings. For professional audio purposes, you want a specialized piece of hardware located outside the computer case. I use an Mbox 2, Digidesign’s entry-level Pro Tools-compatible audio interface.
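Before a converter can play anything, the numbers usually have to land in a file or an audio stream. As a sketch, Python's standard-library `wave` module can write the two columns of numbers to a 16-bit stereo WAV file, which any player can then hand to the sound card or an external interface for conversion back to voltage. The file name and the `write_stereo_wav` helper are hypothetical.

```python
import math
import os
import struct
import tempfile
import wave

RATE = 44_100


def write_stereo_wav(path, left, right, rate=RATE):
    """Write two equal-length lists of signed 16-bit samples as a stereo WAV file."""
    frames = bytearray()
    for left_s, right_s in zip(left, right):
        frames += struct.pack("<hh", left_s, right_s)  # little-endian 16-bit: left, then right
    with wave.open(path, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)  # 2 bytes = 16 bits per sample
        w.setframerate(rate)
        w.writeframes(bytes(frames))


# One second of concert A at half volume, identical in both channels.
a440 = [round(16_383 * math.sin(2 * math.pi * 440 * n / RATE)) for n in range(RATE)]
path = os.path.join(tempfile.gettempdir(), "a440.wav")
write_stereo_wav(path, a440, a440)
```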
The great miracle of music for me is not any particular technique or piece or performer, but just the fact that it exists at all. A single linear wave can encode all the rich complexity of all the sounds we hear. This wave is as easily translated into numbers as dollars can be translated into pizzas. Really? The complete works of Bach, Coltrane and M.I.A. can be losslessly encoded as a two-dimensional waveform? All that music is two-dimensional curves, voltage vs time, or air pressure or guitar body flexion vs time? Apparently, yes. Cool!
Our brains are stupendously adept at detecting patterns of patterns of patterns in the linear waveform of air pressure, deconstructing and comparing the component sounds that went into it. If there are multiple frequencies present simultaneously in the pattern of vibrations, we can distinguish them and, with a little training, detect the ratios between them. I feel like we’ve barely begun to scratch the surface of the artistic possibilities of mathematical operations on numerical audio data.
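A computer can do a crude version of that separation with Fourier analysis. This is a standard mathematical tool, not a claim about how the brain does it. The Python sketch below assumes a one-second clip at a made-up 8,000 samples per second and measures one frequency at a time by correlating the signal with a sine and cosine at that frequency (a single bin of the discrete Fourier transform). `mix` and `amplitude_at` are hypothetical names.

```python
import math

RATE = 8_000  # hypothetical low sample rate to keep the loops quick


def mix(freqs, rate=RATE):
    """One second of sine waves, each with amplitude 0.5, added together."""
    return [sum(0.5 * math.sin(2 * math.pi * f * n / rate) for f in freqs)
            for n in range(rate)]


def amplitude_at(samples, freq, rate=RATE):
    """Correlate with a cosine and sine at freq (one DFT bin) and return that component's amplitude."""
    re = sum(s * math.cos(2 * math.pi * freq * n / rate) for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq * n / rate) for n, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)


signal = mix([440, 660])  # the two computery beeps a fifth apart
for f in (440, 500, 660, 880):
    print(f, round(amplitude_at(signal, f), 3))
```

Run on the two-beep mix, 440 and 660 Hz each come out near 0.5, the amplitude each wave was given, while 500 and 880 Hz come out near zero. The ratio of the two detected frequencies, 660/440 = 1.5, is the perfect fifth.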