Digital Audio: Theory and Reality
Nigel Redmon
Recommended reading:

Principles of Digital Audio, Pohlmann (Sams). Not heavy on hard-core theory, but it's easy to follow and describes things you use (how CD players work, etc.).

Advanced Digital Audio, Pohlmann (Editor), 1991 (Sams). Topics by several authors (including Pohlmann) that dig a little deeper than Principles of Digital Audio.

Musical Applications of Microprocessors, Chamberlin, 1980 (Hayden). A classic. This book covers many aspects of electronic music instruments, including microprocessor control of synthesizers, and making and processing music computationally, including practical development of digital synthesizers. It predates MIDI and practical commercial digital synthesizers, but it's great for covering many practical aspects of analog and digital audio and instruments. There's a second edition out, but I don't know what was added for it.
Oversampling

In the real world, we play samples back through a digital-to-analog converter and low-pass filter, holding the current sample level until the next. This causes a frequency droop and loss of highs--impulses carry more high-frequency energy than stairsteps do. The solution is not to produce impulses--which are impossible to produce perfectly--but to simply adjust the frequency response with filtering. Fortunately, it's trivial to add this adjustment to an oversampling filter.
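For the curious, the droop of the stairstep (zero-order hold) output is a standard result: the response at frequency f, for output sample rate fs, is sin(pi*f/fs) / (pi*f/fs). At 20 kHz with a 44.1 kHz output rate, that works out to about -3.2 dB; at a 4x oversampled rate of 176.4 kHz, the same 20 kHz droops only about -0.2 dB. That's part of why the correction is so easy to fold into an oversampling filter.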
In this discussion, "oversampling" means oversampling on output--at the digital to analog conversion stage. There is also a technique for oversampling at the input (analog to digital) stage, but it is not nearly as interesting, and in fact is unrelated to oversampling as discussed here.
In practice, we do exactly this, following it with a phase linear digital "FIR" (finite-impulse response) filter, and a gentle and simple (and cheap) analog low-pass filter. If you buy the fact that giving ourselves more room to weed out the reflections--the alias components--solves our problems, then the only part that needs some serious explaining is...
Interpolating filters
There is more than one way to make a digital low-pass filter that will do the job. We have two basic classes of filters to choose from. One is called an IIR (infinite impulse response) filter, which is based on feedback and is similar in principle to an analog low-pass filter. This type of filter can be very easy to construct and computationally inexpensive (few multiply-adds per sample), but has the drawback of phase shift. This is not a fatal flaw--analog filters have the same problem--but the other type of digital filter avoids the phase shift problem. (IIR filters can be made with zero relative phase shift, but it greatly increases complexity.)

FIR filters are phase linear, and it's relatively easy to create any response. (In fact, you can create an FIR filter that has a response equal to a huge cathedral for impressive and accurate reverb.) The drawback (starting to get the idea that everything has a trade-off?) is that the more complex the response (steep cut-off slope, for instance), the more computation required by the filter. (And yes, unfortunately our "cathedral" would require an enormous number of computations, and in fact digital reverbs of today don't work this way.) Fortunately, we need only a gentle cut-off slope, and an FIR will handle that easily.

An FIR is a simple structure--basically a tapped delay line, where the taps are multiplied by coefficients and summed for the output. The two variables are the number of taps and the values of the coefficients. The number of taps is a compromise between the number of coefficients we need to produce the desired result, and the number we can tolerate (since each coefficient requires a multiplication and addition). How do we know what numbers to use to yield the desired result? Conveniently, the coefficients are equivalent to the impulse response of the filter we're trying to emulate. So, we need to fill the coefficients with the impulse response of a low-pass filter. The impulse response of an ideal low-pass filter is described by sin(x)/x. If you plot this function, you'll see that it's basically a sine wave that has full amplitude at time 0 and decays in both directions as it extends toward positive and negative infinity.
If you've been following closely, you'll notice that we have a problem. The number of computations for an FIR filter is proportional to the number of coefficients, and here we have a function for the coefficients that is infinite. This is where the "compromise" part comes in. If we truncate the series symmetrically around time zero--simply throwing away "extra" coefficients beyond some point--we still get a low-pass filter, though not one with a perfect cut-off slope (and with some ripple in the "stop band"). After all, the sin(x)/x function emulates a perfect low-pass filter--a brick wall. Fortunately, we don't need a perfect one, and our budget version will do. We also use some math tricks--artificially tapering the response off, even quickly, gives much better results than simply truncating. This technique is called "windowing", or multiplying by a window function.

As a bonus, we can take advantage of the FIR to fix some other minor problems with the signal. For instance, Nyquist promised perfect reconstruction in an ideal mathematical world, not in our more practical electronic circuits. Besides the lack of an ideal low-pass filter, which has been covered here, there's the fact that we're working with a stair-step shaped output before the filter--not an ideal series of impulses. This gives a little frequency droop--a gentle roll-off. We can simply superimpose a complementary response on the coefficients and fix the droop for "free". While we're at it, we can use the additional bits gained from the multiplies to help in noise shaping--moving some of the in-band noise up to frequencies that will ultimately be filtered out by the low-pass filter, and to frequencies the ear is less sensitive to. More cool math tricks to give us better sound!
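Pulling the last few ideas together, here's a minimal windowed-sinc design sketch in C. It assumes a 4x oversampling filter (cutoff at 1/8 of the new, higher rate) and a Hamming window; the tap count and cutoff are illustrative, not a recipe:

    #include <math.h>
    #include <stdio.h>

    #define TAPS   63       /* odd length keeps the filter symmetric (phase linear) */
    #define CUTOFF 0.125    /* cutoff as a fraction of the new (4x) sample rate     */

    int main(void) {
        const double PI = acos(-1.0);
        double h[TAPS];
        int mid = TAPS / 2;
        for (int i = 0; i < TAPS; i++) {
            int n = i - mid;
            /* ideal low-pass impulse response: 2*fc*sin(x)/x, with x = 2*pi*fc*n */
            double x = 2.0 * PI * CUTOFF * n;
            double sinc = (n == 0) ? 1.0 : sin(x) / x;
            /* Hamming window: tapering beats plain truncation */
            double w = 0.54 - 0.46 * cos(2.0 * PI * i / (TAPS - 1));
            h[i] = 2.0 * CUTOFF * sinc * w;
        }
        for (int i = 0; i < TAPS; i++)
            printf("h[%2d] = % .6f\n", i, h[i]);
        return 0;
    }

Running the input through these coefficients--the tapped delay line, multiply, and sum described above--is the whole filter; the droop correction would simply be folded into the same coefficients.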
What is dither?
To dither means to add noise to our audio signal. Yes, we add noise on purpose, and it is a good thing.
The problem
The problem results from something Nyquist didn't mention about a real-world implementation--the shortcoming of using a fixed number of bits (16, for instance) to accurately represent our sample points. The technical term for this is "finite wordlength effects". At first blush, 16 bits sounds pretty good--96 dB dynamic range, we're told. And it is pretty good--if you use all of it all of the time. We can't. We don't listen to full-amplitude ("full code") sine waves, for instance. If you adjust the recording to allow for peaks that hit the full sixteen bits, that means much of the music is recorded at a much lower volume--using fewer bits. In fact, if you think about the quietest sine wave you can play back this way, you'll realize it's one bit in amplitude--and therefore plays back as a square wave. Yikes! Talk about distortion. It's easy to see that the lower the signal levels, the higher the relative distortion. Equally disturbing, components smaller than the level of one bit simply won't be recorded at all. This is where dither comes in. If we add a little noise to the recording process... well, first, an analogy...
An analogy
Try this experiment yourself, right now. Spread your fingers and hold them up a few inches in front of one eye, and close the other. Try to read this text. Your fingers will certainly block portions of the text (the smaller the text, the more you'll be missing), making reading difficult. Wag your hand back and forth (to and fro!) quickly. You'll be able to read all of the text easily. There'll be the blur (really more of a strobe effect, due to the scanning of the monitor) of your hand in front of the text, but definitely an improvement over what we had before. The blur is analogous to the noise we add in dithering. We trade off a little added noise for a much better picture of what's underneath.
Back to audio
For audio, dithering is done by adding noise of a level less than the least-significant bit before rounding to 16 bits. The added noise has the effect of spreading the many short-term errors across the audio spectrum as broadband noise. We can make small improvements to this dithering algorithm (such as shaping the noise to areas where it's less objectionable), but the process remains simply one of adding the minimal amount of noise necessary to do the job.
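As a concrete sketch in C (the function name and scaling are illustrative, not any particular product's algorithm), here's word-length reduction to 16 bits with TPDF dither--the common choice of two uniform random values summed, giving a triangular distribution spanning about one LSB either side of the sample before rounding:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Reduce a sample in [-1.0, 1.0) to 16 bits with TPDF dither. */
    static short quantize16(double sample) {
        double lsb = 1.0 / 32768.0;
        double r1 = (double)rand() / RAND_MAX - 0.5;
        double r2 = (double)rand() / RAND_MAX - 0.5;   /* r1 + r2: triangular PDF */
        double v = (sample + (r1 + r2) * lsb) * 32768.0;
        double rounded = floor(v + 0.5);
        if (rounded >  32767.0) rounded =  32767.0;    /* clip to 16-bit range */
        if (rounded < -32768.0) rounded = -32768.0;
        return (short)rounded;
    }

    int main(void) {
        /* quantize a very quiet sine and show the dithered codes */
        const double PI = acos(-1.0);
        for (int i = 0; i < 10; i++)
            printf("%d\n", quantize16(0.0001 * sin(2.0 * PI * i / 10.0)));
        return 0;
    }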
An added bonus
Besides reducing the distortion of the low-level components, dither lets us hear components below the level of our least-significant bit! How? By jiggling a signal that's not large enough to cause a bit transition on its own, the added noise pushes it over the transition point for an amount of time statistically proportional to its actual amplitude level. Our ears and brain, skilled at separating such a signal from the background noise, do the rest. Just as we can follow a conversation in a much louder room, we can pull the weak signal out of the noise. Going back to our hand-waving analogy, you can demonstrate this principle for yourself. View a large text character (or an object around you) by looking through a gap between your fingers. Close the gap so that you can see only a portion of the character in any one position. Now jiggle your hand back and forth. Even though you can't see the entire character at any one instant, your brain will average and assemble the different views to put the character together. It may look fuzzy, but you can easily discern it.
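You can see the same effect numerically. This sketch (illustrative parameters; amplitudes are in units of one LSB) quantizes a sine wave of 0.4 LSB amplitude with and without TPDF dither, then correlates each result with the original sine. Without dither, nothing survives the rounding; with dither, the correlation recovers roughly the 0.4 LSB that's really there:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void) {
        const double PI = acos(-1.0);
        const int N = 100000;                 /* 1000 whole cycles */
        double corrPlain = 0.0, corrDith = 0.0;
        for (int i = 0; i < N; i++) {
            double s = 0.4 * sin(2.0 * PI * i / 100.0);   /* 0.4 LSB sine */
            double plain = floor(s + 0.5);                /* no dither: always 0 */
            double r = ((double)rand() / RAND_MAX - 0.5)
                     + ((double)rand() / RAND_MAX - 0.5); /* TPDF, +/-1 LSB */
            double dith = floor(s + r + 0.5);
            corrPlain += plain * sin(2.0 * PI * i / 100.0);
            corrDith  += dith  * sin(2.0 * PI * i / 100.0);
        }
        /* (2/N) * sum(x * sin) estimates the sine amplitude in x */
        printf("recovered without dither: %f LSB\n", 2.0 * corrPlain / N);
        printf("recovered with dither:    %f LSB\n", 2.0 * corrDith / N);
        return 0;
    }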
Dither is needed again whenever we reduce word length or change a signal's level--when we adjust the volume of a track, or mix multiple tracks together (which generally has an implied level scaling built in). And any form of filtering uses multiplication and requires dithering afterwards. The process of normalizing--adjusting a sound file's level so that its peaks are at full level--is also a gain change and requires dithering. In fact, some people normalize a signal after every digital edit they make, mistakenly thinking they are maximizing the signal-to-noise ratio. In fact, they are doing nothing except increasing noise and distortion, since the noise level is "normalized" along with the signal, and the signal has to be redithered or suffer more distortion. Don't normalize until you're done processing and wish to adjust the level to full code.

Your digital audio editing software should know this and dither automatically when appropriate. One caveat is that dithering does require some computational power itself, so the software is more likely to take shortcuts when doing "real-time" processing as compared to processing a file in a non-real-time manner. So, an application that presents you with a live on-screen mixer with live effects for real-time control of digital track mixdown is likely to skimp in this area, whereas an application that must complete its process before you can hear the result doesn't need to.
The jitters
When samples are not output at their correct time relative to other samples, we have clock jitter and the associated distortion it causes. Fortunately, the current state of the art is very good for stable clocking, so this is not a problem for CD players and other digital audio units. And since the output from the recording media (CD, or DAT, for instance) is buffered and servo-controlled, transport variations are completely isolated from the digital audio output clocking.
What is aliasing?
It's easiest to describe aliasing in terms of a visual sampling system we all know and love--movies. If you've ever watched a western and seen the wheel of a rolling wagon appear to be going backwards, you've witnessed aliasing. The movie's frame rate isn't adequate to describe the rotational frequency of the wheel, and our eyes are deceived by the misinformation!

The Nyquist Theorem tells us that we can successfully sample and play back frequency components up to one-half the sampling frequency. Aliasing is what happens when we try to record and play back frequencies higher than one-half the sampling rate. Consider a digital audio system with a sample rate of 48 kHz, recording a steadily rising sine wave tone. At lower frequencies, the tone is sampled with many points per cycle. As the tone rises in frequency, the cycles get shorter and fewer and fewer points are available to describe each one. At a frequency of 24 kHz, only two sample points are available per cycle, and we are at the limit of what Nyquist says we can do. Still, those two points are adequate, in a theoretical world, to recreate the tone after conversion back to analog and low-pass filtering.

But if the tone continues to rise, the number of samples per cycle is no longer adequate to describe the waveform, and the inadequate description is equivalent to one describing a lower-frequency tone--this is aliasing. In fact, the tone seems to reflect around the 24 kHz point. A 25 kHz tone becomes indistinguishable from a 23 kHz tone. A 30 kHz tone becomes an 18 kHz tone. In music, with its many frequencies and harmonics, aliased components mix with the real frequencies to yield a particularly obnoxious form of distortion. And there's no way to undo the damage. That's why we take steps to avoid aliasing from the beginning.
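A quick numerical sketch of that reflection, using the same numbers as the example above: the samples of a 30 kHz cosine taken at 48 kHz are identical, point for point, to those of an 18 kHz cosine.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double PI = acos(-1.0);
        const double fs = 48000.0;
        for (int n = 0; n < 8; n++) {
            double t = n / fs;
            /* the two columns print the same values: 30 kHz has aliased to 18 kHz */
            printf("n=%d  30 kHz: % .6f   18 kHz: % .6f\n",
                   n, cos(2.0 * PI * 30000.0 * t),
                      cos(2.0 * PI * 18000.0 * t));
        }
        return 0;
    }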
A question of phase
If you've paid attention for long enough, you've seen heated debate in online forums and letters to the editor in magazines. One side will claim that it has been proven that people can't hear the effects of phase errors in music, and the other is just as adamant that the opposite is true. Much of the confusion about phase lies with the fact that there are several facets to this issue. Narrow arguments on the subject can be much like the story of the blind men and the elephant--one believes that the animal is snake-like, while another insists that it's more like a wall. Both sides may be right, as far as their knowledge allows, but both are equally wrong because they're hampered by a limited understanding of the subject.
What is phase?
Phase is a frequency-dependent time delay. If all frequencies in a sound wave (music, for instance) are delayed by the same amount as they pass through a device, we call that device "phase linear". A digital delay has this characteristic--it simply delays the sound as a whole, without altering the relationships of frequencies to each other. The human ear is insensitive to this kind of overall delay, as long as the delay is constant and we don't have another signal to reference it to. The audio from a CD player is always delayed due to processing, for instance, but this has no effect on our listening enjoyment.
Relative phase
Now, even if the phase is linear (simply an overall delay), we can easily detect a phase shift if we have a reference. For instance, if you connect one of your stereo speakers up backwards, the two speakers will be 180 degrees out of phase and the signals will cancel in the air (particularly at low frequencies, where the distance between the speakers has less effect). Another obvious case is when we have a direct reference to compare to. When you delay music and mix it with the undelayed version, for instance, it's easy to hear the effect; short delays cause frequency-dependent cancellation between the two signals, while longer delays result in an obvious echo.
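That frequency-dependent cancellation is easy to predict: the delayed copy is 180 degrees out of phase wherever the delay spans an odd number of half-cycles. A tiny sketch, assuming an illustrative 1 ms delay:

    #include <stdio.h>

    int main(void) {
        double delay = 0.001;                  /* delay in seconds */
        for (int k = 0; k < 5; k++) {
            /* cancellations at odd multiples of 1/(2*delay): 500, 1500, 2500 Hz... */
            double notch = (2 * k + 1) / (2.0 * delay);
            printf("notch at %.0f Hz\n", notch);
        }
        return 0;
    }

That evenly spaced series of notches is why the effect is called comb filtering.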
"group of frequencies", I mean that it's typically not a signal frequency that's shifted, or unrelated frequencies; phase shift typically "smears" an area of the music spectrum. Back to the question: Does it seem likely that we could hear the difference between an audio signal and the same signal with altered phase? The answer is... No... and ultimately Yes. No: The human ear is insensitive to a constant relative phase change in a static waveform. For instance, you cannot here the difference between a steady sawtooth wave (which contains all harmonic frequencies) and a waveform that contains the same harmonic content but with the phase of the harmonics delayed by various (but constant) amounts. The second waveform would not look like a sawtooth on an oscilloscope, but you would not be able to hear the difference. And this is true no matter how ridiculous you get with the phase shifting. Yes: Dynamically changing waveforms are a different matter. In particular, it's not only reasonable, but easy to demonstrate (at least under artificially produced conditions) that musical transients (pluck, ding, tap) can be severely damaged by phase shift. Many frequencies of short duration combine to produce a transient, and phase shift smears their time relationship. turning a "tock!" into a "thwock!". Because music is a dynamic waveform, the answer has to be "yes"--phase shift can indeed affect the sound. The second part is "how much?" Certainly, that is a tougher question. It depends on the degree or phase error, the area of the spectrum it occupies, and the music itself. Clearly we can tolerate phase shift to a degree. All forms of analog equalization--such as on mixing consoles--impart significant phase shift. It's probably wise, though, to minimize phase shift where we can.
The FFT
Background
From Fourier we know that periodic waveforms can be modeled as the sum of harmonically related sine waves. The Fourier Transform aims to decompose a cycle of an arbitrary waveform into its sine components; the Inverse Fourier Transform goes the other way--it converts a series of sine components into the resulting waveform. These are often referred to as the "forward" (time domain to frequency domain) and "inverse" (frequency domain to time domain) transforms. For most people, the forward transform is the baffling part--it's easy enough to comprehend the idea of the inverse transform (just generate the sine waves and add them). So, we'll discuss the forward transform; however, it's interesting to note that the inverse transform is identical to the forward transform (except for scaling, depending on the implementation). You can essentially run the transform twice to convert from one form to the other and back!
The FFT
Suppose we probe a target wave by multiplying it, point for point, by a unit-amplitude sine wave of the same frequency and phase, and look at the result. See that the result wave's peak is the same as that of the target we are testing, and its average value is half that. Here's what happens when they don't match:
In the second example, the average of the result is zero, indicating no match. The best part is that the target need not be a sine wave. If the probe matches a sine component in the target, the result's average will be non-zero, and half the component's amplitude.
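Here's the idea in miniature--a C sketch (the record length and target are illustrative) that probes a target built from a 3rd harmonic of amplitude 0.5 and a 5th harmonic of amplitude 0.25 with unit-amplitude sine probes. The matching probes average to half the component amplitudes; the rest average to zero:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double PI = acos(-1.0);
        const int N = 1024;                     /* samples in the record */
        for (int probe = 1; probe <= 6; probe++) {
            double avg = 0.0;
            for (int i = 0; i < N; i++) {
                double ph = 2.0 * PI * i / N;
                double target = 0.5 * sin(3.0 * ph) + 0.25 * sin(5.0 * ph);
                avg += target * sin(probe * ph);  /* multiply by the probe... */
            }
            avg /= N;                             /* ...and average */
            /* expect 0.25 at probe 3, 0.125 at probe 5, zero elsewhere */
            printf("probe %d: average %.4f\n", probe, avg);
        }
        return 0;
    }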
In phase
The reason this works is that multiplying a sine wave by another sine wave is balanced modulation, which yields the sum and difference frequency sine waves. Any sine wave averaged over an integral number of cycles is zero. Since the Fourier transform looks for components that are whole-number multiples of the waveform section it is analyzing, and that section is also presumed to be a single cycle, the sum and difference results always span an integral number of cycles over the period. The only case where the results of the modulation don't average to zero is when the two sine waves are the same frequency. In that case the difference is 0 Hz, or DC (though DC stands for Direct Current, the term is often used to describe steady-state offsets in any kind of waveform). Further, when the two waves are identical in phase, the DC value is half the product of the two waves' amplitudes. If the phases differ, the DC value is proportional to the cosine of the phase difference. That is, the value drops following the cosine curve, and is zero at pi/2 radians, where the cosine is zero.

So this sine measurement doesn't work well if the probe phase is not the same as the target phase. At first it might seem that we need to probe at many phases and take the best match; this would result in the ESFT--the Extremely Slow Fourier Transform. However, if we take a second measurement, this time with a cosine wave as a probe, we get a similar result, except that the cosine measurement is exactly in phase where the sine measurement is at its worst. And when the target phase lies between the sine and cosine phase, both measurements get a partial match. Using the identity
sin^2(theta) + cos^2(theta) = 1

for any theta, we can calculate the exact phase and amplitude of the target component from the sine and cosine probes. This is it! Instead of probing the target with all possible phases, we need only probe with two. This is the basis for the DFT.
Getting complex
By tradition, the sine and cosine probe results are represented by a single complex number, where the cosine component is the real part and the sine component is the imaginary part. There are two good reasons to do it this way: the relationship of cosine and sine follows the same mathematical rules as do complex numbers (for instance, you add two complex numbers by summing their real and imaginary parts separately, as you would with sine and cosine components), and it allows us to write simpler equations. So, we refer to the resulting average of the cosine probe as the real part (Re), and the sine component as the imaginary part (Im), where a complex number is represented as Re + i*Im. To find the magnitude (which we have called "amplitude" until now--magnitude is the same as amplitude when we are only interested in a positive value--the absolute value):

magnitude = sqrt(Re^2 + Im^2)

In the way we've presented the math here, this is the magnitude of the average, so again we'd have to multiply that value by two to get the peak amplitude of the component we're testing for.
The phase follows from the ratio of the two parts: phase = atan(Re/Im). You might notice that Im can be zero, which would lead to a divide-by-zero error on your computer. In that case, notice that the result of the division becomes very large for non-zero Re as Im approaches zero, and the arctangent of very large numbers approaches pi/2. This would tell us that the target component is approaching an exact match with the cosine phase, which we already know to be true with a near-zero imaginary part.
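In code, the two-argument arctangent sidesteps the special case entirely. A C sketch (illustrative record length; the target is a unit-amplitude sine leading by 30 degrees, so we should recover amplitude 1 and phase 30):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double PI = acos(-1.0);
        const int N = 1024;
        double re = 0.0, im = 0.0;
        double phase = PI / 6.0;                 /* target leads by 30 degrees */
        for (int i = 0; i < N; i++) {
            double x = 2.0 * PI * i / N;
            double target = sin(x + phase);
            im += target * sin(x) / N;           /* sine probe   */
            re += target * cos(x) / N;           /* cosine probe */
        }
        /* magnitude of the average; times 2 for the component's peak amplitude */
        printf("amplitude: %.4f\n", 2.0 * sqrt(re * re + im * im));
        /* atan2 handles im == 0 gracefully, unlike atan(re / im) */
        printf("phase: %.2f degrees\n", atan2(re, im) * 180.0 / PI);
        return 0;
    }

Note that atan2(re, im) returns pi/2 when im is zero and re is positive--exactly the cosine-phase answer reasoned out above.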
Making it "F"
Viewing the DFT in this way, it's easy to see where the algorithm can be optimized. First, note that all of the sine probes are zero at the start and in the middle of the record--no need to perform operations for those. Further, all the even-numbered sine probes cross zero at one-fourth increments through the record, every fourth probe at one-eighth, and so on. Note the powers of two in this pattern. The FFT works by requiring a power-of-two length for the transform, and splitting the process into cascading groups of two (that's why it's sometimes called a radix-2 FFT). Similarly, there are patterns for when the sine and cosine are at 1.0, and multiplication is not needed. By exploiting these redundancies, the savings of the FFT over the DFT are huge. While the DFT needs N^2 basic operations, the FFT needs only N*log2(N). For a 1024-point FFT, that's 10,240 operations, compared to 1,048,576 for the DFT.

Let's take a look at the kinds of symmetry exploited by the FFT. Here's an example showing even harmonics crossing at zero at integer multiples of pi/2 on the horizontal axis.
Here we see that every fourth harmonic meets at 0, 1, 0, and -1 at integer multiples of pi/2.
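To make the divide-and-conquer concrete, here's a minimal recursive radix-2 FFT in C--a sketch for study, not a tuned implementation (real FFTs are usually iterative and in-place):

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    /* Recursive radix-2 decimation-in-time FFT; n must be a power of two. */
    static void fft(double complex *x, int n) {
        if (n < 2) return;
        double complex even[n / 2], odd[n / 2];   /* C99 VLAs, fine for a demo */
        for (int i = 0; i < n / 2; i++) {
            even[i] = x[2 * i];
            odd[i]  = x[2 * i + 1];
        }
        fft(even, n / 2);                  /* transform each half...         */
        fft(odd,  n / 2);
        for (int k = 0; k < n / 2; k++) {  /* ...and combine with "twiddles" */
            double complex w = cexp(-2.0 * acos(-1.0) * I * k / n) * odd[k];
            x[k]         = even[k] + w;
            x[k + n / 2] = even[k] - w;
        }
    }

    int main(void) {
        enum { N = 16 };
        const double PI = acos(-1.0);
        double complex x[N];
        /* a DC offset plus one cycle of a sine: expect bins 0, 1, and N-1 */
        for (int i = 0; i < N; i++)
            x[i] = 0.5 + sin(2.0 * PI * i / N);
        fft(x, N);
        for (int k = 0; k < N; k++)
            printf("bin %2d: magnitude %6.3f\n", k, cabs(x[k]));
        return 0;
    }

Each halving step costs on the order of N operations, and there are log2(N) of them--that's where the N*log2(N) figure above comes from.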
An Introduction to Digital Reverb

One way to capture a room's acoustics is to record its response to an impulse--a handclap, say. The reason this works is that an impulse is, in its ideal form, an instantaneous sound that carries equal energy at all frequencies. What comes back, in the form of reverberation, is the room's response to that instantaneous, all-frequency burst.
In the real world, the handclap--or a popping balloon, an exploding firecracker, or the snap of an electric arc--serves as the impulse. If you digitize the resulting room response and look at it in a sound-editing program, it looks like decaying noise. After some density build-up at the beginning, it decays smoothly toward zero. In fact, smoother-sounding rooms show a smoother decay.

In the digital domain, it's easy to realize that each sample point of the response can be viewed as a discrete echo of the impulse. Since, ideally, the impulse is a single non-zero sample, it's not a stretch to realize that a series of samples--a sound played in the room--would produce the sum of the responses to each individual sample at their respective times (this is called superposition). In other words, if we have a digitized impulse response, we can easily add that exact room characteristic to any digitized dry sound. Multiplying each point of the impulse response by the amplitude of a sample yields the room's response to that sample; we simply do that for each sample of the sound that we want to "place" into that room. This yields a bunch--as many as we have samples--of overlapping responses that we simply add together.

Easy--but extremely expensive computationally. Each sample of the input is multiplied individually by each sample of the impulse response, and added to the mix. If we have n samples to process, and the impulse response is m samples long, we need to perform n times m multiplications and additions. So, if the impulse response is three seconds (a big room), and we need to process one minute of music, we need to do about 350 billion multiplications and the same number of additions (assuming a 44.1 kHz sampling rate). This may be acceptable if you want to let your computer crunch the numbers for a day before you can hear the result, but it's clearly not usable for real-time effects. Too bad, because it's promising in several respects. In particular, you can accurately mimic any room in the world if you have its impulse response, and you can easily generate your own artificial impulse responses to invent your own "rooms" (for instance, a simple decaying noise sequence gives a smooth reverb, though not one with much personality).

Actually, there's a way to handle this more practically. We've been talking about time-domain processing here, and this process of combining the two sampled signals is called "convolution." While convolution in the time domain requires many operations, the equivalent in the frequency domain requires drastically reduced computation (convolution in the time domain is equivalent to multiplication in the frequency domain). I won't elaborate here, but you can check out Bill Gardner's article, "Efficient Convolution Without Input/Output Delay", for a promising approach. (I haven't tried his technique, but I hope to give it a shot when I have time.)
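For reference, direct time-domain convolution is only a few lines of C--the cost is in the loop counts, not the complexity. A sketch with illustrative toy lengths:

    #include <stdio.h>

    /* y must hold n + m - 1 samples, zeroed by the caller */
    static void convolve(const double *x, int n,
                         const double *h, int m, double *y) {
        for (int i = 0; i < n; i++)         /* for each input sample...   */
            for (int j = 0; j < m; j++)     /* ...add its scaled response */
                y[i + j] += x[i] * h[j];
    }

    int main(void) {
        double x[4] = { 1.0, 0.5, 0.0, -0.5 };    /* dry signal       */
        double h[3] = { 1.0, 0.6, 0.3 };          /* impulse response */
        double y[4 + 3 - 1] = { 0.0 };
        convolve(x, 4, h, 3, y);
        for (int i = 0; i < 6; i++)
            printf("y[%d] = % .2f\n", i, y[i]);
        return 0;
    }

The nested loops make the n-times-m cost plain to see.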
A simple recirculating delay--feeding a delay line's attenuated output back into its input--gives us a series of decaying echoes, but also a comb-shaped frequency response. By feeding forward (inverted) as well as back, we fill in the frequency cancellations, making the system an all-pass filter. All-pass filters give us the echoes as before, but a smoother frequency response. They have the effect of frequency-dependent delay, smearing the harmonics of the input signal and getting closer to a true reverb sound. Combinations of these comb and all-pass recirculating delays--in series, parallel, and even nested--and other elements (such as filtering in the feedback path to simulate high-frequency absorption) result in the final product.
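Here's one such recirculating all-pass section as a C sketch (the delay length and gain are illustrative; real reverbs combine several sections, typically with mutually prime delay lengths):

    #include <stdio.h>

    #define DELAY 113               /* samples; prime lengths avoid coincident echoes */

    static double buf[DELAY];
    static int idx = 0;

    /* one all-pass section: feedback plus inverted feedforward around a delay */
    static double allpass(double in, double g) {
        double delayed = buf[idx];
        double v = in + g * delayed;     /* feedback into the delay line */
        buf[idx] = v;
        idx = (idx + 1) % DELAY;
        return delayed - g * v;          /* inverted feedforward output  */
    }

    int main(void) {
        /* feed an impulse through and print the decaying train of echoes */
        for (int n = 0; n < 4 * DELAY; n++) {
            double out = allpass(n == 0 ? 1.0 : 0.0, 0.7);
            if (out > 0.001 || out < -0.001)
                printf("n=%3d  out=% .4f\n", n, out);
        }
        return 0;
    }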
I'll stop here, because there are many readily available texts on the subject and this is just an introduction. Personally, I found enough information for my own experiments in "Musical Applications of Microprocessors" by Hal Chamberlin, and Bill Gardner's works on the subject, available here on the web.