Digital Audio Basics
(first published in Line Up magazine)
When using "off the shelf" equipment it can be tempting to ignore what is happening inside the box. Even just a little understanding of the processes underneath the lid can help to achieve better results.
Many standards, including audio, develop for reasons that with hindsight can seem less than ideal. An early digital audio system was that used for "Sound in Syncs" distribution systems which chose twice line frequency (31.25 kHz) as a sampling rate. 32 kHz was later adopted and provided a useful bandwidth of 15 kHz, half the sampling rate being the theoretical bandwidth limit predicted by Nyquist.
To operate up to this limit requires square edged "perfect" brick wall filters, which do not exist in the analogue domain, and the very sharpest filters that do exist tend to have unpleasant side effects. The "perfect" filter could exist in the digital domain - but that first requires the signal to be converted from analogue. Techniques that exploit this will be described later! Similar issues also arise when converting from digital back to analogue.
44.1 kHz sampling was chosen for early digital recording systems and derived from the fact that the recording medium was video tape. The sample frequency therefore had to relate to frequencies present in the video system and both 625 and 525 line systems were likely to be used. The rate chosen comes about because in 625 line 25 Hz (PAL) systems the line frequency is 15.625 kHz and 588 out of the 625 lines are "active" for carrying video information. If three samples are recorded per line the sample rate is 15.625 x 588/625 x 3=44.1 kHz.
In 525 line 30 Hz systems the line frequency is 15.75 kHz and 490 out of the 525 lines are "active" for carrying video information. If three samples are recorded per line the sample rate is 15.75 x 490/525 x 3=44.1 kHz. Some converters were also made for use with NTSC recorders where the line frequency is 15.734 kHz giving a sample rate of only 44.056 kHz.
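As a quick check, the arithmetic above can be reproduced in a few lines (a sketch, with illustrative function and variable names, not anything from a standard):

```python
# Sketch: sample rates derived from video line structures, as described above.

def sample_rate(line_freq_hz, active_lines, total_lines, samples_per_line=3):
    """Samples stored per second = line rate x fraction of active lines x samples per line."""
    return line_freq_hz * active_lines / total_lines * samples_per_line

pal = sample_rate(15_625, 588, 625)          # 625 line 25 Hz (PAL) system
us_mono = sample_rate(15_750, 490, 525)      # 525 line 30 Hz system
ntsc = sample_rate(15_734, 490, 525)         # NTSC colour line rate

print(pal, us_mono, round(ntsc, 1))  # both video standards land on exactly 44100
```

Both the 625 and 525 line figures come out at exactly 44,100 samples per second, which is why the rate suited dual-standard video tape recording.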
32 kHz is inadequate for the full audio band width often required, but 44.1/32 is not a convenient conversion ratio. 48 kHz was therefore adopted for high grade professional audio.
The sampling process generates side bands equal to the bandwidth of the audio signal on either side of the sample frequency. The illustration shows the frequency bands created when a 20 kHz audio band is sampled at 48 kHz.
A further set of frequencies is created at double the sample frequency, and also at three times and so on. These upper bands are not required to allow the original signal to be decoded, so they can be rejected with a low pass filter.
If an audio signal of 25 kHz is applied to the converter, the lower side band would extend to 48-25 kHz i.e. 23 kHz and the original signal and the converted signal now begin to overlap. This is called aliasing and is illustrated below.
In practice the filtering begins at 20 kHz so that any signal of 24 kHz or more is sufficiently attenuated to avoid the worst effects. To illustrate a nasty case, consider a sine wave of 18 kHz being sampled at 32 kHz. This will create signals of 32+18 kHz (50 kHz) and also 32-18 kHz (14 kHz). Inter-modulation of these can produce further frequencies and the sound can become badly damaged!
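The side band arithmetic in the 18 kHz example can be checked with a couple of lines (a sketch; the helper name is mine):

```python
# Sketch: the first pair of side band components created when a tone is sampled.

def images(f_tone_hz, f_sample_hz):
    """Lower and upper side band frequencies around the sample frequency."""
    return f_sample_hz - f_tone_hz, f_sample_hz + f_tone_hz

# The 18 kHz tone sampled at 32 kHz from the text:
lower, upper = images(18_000, 32_000)
print(lower, upper)  # the 14 kHz lower component falls back into the audio band
```

The same helper confirms the earlier 25 kHz example: `images(25_000, 48_000)` puts the lower side band at 23 kHz, overlapping the audio band.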
To prevent aliasing in a simple converter, the audio is passed through a low pass filter with a rapid roll off above 20 kHz. Such filters can introduce irregularities to the frequency response below 20 kHz and it is also difficult to avoid undesirable phase shifts.
If the sample rate is higher, the side bands move further away from the audio band and simpler filters with more gentle slopes can be used. Choosing a rate of twice, or any other power of two times the rate finally required allows very simple rate conversion to the 44.1 or 48 kHz finally required.
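The "simple rate conversion" from a doubled rate amounts to low pass filtering and then discarding alternate samples. A minimal sketch (the crude two-point average stands in for a proper half band filter):

```python
# Sketch: 2:1 rate reduction - filter, then keep every other sample.

def decimate_by_two(samples):
    # Crude low pass: average adjacent pairs (a real design would use a half band filter)
    filtered = [(a + b) / 2 for a, b in zip(samples, samples[1:])]
    return filtered[::2]  # discard alternate samples to halve the rate

print(decimate_by_two([0, 1, 2, 3, 4, 5, 6, 7]))  # -> [0.5, 2.5, 4.5, 6.5]
```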
Digitising a wave form involves converting the amplitude to a numeric value. The accuracy of the sampling is determined by the number of discrete levels into which the signal is "quantised".
The 16 bit audio now in general use on compact discs provides 2^16 = 65,536 discrete levels. The twenty four bit format used in some professional equipment gives 2^24 = 16,777,216 levels. The more levels, the greater the accuracy, but to save space this illustration shows a wave shape being quantised into only sixteen discrete levels. Four bits are therefore sufficient to describe these 16 levels. The first few samples have values of 7, 4, 2 and 1. Obviously, these quantised values can be some distance from the true analogue value and this difference represents a "quantising noise", superimposed on the analogue signal.
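A four bit quantiser like the one in the illustration can be sketched directly (the sine wave and helper names here are illustrative, not the figure's actual samples):

```python
import math

# Sketch: quantising a waveform into sixteen discrete levels (4 bits).

def quantise(x, bits=4):
    """Map x in the range -1.0..+1.0 onto 2**bits levels (0..15 for 4 bits)."""
    levels = 2 ** bits
    level = int((x + 1.0) / 2.0 * levels)
    return min(level, levels - 1)  # clamp the +1.0 endpoint into the top level

samples = [math.sin(2 * math.pi * n / 8) for n in range(8)]
codes = [quantise(s) for s in samples]
print(codes)

# Quantising error: distance from the centre of the coded level to the true value.
errors = [(c + 0.5) / 16 * 2 - 1 - s for c, s in zip(codes, samples)]
assert all(abs(e) <= 1 / 16 for e in errors)  # within half a quantising step of a level centre
```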
Quantising noise is only heard when a signal is present, but even with larger numbers of bits, low level signals such as background ambience may be of an amplitude that produces quite audible quantising noise. As this noise is signal related, it can be obtrusive, though the worst quantising noise is equal to just less than the size of one quantising step. Masking principles can therefore be applied, and adding a white noise signal with an amplitude equal to the quantising step, i.e. the least significant bit, can be an effective mask. This added noise is known as dither.
Whenever the length of a digital word is changed, perhaps to record a 20 bit signal on a 16 bit DAT machine, the benefits of whatever dither may have been applied are lost. In this case, the dither that affected the 20th bit of the digital word is lost when the four least significant bits are removed, so new dither, appropriate to the 16 bit format, should be applied. This often does not happen when machines are fed signals of the wrong word length, and the results can be very audible.
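The word length reduction can be sketched in a few lines (illustrative only - real redithering would use a carefully shaped dither signal rather than this simple uniform one):

```python
import random

# Sketch: shortening a 20 bit word to 16 bits. Plain truncation drops the four
# least significant bits and loses the benefit of any earlier dither; adding new
# dither scaled to the 16 bit step before shortening restores it.

def truncate_20_to_16(sample_20bit):
    return sample_20bit >> 4  # drop the four least significant bits

def redither_20_to_16(sample_20bit, rng=random):
    step = 16  # one 16 bit quantising step, expressed in 20 bit units
    dithered = sample_20bit + rng.randint(-step // 2, step // 2)  # fresh dither
    return dithered >> 4

print(truncate_20_to_16(43695))
```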
Analogue audio signals obviously have AC wave forms, but simple digital values just count from zero to the maximum limit. There are therefore several ways to handle the changing polarity and amplitude. Although this is very much a "within the box" process, it is useful to appreciate why certain choices are made, not least when it comes to interpreting makers' sales brochures.
The alternative codings are shown for the four bit system against the quantising levels diagram.
Offset binary
This is the simplest format and is the format output by many A-D converters, i.e. the chips used inside the box - not what gets to an external socket! 0000 represents the peak negative value of the audio wave form and silence is coded half way up the range of values.
This can be a problem as silence now effectively has a DC offset. If two sources of digital silence are added together using offset binary, the result is a peak level DC signal. This format requires a "DC offset" of half the peak value to be subtracted from each signal before signal processing can be carried out.
Two's complement
This uses 1000 for the most negative value and 0111 for the most positive. Silence is 0000 and adding signals together can be done using a simple binary addition process. This can therefore be a convenient format for signals which are to undergo any form of signal processing.
Sign and magnitude
The first bit becomes a sign bit with 0 representing positive values and 1 negative values. Increasing magnitude causes the remaining bits to increment from 0000. This form of coding is used in Nicam broadcast systems.
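The three codings can be compared side by side in a short sketch (a four bit illustration; the helper names are mine):

```python
# Sketch: one four bit quantised level expressed in the three codings described
# above. "level" runs from -8 (most negative) to +7 (most positive); note that
# sign and magnitude can only reach -7, since there is no second code for -8.

def offset_binary(level):
    return format(level + 8, '04b')       # peak negative is 0000, silence 1000

def twos_complement(level):
    return format(level & 0b1111, '04b')  # silence is 0000, most negative 1000

def sign_and_magnitude(level):
    sign = '0' if level >= 0 else '1'     # first bit: 0 positive, 1 negative
    return sign + format(abs(level), '03b')

for level in (-7, -1, 0, 7):
    print(level, offset_binary(level), twos_complement(level), sign_and_magnitude(level))
```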
The quantised signal level is encoded serially with a variety of other data to create a sub-frame. Two sub-frames, one for channel one and one for channel two, make up an AES/EBU frame. Each frame begins with a preamble which is a violation of the "Manchester" or "bi-phase" coding system which is described fully in the article on Digital Audio Synchronisation. The start of a preamble is indicated by a sequence of three "1"s.
Three different preambles, X, Y and Z are defined in the AES specification. The X preamble follows the three "1"s with a 00010 pattern as shown above. The Y and Z preambles also have their own distinctive signatures.
An X preamble is used to identify a sub-frame 1 - typically a left channel and a Y to identify a sub-frame 2 - typically a right channel. The Z preamble is used every 192 frames to replace the X form which would otherwise have been sent and this indicates the start of a new block of 192 frames (i.e. 384 sub-frames).
The sub-frame is divided into thirty two time slots. The preamble occupies the first four bits (bits 0 - 3). The next twenty four can be used entirely for audio data or for four bits of auxiliary data and up to twenty bits of audio.
Information is carried in the status bits to indicate how many bits are not being used to carry audio data when this is less than the maximum of twenty or twenty four bits.
One use for the aux data is to transmit a narrow band "voice quality" channel. This is done by performing a sampling process on the voice signal at one third of the main sample rate and to only 12 bit resolution. This typically gives a 16 kHz sampling rate, allowing the aux data bits of three successive sub-frame 1s to be used to carry the twelve data bits of the voice channel. As there are also sub-frame 2s available, the system has the potential to carry two similar voice grade channels.
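The packing of one 12 bit voice sample into three 4 bit aux fields can be sketched as follows (the nibble ordering here is illustrative - the actual bit allocation is defined in the AES specification):

```python
# Sketch: one 12 bit voice sample split across the aux fields of three
# successive sub-frame 1s, then reassembled at the far end.

def pack_voice_sample(sample_12bit):
    """Split a 12 bit value into three 4 bit aux nibbles (least significant first)."""
    return [(sample_12bit >> shift) & 0xF for shift in (0, 4, 8)]

def unpack_voice_sample(nibbles):
    return nibbles[0] | (nibbles[1] << 4) | (nibbles[2] << 8)

nibbles = pack_voice_sample(0xABC)
print(nibbles)
assert unpack_voice_sample(nibbles) == 0xABC  # round trip recovers the sample
```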
The final four bits of each sub-frame are labelled V, U, C and P.
V is normally set to 0 to indicate that the data is valid audio data. If the channel is carrying non audio data it may be set to 1.
U is a user bit. Recording and transmission systems should pass user bits without alteration so that purpose designed decoders can extract whatever information has been placed there. Many systems will use multiple frames, possibly but not necessarily in the 192 frame block structure, to convey longer words.
C is the channel status bit. The block structure allows the status bits from 192 "sub-frame 1"s to be extracted and built into twenty four words, each of eight bit length. A second set of twenty four eight bit words is built from the other sub-frames. The AES specification defines the use of status bits.
P is a parity bit set to either "0" or "1" to ensure that over bits 4 - 31, there are an even number of "1"s.
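The parity rule can be shown with a small sketch (the sub-frame is modelled as a plain list of 32 slot values; the representation is mine, not the wire format):

```python
# Sketch: setting the P bit so that bits 4 - 31 of a sub-frame contain an even
# number of ones. Slot 31 is P itself, so parity is computed over slots 4 - 30.

def even_parity_bit(subframe_slots):
    """Return the P bit for a 32 entry list of 0/1 slot values."""
    ones = sum(subframe_slots[4:31])  # audio data plus V, U and C
    return ones % 2                   # 1 if the count so far is odd, making the total even

slots = [0] * 32
slots[4:8] = [1, 1, 1, 0]            # some example data bits
slots[31] = even_parity_bit(slots)
assert sum(slots[4:32]) % 2 == 0     # bits 4 - 31 now hold an even number of ones
print(slots[31])
```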
The twenty four words are labelled as "byte 0" to "byte 23" and are most easily viewed as an information matrix. The first six bytes are used for information which can be fitted into shorter words so the bytes are subdivided into shorter words as shown.
The areas shown as "R" are reserved for future definition.
There is not space here to reproduce the entire AES status bit specification, though this can be found in the AES3 documentation. In summary the first six bytes contain information about the audio which is being carried. This includes data on the sample frequency and information on whether emphasis has been used, whether the data is two channel mono or stereo, if four bits are being used for aux data and how many bits of the remaining audio data are actually being used.
The remaining bytes can include source and destination address information as 7 bit ASCII characters and time of day and address sample code in 32 bit binary form.
The first bit of the first byte (bit a) is set to a "1" whenever the data stream is an AES/EBU professional one. The alternative SPDIF (Sony-Philips Digital Interface) has many similarities to AES/EBU including the use of bi-phase coding, preambles and block structure. An important difference is in the use of the status bits. When the first bit (bit a) is "0" the remaining data in that byte is used quite differently, being divided into six one bit words and one two bit word. Information on emphasis, two channel/stereo, etc. is contained here, along with a range of source format listings.
Bytes 4 - 23 of the channel status block have no function within the SPDIF format.
AES signal interfaces are specified in the document AES3-1992 and are carried on balanced, transformer coupled circuits, giving a useful ground isolation. The bi-phase coded signals are well above the normal audio band, so small transformers, often of only a few turns, are adequate.
Standard XLR3 type connectors are the norm with a 2 - 10 volt data signal on pins 2 and 3 and pin 1 normally grounded at both ends of the circuit to ensure optimum RF shielding. The circuit impedances should be 110 ohms whenever possible for correct matching, but many systems can be very forgiving! Digital audio cable of 110 ohm impedance is fine for circuits of a few hundred metres and even longer, though equalisation can sometimes be needed on very long runs.
The 48 kHz sampling rate with up to 24 bits of data as defined in the AES3-1992 standard is adequate for many purposes but new sampling concepts are being devised and some manufacturers are now offering devices with 96 kHz sampling rates and potentially 24 bits of data. Various proposals have been made to convey this data such as doubling the clock rate of the AES/EBU data stream whilst retaining the same internal structure or using two AES/EBU circuits rather than one. The issue can become further complicated by the need to support additional sampling rates such as the 18.9 kHz and 37.8 kHz used in multimedia systems.
Doubling the clock rate seems a simple option for 96 kHz operation but all devices that handle that data stream must be suitably modified if the data is to survive. This can be all right for very simple systems but may be impossible to achieve in a large broadcast installation!
Using two dual channel AES circuits to carry one stereo signal is another option and proposals have been made to use some of the "reserved" bits to carry additional status data to identify what information is carried on which channel within the group of four channels that would have existed if the two links were used conventionally.
The AES standard defined in 1992 still provides a format which is satisfactory for most of today's needs, but the time is coming when advancing technology will leave it behind. There are now systems with more bits and higher sample rates, giving the potential of wider bandwidths and lower noise. With these comes the danger that new equipment will have to adopt non standard data formats, with all the incompatibility that would bring. Hopefully the standards committees will move to prevent this happening, but we wait to see!
All material is copyright PHM © 2004.
P H M (P H Music) :
Ramsbottom : UK
tel: +44 (0)7799 621954