Hearing soundtracks in three dimensions.
Spatial audio, also known as 3D audio or object-based audio, adds a third dimension (height) to a surround sound system. Usually one or more pairs of height speakers are added (2, 4, or 6 height speakers is typical). The number of height speakers is the third number in speaker layout notation: 7.1.4, for example, indicates 7 speakers at listener level, 1 subwoofer, and 4 height speakers.
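The layout notation can be read mechanically. The sketch below is only an illustration: the names `SpeakerLayout` and `parse_layout` are hypothetical, and it assumes the notation always has two or three dot-separated whole numbers.

```python
from typing import NamedTuple


class SpeakerLayout(NamedTuple):
    ear_level: int   # speakers at listener level
    subwoofers: int  # subwoofer count
    height: int      # height speakers (0 when the third number is absent)

    @property
    def total_speakers(self) -> int:
        return self.ear_level + self.subwoofers + self.height


def parse_layout(notation: str) -> SpeakerLayout:
    """Parse 'ear.sub' or 'ear.sub.height' notation such as '7.1.4'."""
    parts = notation.strip().split(".")
    if len(parts) not in (2, 3) or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognized layout notation: {notation!r}")
    ear_level, subwoofers = int(parts[0]), int(parts[1])
    height = int(parts[2]) if len(parts) == 3 else 0
    return SpeakerLayout(ear_level, subwoofers, height)


layout = parse_layout("7.1.4")
print(layout.ear_level, layout.subwoofers, layout.height)  # 7 1 4
print(layout.total_speakers)                               # 12
```

A layout written without a third number, such as 5.1, is treated here as having no height speakers.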
Humans hear with two ears (binaural), giving our brains the ability to determine the location of the source of each sound we hear. There are several different ways that our brains do this, and these methods all work together in the human auditory system.
The Interaural Time Difference (ITD) is the difference in a sound’s arrival time at each ear. The ITD gives our brain a strong cue as to whether the sound is coming from the left or the right. For example, if a sound arrives at your right ear first, it is coming from the right. For a continuous tone, this arrival-time difference shows up as a phase difference between the two ears, which tells the listener how far to the right or left the sound is, as opposed to straight ahead or behind. Below 800 Hz, your head is smaller than half the wavelength of the sound, so the human auditory system can easily determine the phase delay between the ears. Below 80 Hz, however, the phase difference becomes too small relative to the wavelength for a directional evaluation. Above 1600 Hz, the dimensions of the head are greater than the wavelength of the sound, so the phase difference becomes ambiguous and direction can no longer be determined from it alone.
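To make these frequency limits concrete, here is a rough sketch that compares wavelengths with the size of a head and estimates the ITD for a few directions. It rests on assumptions that are not from the text above: a speed of sound of 343 m/s, an ear-to-ear spacing of 0.215 m, and a simple straight-path model that ignores how sound bends around the head.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly room temperature
EAR_SPACING_M = 0.215       # assumption: a typical ear-to-ear distance


def wavelength_m(frequency_hz: float) -> float:
    """Wavelength in meters of a sound at the given frequency."""
    return SPEED_OF_SOUND_M_S / frequency_hz


def itd_seconds(azimuth_deg: float) -> float:
    """Straight-path ITD estimate. 0 deg is straight ahead, 90 deg is fully right."""
    return EAR_SPACING_M * math.sin(math.radians(azimuth_deg)) / SPEED_OF_SOUND_M_S


for frequency in (80, 800, 1600):
    print(f"{frequency} Hz: wavelength {wavelength_m(frequency):.3f} m")

for azimuth in (0, 30, 90):
    print(f"{azimuth} deg: ITD {itd_seconds(azimuth) * 1e6:.0f} microseconds")
```

With these assumed values, the wavelength at 1600 Hz (about 0.21 m) is roughly the assumed ear spacing, and half the wavelength at 800 Hz is about the same length, which is where the two thresholds come from.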
The Interaural Intensity Difference (IID) or Interaural Level Difference (ILD) is the amplitude difference of a sound at each ear. For example, if a sound is coming from the right, it will not only arrive at your right ear first, but it will also be slightly louder in your right ear. This difference is mainly due to head shadowing. Level differences are larger at higher sound frequencies; they are very low below 800 Hz and especially low below about 200 Hz. At these lower frequencies, it’s therefore difficult to determine input direction on the basis of level differences alone.
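Level differences are usually expressed in decibels. The sketch below uses the standard amplitude-ratio formula; the function name `ild_db` and the sample amplitudes are illustrative only, not measurements.

```python
import math


def ild_db(left_amplitude: float, right_amplitude: float) -> float:
    """Level at the right ear relative to the left ear, in decibels."""
    if left_amplitude <= 0 or right_amplitude <= 0:
        raise ValueError("amplitudes must be positive")
    return 20.0 * math.log10(right_amplitude / left_amplitude)


# Illustrative values: the right ear receives twice the amplitude of the left.
print(f"{ild_db(1.0, 2.0):.2f} dB")  # 6.02 dB
# Equal amplitudes give no level cue at all.
print(f"{ild_db(1.0, 1.0):.2f} dB")  # 0.00 dB
```

A positive result means the sound is louder at the right ear, and a negative result means it is louder at the left.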
The Pinna Filtering Effect describes how sound reflected off your pinna (the part of your ear that protrudes from your head) is altered in frequency content and arrival time depending on which part of the pinna each reflection bounces off. The amplitude and frequency spectrum of the sound reflected off the pinna helps us localize sounds overhead.
Spatial Audio Formats
Two commercial formats are used both in professional theaters and for home video: Dolby Atmos and DTS:X. Both are systems that render multiple audio channels in three dimensions.
Home video distribution started with VHS and Betamax tape, and these formats included stereo audio.
Both Dolby Atmos and DTS:X can use either lossy compression (Dolby Digital Plus or DTS-HD) or lossless compression (Dolby TrueHD or DTS-HD MA). Lossless Dolby Atmos or DTS:X requires roughly 8 to 10 times the bit rate of lossy compression. Only 4K Blu-ray and Kaleidescape deliver lossless Dolby Atmos and DTS:X. Streaming video services use lossy compression.
To enable the apparent source of a sound to be positioned almost anywhere in a room, and not just along a 2-dimensional plane (left to right, front to back), Dolby developed Atmos. Atmos adds the concept of height speakers to a surround sound system, making it 3 dimensional. These speakers can be placed on the ceiling, or they can be added to the top of existing speakers, aimed upwards at an angle in order to reflect their sound output off of a hard, flat ceiling.
An Atmos audio mix starts with a “bed,” which is a stereo or surround sound mix with defined speaker positions. The bed mix can be 2.0, 3.0, 5.0, 5.1, 7.0, 7.1, 7.0.2, or 7.1.2. The Dolby Atmos Renderer in your AV Receiver or Audio Processor will decode and render the bed mix, mapping it to your speaker configuration (which you must also configure in your AV Receiver or Audio Processor).
Atmos can encode object audio channels. Atmos allows for up to 118 discrete audio objects, which are sound sources encoded with size and position metadata. Audio object channels can change their position over time. So, for example, the sound coming from a helicopter can be perceived as flying from the front left corner of your room to the back right corner, right over your head. These audio object channels are decoded and mixed into the bed mix, with the volume of an audio object calculated for each speaker in your system in order to simulate the positioning of that object in your room.
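The idea of calculating a volume for each speaker from an object's position can be illustrated with a simple inverse-distance panner. This is not Dolby's rendering algorithm, which is not described here; the speaker coordinates, the flyover path, and the names `object_gains` and `helicopter_position` are all hypothetical.

```python
import math

# Hypothetical speaker positions as (x, y, z) in normalized room coordinates:
# x runs left to right, y front to back, z floor to ceiling.
SPEAKERS = {
    "front_left": (0.0, 0.0, 0.0),
    "front_right": (1.0, 0.0, 0.0),
    "back_left": (0.0, 1.0, 0.0),
    "back_right": (1.0, 1.0, 0.0),
    "top_left": (0.25, 0.5, 1.0),
    "top_right": (0.75, 0.5, 1.0),
}


def object_gains(position, speakers=SPEAKERS, min_distance=0.05):
    """Inverse-distance gains, scaled so the squared gains sum to 1."""
    raw = {}
    for name, speaker_position in speakers.items():
        # Clamp the distance so an object sitting on a speaker does not divide by zero.
        distance = max(math.dist(position, speaker_position), min_distance)
        raw[name] = 1.0 / distance
    norm = math.sqrt(sum(gain * gain for gain in raw.values()))
    return {name: gain / norm for name, gain in raw.items()}


def helicopter_position(t):
    """Hypothetical flyover at ceiling height, with t running from 0.0 to 1.0."""
    start, end = (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)
    return tuple(s + (e - s) * t for s, e in zip(start, end))


for step in range(5):
    t = step / 4
    gains = object_gains(helicopter_position(t))
    loudest = max(gains, key=gains.get)
    print(f"t={t:.2f} loudest speaker: {loudest}")
```

Scaling the gains so their squares sum to 1 keeps the total acoustic power roughly constant as the object moves, so the helicopter does not get louder or quieter simply because it passes near a speaker.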
Dolby Atmos is also used in professional theaters, where the theatrical mix can support more speakers in the room. The maximum number of speakers for a consumer Dolby Atmos system is 24.x.10. Only the Trinnov Altitude 32 can render Dolby Atmos to 24 ear-level speakers and 10 height speakers. The low frequency effects channel (the .x. in the notation) is always a single channel, since listeners cannot sense the direction of a very low frequency sound source.
If a room is configured with no center speaker, then for bed configurations of 3.0 and greater your AV Receiver or Audio Processor will try to create a phantom image of the bed's Center channel using the available left and right speakers. Height channels are handled in a similar way. In an x.y.4 configuration, the two height channels of an x.y.2 bed are phantom imaged between the front and back height speakers, whereas in an x.y.6 configuration they are sent to the middle height speakers (directly over the head of the listener).
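Phantom imaging a center channel amounts to mixing it equally into the left and right channels. The sketch below assumes a gain of 1/√2 (about -3 dB), a common constant-power choice; the gain your AV Receiver or Audio Processor actually uses may differ, and the function name `fold_center` is hypothetical.

```python
import math

# Assumption: a constant-power gain of about -3 dB per side.
PHANTOM_GAIN = 1.0 / math.sqrt(2.0)


def fold_center(left, center, right):
    """Mix the center channel equally into the left and right sample lists."""
    new_left = [l + PHANTOM_GAIN * c for l, c in zip(left, center)]
    new_right = [r + PHANTOM_GAIN * c for r, c in zip(right, center)]
    return new_left, new_right


# Dialogue present only in the center channel ends up equally in both speakers,
# so it is heard as coming from the point midway between them.
left, right = fold_center([0.0, 0.0], [1.0, 0.5], [0.0, 0.0])
print(left, right)
```

Because both speakers play the same signal at the same level, the ITD and ILD cues at the listening position match those of a real source placed between them.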
DTS:X has nearly identical capabilities to Dolby Atmos. It combines a bed mix with object audio channels whose size and position metadata can change over time. DTS:X can use either DTS-HD lossy compression or DTS-HD MA (Master Audio) lossless compression. DTS:X can be rendered to a maximum of 11.2 speakers.
Setting up a Spatial Audio System
Spatial audio systems require 3 things. First, you need a media source that supports Dolby Atmos or DTS:X audio. Second, you need an AV Receiver or Audio Processor to decode and map the spatial audio to all of the channels in your audio system. Lastly, you need a speaker system with a sufficient number of speakers, including height speakers.
A Dolby Atmos capable system can render Dolby Atmos audio to as few as 2.x.1 speakers, but for reasonably good spatial accuracy you will want at least 5.x.2 speakers. A 7.1.4 speaker system will provide meaningfully better accuracy front to back and left to right than a 5.1.2 system: it adds a pair of rear surround speakers, and it uses 2 pairs of height speakers, with one pair in front and one in back, adding another dimension to the height speakers.
Your media source will do one of two things. It may decode the Dolby Atmos or DTS:X audio itself and pass it through the HDMI output to your preamplifier as decoded multichannel PCM audio (but this is typically limited to 5.1 channels). Alternatively, it will bitstream the encoded Dolby Atmos or DTS:X signal to your AV Receiver or Audio Processor, which will decode and map the audio to all of the amplifier+speaker channels in your system.
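The speaker-count guidance in this section (2.x.1 as the minimum, 5.x.2 or more for reasonably good spatial accuracy) can be written as a small check. The function name `classify_layout` and its return labels are hypothetical and serve only to restate those thresholds.

```python
def classify_layout(ear_level: int, height: int) -> str:
    """Compare a speaker layout against the thresholds given in this section."""
    if ear_level < 2 or height < 1:
        return "below minimum"
    if ear_level >= 5 and height >= 2:
        return "recommended"
    return "minimum"


for ear_level, height in ((2, 1), (5, 2), (7, 4), (5, 0)):
    print(f"{ear_level}.x.{height}: {classify_layout(ear_level, height)}")
```

The subwoofer count is left out on purpose, since the thresholds above do not depend on it.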