IMMERSIVE AUDIO EXPLAINED: FOUNDATIONS, FORMATS, AND THEATER WORKFLOWS

01. Executive Summary

In recent years, immersive sound has become one of the most talked-about concepts in the audio industry, and a cluster of related terms has appeared alongside it, such as Dolby Atmos, omnidirectional sound, 3D audio, and spatial audio. Immersive sound is an audio technology aimed at creating a realistic, enveloping listening experience. By simulating and reproducing the way sound propagates in the real world, it lets listeners perceive sounds arriving from different directions and distances, which enhances the realism of the audio.

In daily life, we are surrounded by sounds from different sources, directions, and environments, "immersed" in the most primitive and most realistic sound field there is. This natural sound field is the ultimate reproduction target that immersive sound technology pursues.

Steve Ellison, Director of Spatial Audio at Meyer Sound, stated: "The benefit of immersive audio is that you can create something that truly engages the audience and lets them listen in a different way."

Marc Lopez, Vice President of Marketing Americas at d&b audiotechnik, stated: "Immersive audio in AV helps achieve a broader goal, which is to maintain engagement in applications requiring special attention to the stage (such as theaters, churches, or even lecture halls)."

Compared to ordinary PA systems, immersive audio can connect listeners with the experience in a more intimate way. The auditory "sweet spot" is no longer limited to a small group of people in the center of the space; instead, everyone becomes part of the listening experience. David Dohrmann, Director of Application Projects at L-Acoustics, stated that with immersive audio, "the results are more realistic, emotional connection is better," and "speakers disappear from perception, and visitors are immersed in the experience."

02. The Concept of Immersive Sound

Although the industry currently has no unified definition of immersive sound or immersive sound systems, the meaning of the term in both Chinese and English, combined with the history of audio engineering and the current state of the technology, suggests that immersive sound refers to a sound field that presents sound information in three dimensions: left and right (horizontal), front and back (depth), and up and down (height). The term is mostly used in the context of sound reinforcement, and the ultimate goal is to reconstruct or replicate, through technical means, the auditory perception of real sound scenes in daily life.

Figure 01. Immersive Audio System Laboratory Testing

03. The Development Process of Immersive Sound

The evolution of immersive sound systems is an important manifestation of the progress of audio technology, aimed at providing a more realistic and three-dimensional auditory experience. The following are several key stages in the development of immersive sound systems:

3.1 From Monaural to Stereo

The origins of immersive sound can be traced back to the era of monaural audio. In 1876, Bell obtained the patent for the telephone, which incorporated the first electric loudspeaker, opening the history of electrical sound transmission and playback. As technology developed, stereo gradually became popular in the mid-20th century and became the standard form of audio playback. Stereo provides a richer listening experience through two channels (left and right), but it still cannot fully simulate sound localization in the real world. Theaters later adopted three-channel systems using left, center, and right.

Figure 02. Schematic Diagram of Mono and Stereo

3.2 The Introduction of Surround Sound

Building on stereo, surround sound formats such as 5.1 and, later, 7.1 became popular from the 1980s and 1990s onward. Surround sound systems use multiple loudspeakers placed around the listener to create an enveloping sound field, making the apparent sources of sound more varied and realistic. Progress at this stage significantly enhanced the sound experience in movies and home theaters.

Figure 03. Schematic Diagram of Surround Sound

3.3 The Introduction of the Immersive Sound Concept and Technological Development

In the 1970s, the emergence of Ambisonics technology (see appendix for details) marked another leap in audio technology. Immersive sound does not rely only on the number of channels; it also introduces an "object-based" approach to audio processing. This approach lets audio engineers treat a sound as an independent object during production rather than fixing it to specific audio channels. As an object, a sound can move freely in three-dimensional space, which greatly reduces the engineer's workload.

Figure 05. The Development Timeline of Immersive Audio

04. Audio Formats of Immersive Sound

In practical system design, immersive audio formats usually fall into three broad families: channel-based, object-based, and scene-based audio. Each one has different implications for production workflow, portability, and deployment.

4.1 Channel-based Audio (CBA)

Channel-based audio extends the logic of traditional surround formats. Specific program material is assigned to specific channels, and those channels assume a matching loudspeaker layout at playback.

Strength: predictable playback when production and venue layouts match exactly.
Limitation: poor portability and limited flexibility when the installed loudspeaker geometry differs from the original mix environment.
Typical use: standardized cinema, broadcast, or other controlled playback environments.

4.2 Object-Based Audio (OBA)

Object-based audio treats a sound source as an independent object with positional metadata. The rendering engine then maps that object to the available loudspeaker array in real time.

This approach is especially powerful in theaters and immersive venues because it allows a sound image to move naturally through space without forcing the designer to think only in channel assignments.

The engineer can work from the point of view of the sound source, not the loudspeaker.
Spatial motion becomes more coherent and easier to automate.
The same creative concept can be adapted to different deployment geometries.
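
The idea of rendering one object to whatever loudspeakers are available can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `AudioObject` structure, the coordinate convention, the two layouts, and the simple inverse-distance gain law are hypothetical and do not represent any vendor's rendering engine.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    """A sound source plus positional metadata (hypothetical structure)."""
    name: str
    x: float  # metres, stage left (-) to stage right (+)
    y: float  # metres, downstage (0) to upstage (+)


def render_gains(obj, speakers, rolloff=1.0, min_dist=0.5):
    """Return one gain per loudspeaker, normalised to constant total power.

    Illustrative inverse-distance law only; real renderers use more
    sophisticated panning algorithms.
    """
    raw = []
    for sx, sy in speakers:
        dist = max(math.hypot(obj.x - sx, obj.y - sy), min_dist)
        raw.append(1.0 / dist ** rolloff)
    norm = math.sqrt(sum(g * g for g in raw))
    return [g / norm for g in raw]


# The same object rendered to two different (made-up) frontal layouts.
singer = AudioObject("singer", x=-3.0, y=1.0)
layout_a = [(-4.0, 0.0), (0.0, 0.0), (4.0, 0.0)]
layout_b = [(-6.0, 0.0), (-3.0, 0.0), (0.0, 0.0), (3.0, 0.0), (6.0, 0.0)]

gains_a = render_gains(singer, layout_a)
gains_b = render_gains(singer, layout_b)
```

The object's metadata stays the same in both cases; only the loudspeaker list passed to the renderer changes, which is the portability argument made above.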

4.3 Scene-Based Audio (SBA)

Scene-based audio represents the sound field as a whole rather than as individual channels or objects; Ambisonics (see appendix) is the best-known example. This format is typically used in virtual reality (VR) and games, where the direction and intensity of sound can be adjusted dynamically according to the user's perspective and position, enhancing the sense of immersion.

Modern theater immersive sound systems rely mostly on object-based audio, which is also the direction in which the field is developing.

05. Core Technologies Behind Immersive Audio

The core of immersive sound technology lies in its ability to reconstruct the spatial characteristics of sound. When receiving sound, the human ear not only hears the loudness and pitch of the sound but also perceives the direction and distance of the sound source. This perception relies on the propagation characteristics of sound in space, including phenomena such as reflection and absorption of sound waves. Immersive sound systems simulate these spatial characteristics through multi-channel audio and complex signal processing technology, enabling listeners to feel the three-dimensionality and depth of the sound.

5.1 Real-Time Source Tracking

Take the BlackTrax tracking system as an example. It can identify and locate objects in a stage performance, including actors, props, and projection screens. Once the system is configured, each actor only needs to wear a tracking belt pack; multiple sensors arranged around the stage then locate the actors using infrared optical positioning. Without any operator intervention, BlackTrax identifies their positions and sends the position data to the server over the network, so the coordinates of the sound images on stage are available in real time. The positioning accuracy of the system is at the millimeter level, with a maximum error of about 6 mm.

5.2 Spatial Rendering and Localization Algorithms

Based on the real-time coordinates of the sound images on stage, a sound positioning algorithm (such as the publicly documented Ambisonics approach, detailed later) calculates the positional relationship between the sound source and the listener, and determines the level and delay of each audio channel according to the layout of the loudspeakers in the venue. As a result, the sound emitted by the loudspeakers lets the listener accurately perceive where the sound comes from. The same algorithm can also take virtual sound image signals (such as helicopters or birds) and their trajectories and make the sound appear to fly through the air.
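
To make the "level and delay per channel" step concrete, here is a minimal sketch under stated assumptions. It is not the Ambisonics algorithm and not any product's method: it simply treats each loudspeaker as if the sound had travelled from the source position to that loudspeaker. The function name `speaker_feeds`, the coordinates, and the 1/distance level law are all hypothetical choices for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at about 20 °C


def speaker_feeds(source, speakers, ref_dist=1.0):
    """Return (gain, delay_seconds) per loudspeaker for a virtual source.

    Level falls off with 1/distance and delay equals the source-to-speaker
    distance divided by the speed of sound.
    """
    feeds = []
    for spk in speakers:
        dist = math.dist(source, spk)
        gain = ref_dist / max(dist, ref_dist)  # clamp so gain never exceeds 1
        delay = dist / SPEED_OF_SOUND
        feeds.append((gain, delay))
    return feeds


# Hypothetical coordinates in metres (x, y, z); in a live setup the source
# position would come from the tracking system.
actor = (1.0, 2.0, 1.7)
speakers = [(-5.0, 0.0, 6.0), (0.0, 0.0, 6.0), (5.0, 0.0, 6.0)]

for (gain, delay), spk in zip(speaker_feeds(actor, speakers), speakers):
    print(f"speaker {spk}: gain {gain:.3f}, delay {delay * 1000:.1f} ms")
```

Loudspeakers closer to the source receive a louder, earlier signal, which is one simple way to pull the perceived image toward the source position.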

5.3 An Algorithm for Electronic Variable Reverberation

An electronic variable reverberation system uses distant microphone pickup and electroacoustic means to adjust the reverberation time and other acoustic characteristics of an existing venue. Without changing the physical structure of the theater, it can give a hall with a short reverberation time more room reflection energy at various levels and a longer reverberation time as needed, thereby meeting the acoustic demands of different performance forms. Professional audio companies have developed this technology to a relatively mature state, so it is not discussed further here.

06. Classification of Theater Immersive Sound

Not every theater production requires the same depth of immersive deployment. In practice, the architecture of an immersive theater system can be understood in tiers, depending on what the production must communicate.

6.1 Frontal Immersive Systems

A frontal system usually refers to multiple sets of loudspeakers arranged at the front of the stage (or performance area) and facing the audience; in other words, it is the sound image positioning system for live performers on stage.

If the frontal loudspeakers can accurately reproduce each sound source (including its timbre and spatial position), and the electroacoustic system is not required to reconstruct the acoustic characteristics of the surrounding space, then this kind of frontal system can also be called immersive sound.

Take a pop concert held at an open outdoor venue as an example. First, the space can be treated approximately as a free field, with no reflections from above, from the sides, or from the rear. Second, the performers are all on stage, and no performance sound sources are located beside, behind, or above the audience. In this case, a good frontal system can reproduce a sound scene that closely matches what the audience sees.

In general theater performances, the actors all perform on stage, and the reflections provided by the auditorium generally match the acoustic environment one would expect for actors in a normal venue. As long as the plot does not rely heavily on cinematic sound effects (a helicopter gradually approaching from a distance, bird calls rising and falling in a forest, birds suddenly taking off nearby and flying away, and so on), the frontal system therefore also qualifies as immersive sound for general theater scenes.

6.2 Frontal System + Sound Effect System

In some theater performances, the plot needs to showcase sound effects, so the immersive system must include at least a frontal system plus a sound effect or surround system. An example is a forest adventure production that needs to present the sound of a babbling stream, the roars of beasts in the distance, and birds being startled and suddenly flapping away.

6.3 Full Environmental Systems

In some theater performances, the plot needs to present not only sound effects but also the ambient sound (reverberant sound field) of special places that differ from ordinary ones, such as a bathroom, church, or cave. In this case, the immersive system must combine a frontal system, a sound effect system, and an environmental sound system (an electronically adjustable reverberation system).

07. Advantages of Immersive Sound

Immersive audio is attractive not because it is fashionable, but because it offers concrete operational and perceptual advantages over conventional sound reinforcement when properly deployed.

7.1 Three-Dimensional Sound Effect Experience

Immersive sound technology can accurately locate sound in three-dimensional space, enabling listeners to feel sound effects coming from different directions, creating a more realistic auditory environment. This all-around sound experience transcends traditional stereo and surround sound systems, which usually can only provide a limited sense of direction.

7.2 Better Sound Coverage and Positioning

Immersive sound technology, by deploying wide-angle, multi-channel speaker systems, can achieve more even sound coverage over a larger area, greatly increasing the size of the optimal listening area in a theater or live venue. Traditional audio systems usually have limitations in speaker layout, while immersive sound systems can flexibly adjust the number and position of speakers to adapt to different venue requirements, thereby improving the positioning accuracy and consistency of the sound.

7.3 Flexible Audio Production and Delivery

Immersive sound technology supports object-based audio production, which means that audio files can contain more metadata, such as spatial coordinates and acoustic characteristics. This flexibility allows audio engineers to easily adjust and optimize audio effects across different occasions and devices, whereas traditional audio systems are often constrained by fixed channel layouts.

7.4 Enhanced Artistic Expressiveness

Immersive sound technology provides music producers with greater creative freedom, enabling the realization of more complex spatial effects and dynamic changes in audio. The application of this technology allows musical works to better express emotions and atmosphere, enhancing artistic expressiveness.

7.5 Adaptability and Compatibility

Immersive sound systems typically have strong adaptability and can be compatible with a variety of audio formats and devices. For example, systems based on Higher Order Ambisonics can be backward compatible with traditional stereo and surround sound formats, while also being able to expand to higher channel counts to meet the needs of different occasions.

These advantages have led immersive sound technology to be increasingly valued in a variety of application scenarios such as theaters, live concerts, movies, and games.

08. How to Evaluate Real-World Results

Evaluating immersive audio in practice is not a one-dimensional exercise. It requires both subjective and objective methods, because a system can measure well and still feel emotionally unconvincing, or feel exciting while hiding technical weaknesses that reduce repeatability.

8.1 Subjective Listening Tests

Subjective listening tests are one of the important methods for evaluating immersive sound effects. This method usually includes:

Listener feedback: Collect listeners' evaluations of sound quality, spatial sense, and immersion through questionnaires or interviews. Listeners can score based on clarity, naturalness, immersion, etc., helping the production team understand the audience's real experience.
Blind test experiments: Compare different audio systems (such as immersive sound versus traditional audio systems) in the same environment to eliminate preconceived biases. Participants evaluate sound quality and spatial positioning without knowing the details, ensuring the objectivity of the results.
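
A blind comparison of this kind can be summarised with very little code. The sketch below is only an illustration: the ratings are invented numbers on an assumed 1 to 5 scale, "A" and "B" are blind labels, and the `summarize` helper is hypothetical.

```python
from statistics import mean

# Invented example ratings on a 1-5 scale; "A" and "B" are blind labels so
# listeners do not know which system is which.
ratings = [
    {"system": "A", "clarity": 4, "naturalness": 3, "immersion": 3},
    {"system": "A", "clarity": 3, "naturalness": 3, "immersion": 2},
    {"system": "B", "clarity": 4, "naturalness": 4, "immersion": 5},
    {"system": "B", "clarity": 5, "naturalness": 4, "immersion": 4},
]


def summarize(rows, attributes=("clarity", "naturalness", "immersion")):
    """Return {system: {attribute: mean score}} for blind-labelled ratings."""
    systems = sorted({row["system"] for row in rows})
    return {
        s: {a: mean(r[a] for r in rows if r["system"] == s) for a in attributes}
        for s in systems
    }


summary = summarize(ratings)
```

The blind labels are mapped back to the actual systems only after the scores have been collected, so that expectations cannot influence the ratings.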

8.2 Objective Measurement Metrics

In addition to subjective evaluation, objective measurement is also an important means of evaluating the effects of immersive sound technology. Commonly used objective metrics include:

Spatial audio metrics: Evaluate the performance of immersive sound systems by analyzing the directionality, distance, and positioning accuracy of sound. These metrics can be measured through microphone arrays and acoustic models, providing quantitative data on sound source positioning and loudness uniformity.
Audio Immersion Index (AII): This emerging evaluation method combines artificial intelligence technology and microphone arrays to objectively evaluate the effects of immersive audio. The AII index considers the spatial characteristics of sound and listener perception, providing a comprehensive evaluation framework.
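
Two of the simpler quantities mentioned above, positioning accuracy and loudness uniformity, can be computed as follows. This is a minimal sketch under assumptions: the function names are hypothetical, directions are plain (x, y, z) vectors, and the uniformity figure is a simple arithmetic mean and standard deviation of dB readings, which is a common simplification rather than a formal standard. It does not implement the AII.

```python
import math
from statistics import mean, pstdev


def angular_error_deg(intended, measured):
    """Angle in degrees between two direction vectors (x, y, z)."""
    dot = sum(a * b for a, b in zip(intended, measured))
    norm = math.hypot(*intended) * math.hypot(*measured)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def level_uniformity_db(seat_levels_db):
    """Return (mean level, standard deviation) across measured seats, in dB."""
    return mean(seat_levels_db), pstdev(seat_levels_db)
```

For example, `angular_error_deg((1, 0, 0), (0, 1, 0))` compares an intended direction with a measured direction at right angles to it, and a smaller standard deviation from `level_uniformity_db` indicates more even coverage across the measured seats.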

With continuous technological advancements, immersive audio technology is developing towards higher precision and broader applications. In the future, with the improvement of computing power and the advancement of audio processing algorithms, immersive audio will be able to achieve more complex sound effect processing on smaller devices, further enhancing the user's auditory experience.

09. Appendix: Ambisonics Technology

Ambisonics remains one of the clearest conceptual bridges between sound-field capture, spatial representation, and flexible playback.

9.1 Ambisonics Concepts

Ambisonics was developed in the UK in the 1970s with the support of the National Research Development Corporation, mainly by Michael Gerzon of the Mathematical Institute at Oxford University and Professor Peter Fellgett of the University of Reading. It was designed to reproduce recordings made with dedicated "sound field" microphones, or 3D surround material created by mixing, through at least four loudspeakers. In essence, it can project an immersive surround sound image that conveys the direction, distance, and height of the recorded sound. Although Ambisonics has a solid technical foundation and many advantages, the system was ahead of its time and has achieved commercial success only recently.

Ambisonics is a full-sphere surround sound format: in addition to the horizontal plane, it covers sound sources above and below the listener. Unlike some other multi-channel surround formats, its transmission channels do not carry loudspeaker signals. Instead, they carry a loudspeaker-independent representation of the sound field, called B-format, which is then decoded for the loudspeaker setup in the venue. This lets producers think in terms of source direction rather than loudspeaker position and gives them considerable freedom in choosing the layout and number of playback loudspeakers. At least 4 loudspeakers are required; using more improves spatial resolution and enlarges the area in which the sound field is accurately reproduced.
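
The decoding step can be illustrated with one simple approach among many: aiming a virtual cardioid microphone at each loudspeaker. The sketch below is an assumption-laden illustration, not a production decoder. It assumes the traditional first-order convention in which W carries the source scaled by 1/√2, uses a hypothetical square layout, and the function name `decode_first_order` is made up.

```python
import math


def decode_first_order(w, x, y, z, speaker_dirs_deg):
    """Decode one B-format sample to loudspeaker feeds.

    Each feed is a virtual cardioid microphone aimed at that loudspeaker.
    Speaker directions are (azimuth, elevation) in degrees, with azimuth
    measured counter-clockwise from straight ahead.
    """
    feeds = []
    for az_deg, el_deg in speaker_dirs_deg:
        az, el = math.radians(az_deg), math.radians(el_deg)
        directional = (x * math.cos(az) * math.cos(el)
                       + y * math.sin(az) * math.cos(el)
                       + z * math.sin(el))
        feeds.append(0.5 * (math.sqrt(2.0) * w + directional))
    return feeds


# Hypothetical horizontal square of four loudspeakers.
square = [(45, 0), (135, 0), (-135, 0), (-45, 0)]

# A unit source straight ahead encodes to W = 1/sqrt(2), X = 1, Y = Z = 0.
feeds = decode_first_order(1 / math.sqrt(2), 1.0, 0.0, 0.0, square)
```

Because only the list of loudspeaker directions changes between venues, the same four B-format channels can feed a different layout without remixing. In this example the two front loudspeakers receive equal feeds that are larger than those of the two rear loudspeakers.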

9.2 Theoretical Foundation: From Mid-Side to Full Sound-Field Thinking

Mid-Side is a widely used stereo recording technique that includes a cardioid microphone pointing forward (Mid) and a figure-8 microphone pointing to the side (Side). The mid microphone captures content primarily coming from the front, and the figure-8 microphone captures sounds from the left and right sides.

Ambisonics can be thought of as the 3D version of Mid-Side. To the figure-8 microphone pointing to the side (left-right axis), it adds a figure-8 microphone pointing to the ceiling (up-down axis) and a figure-8 microphone pointing forward (front-back axis). For the center signal, an omnidirectional microphone, which provides a full mono signal, replaces the cardioid microphone.
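
For readers less familiar with Mid-Side, the standard sum-and-difference relationship between the Mid/Side pair and left/right stereo can be sketched as follows. The function names are hypothetical, signals are plain lists of samples, and the positive lobe of the figure-8 is assumed to point left.

```python
def ms_to_lr(mid, side):
    """Decode Mid-Side sample lists to left/right: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


def lr_to_ms(left, right):
    """Encode left/right back to Mid-Side: M = (L + R) / 2, S = (L - R) / 2."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side
```

First-order Ambisonics extends the same idea: a set of coincident microphone patterns is recorded, and the loudspeaker signals are derived afterwards by weighted sums of those patterns.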

9.3 First-Order Ambisonics and B-Format

The traditional Ambisonics format, known as "first-order Ambisonics (B-format)," uses 4 audio channels:

W: the omnidirectional component, as if the scene were captured with an omni microphone, so this channel contains audio from all directions.
X, Y, Z: three channels corresponding to three figure-8 microphones, pointing front-back (X), left-right (Y), and up-down (Z).

There are many different versions of Ambisonics, but when people talk about Ambisonics they are usually referring to "first-order, B-format".

Can an accurate 3D sound image be obtained with only 4 audio channels? First-order Ambisonics reproduces a 3D sound environment well with only 4 channels, but its spatial resolution is not very high, so every sound is somewhat blurred in direction. Higher Order Ambisonics (HOA) addresses this problem.

9.4 Higher-Order Ambisonics

Higher Order Ambisonics (HOA) adds more audio channels to the 4 channels of first-order Ambisonics in order to increase spatial resolution: second-order Ambisonics has a total of 9 audio channels, third-order has 16 channels, fourth-order has 25 channels, fifth-order has 36 channels, sixth-order has 49 channels, and so on.
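
The channel counts listed above follow the pattern (N + 1)² for full-sphere Ambisonics of order N. A short sketch (the function name `ambisonic_channels` is hypothetical) reproduces them:

```python
def ambisonic_channels(order):
    """Number of channels for full-sphere Ambisonics of a given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


# Orders 1 through 6.
counts = {n: ambisonic_channels(n) for n in range(1, 7)}
```

This quadratic growth is why higher orders quickly become demanding in terms of channel count and loudspeaker count.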

Adding more channels increases the detail of the sound image. The new channels still carry audio signals, but with more complex polar patterns than the first-order figure-8. Mathematicians call these patterns spherical harmonics: just as a sound can be broken down into its harmonics, a three-dimensional soundscape can be broken down into its spherical harmonics.

Mathematical formula: a simple Ambisonic panner (or encoder) receives a source signal S and two parameters, the horizontal angle θ and the elevation angle Φ. In the traditional first-order B-format convention, it distributes the signal over the four channels as follows:

W = S · 1/√2
X = S · cos θ · cos Φ
Y = S · sin θ · cos Φ
Z = S · sin Φ

Because it is omnidirectional, the W channel always receives the same constant input signal regardless of the angles. So that its average energy is roughly the same as that of the other channels, W is attenuated by about 3 dB (to be precise, divided by the square root of 2). The terms for X, Y, and Z produce the polar patterns of figure-8 microphones: we evaluate them at θ and Φ and multiply the result by the input signal. As a result, the input appears in each component exactly as loud as the corresponding microphone would have picked it up.
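
The encoder described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the traditional first-order B-format convention (W scaled by 1/√2, X pointing forward, Y to the left, Z upward) with angles given in degrees and azimuth measured counter-clockwise from straight ahead; the function name `encode_first_order` is hypothetical.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Pan one mono sample into traditional B-format (W, X, Y, Z)."""
    theta = math.radians(azimuth_deg)
    phi = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                    # omni, about 3 dB down
    x = sample * math.cos(theta) * math.cos(phi)   # front-back figure-8
    y = sample * math.sin(theta) * math.cos(phi)   # left-right figure-8
    z = sample * math.sin(phi)                     # up-down figure-8
    return w, x, y, z
```

A source straight ahead (`azimuth_deg=0`, `elevation_deg=0`) lands entirely in W and X, while a source directly overhead (`elevation_deg=90`) lands in W and Z, matching what the corresponding microphones would pick up.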

This first-order truncation is only an approximation of the overall sound field. Higher orders correspond to further terms in the multipole expansion of a function on a sphere, namely spherical harmonics. In practice, higher orders require more speakers for playback, but will improve spatial resolution and expand the area where the sound field is perfectly reproduced (up to the upper limit frequency).