Audio Engine Dev (Part 5): Audio Voices, Optimisation and Localisation

Following on from the previous development post, which covered how we embedded a granular synthesis engine deep within our Unity3D project, here we’re going to talk about the core design approach of our audio framework, and look at the affordances, limitations, and design implications of the Emitter/Speaker paradigm.

Emitters and Speakers, what are they?

The Emitter/Speaker (ES) paradigm is essentially a method of decoupling an object that generates audio data (the emitter) from the object that plays that audio back, its “voice” (the speaker). In the previous post, I explained why we chose to develop ES and summarised how the pieces fit together, which I recommend reading before continuing here. In this post, I’m going to discuss why we bothered going through all this effort, from a lower-level point of view.

An important consideration before we dig in: applications or games that are not particularly audio-focused won’t really benefit from this approach. However, if you’re developing projects that require hundreds or thousands of audio-generating objects to co-exist, the emitter/speaker paradigm is extremely beneficial, if not essential.

Transcending Voices?

You might assume I’m simply being dramatic, and I wouldn’t blame you. If you’ve never really pushed the audio side of Unity3D’s capabilities, you might never have bumped into the engine’s audio voice limit. But if you have, you might find this approach to be helpful.

Unity3D can host a maximum of 4096 “virtual” voices: this is the number of audio sources that the engine can track and process at once. However, the number of voices its DSP engine can actually play back simultaneously, its “real” voice limit, is capped at 255 (defined in Edit > Project Settings > Audio > Real Voices) and defaults to 32.
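For reference, these limits can also be inspected from code. Here’s a minimal sketch that logs the current real and virtual voice counts via Unity’s AudioSettings API; the component name is purely illustrative and not part of our framework.

using UnityEngine;

// Minimal sketch: log the engine's current voice limits at startup.
public class VoiceLimitProbe : MonoBehaviour
{
    void Start()
    {
        AudioConfiguration config = AudioSettings.GetConfiguration();
        Debug.Log($"Real voices: {config.numRealVoices}, " +
                  $"virtual voices: {config.numVirtualVoices}");
    }
}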

Of course, 255 audio voices is quite generous, especially when you imagine the cacophony produced by 255 audio files playing back at the same time. But in scenarios where the audio is created by many overlapping grains of audio, such as ours, you simply wouldn’t have enough voices to output every one of those grains at once. The obvious first option is to output all the grains from a single emitter through a paired speaker. However, by default, Unity3D’s “audio source” is designed to simply play an audio clip; if you want to perform real-time synthesis for an audio source, you can do so using the OnAudioFilterRead function. This is how we started building our system, until we bumped into the massive computational overhead that comes from every “audio source” component requiring an audio voice.
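To make that concrete, here is a minimal sketch of the approach we started with: one AudioSource per emitter, with synthesis happening directly inside OnAudioFilterRead on the audio thread. SynthesiseGrains is a hypothetical stand-in for our granular DSP code, not our actual API.

using UnityEngine;

// One AudioSource per emitter: synthesis runs in OnAudioFilterRead on the audio thread.
[RequireComponent(typeof(AudioSource))]
public class NaiveGrainEmitter : MonoBehaviour
{
    void OnAudioFilterRead(float[] data, int channels)
    {
        // Runs on Unity's single audio thread, and every instance of this
        // component also consumes one of the engine's real audio voices.
        SynthesiseGrains(data, channels);
    }

    void SynthesiseGrains(float[] data, int channels)
    {
        // Placeholder: our granular DSP would write interleaved samples into 'data' here.
        for (int i = 0; i < data.Length; i++)
            data[i] = 0f;
    }
}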

Making things worse, all of the DSP (OnAudioFilterRead) code runs on a single audio thread. Because our project required granular synthesis to generate a truly dynamic audio experience, we very quickly started to overload the DSP thread. We had a glass ceiling to obliterate, and fortunately for you, our research grant gave us the time to dig our heels in and come up with a solution: process the audio separately from the audio thread, then pass it to the “audio source”. In essence, this is the emitter/speaker approach.
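As a rough illustration of that split, the sketch below shows a speaker component that only mixes blocks of samples which were synthesised elsewhere (on the main thread, or later in jobs) and queued up for it. The class and member names are illustrative assumptions, not our production API, and block alignment with the DSP buffer is ignored for brevity.

using System.Collections.Concurrent;
using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class GrainSpeaker : MonoBehaviour
{
    // Blocks of pre-rendered, interleaved samples pushed by paired emitters.
    readonly ConcurrentQueue<float[]> _pendingBlocks = new ConcurrentQueue<float[]>();

    // Called by emitters from outside the audio thread.
    public void EnqueueSamples(float[] block) => _pendingBlocks.Enqueue(block);

    void OnAudioFilterRead(float[] data, int channels)
    {
        // The audio thread performs no synthesis here, only mixing of finished blocks.
        while (_pendingBlocks.TryDequeue(out var block))
        {
            int count = Mathf.Min(block.Length, data.Length);
            for (int i = 0; i < count; i++)
                data[i] += block[i];
        }
    }
}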

Once we had built a system to separate the audio synthesis from the lowly audio thread, we created an audio manager to monitor the positions of all emitters and speakers. The manager dynamically pairs emitters with speakers within range, so that when multiple emitters are positioned close together, they can all share the same speaker. This conserves massive amounts of DSP, and consequently CPU, overhead by limiting the amount of processing performed on the audio thread.
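Here’s a simplified sketch of that pairing pass. The GrainEmitter component and its AssignSpeaker method are hypothetical stand-ins; the real manager also spawns and retires speakers dynamically, which is omitted here.

using UnityEngine;

public class EmitterSpeakerManager : MonoBehaviour
{
    public float pairingRange = 5f;     // assumed maximum emitter-to-speaker distance
    public GrainEmitter[] emitters;     // hypothetical emitter component
    public GrainSpeaker[] speakers;

    void Update()
    {
        foreach (var emitter in emitters)
        {
            GrainSpeaker best = null;
            float bestDistance = pairingRange;

            // Pair each emitter with the nearest speaker within range,
            // so closely grouped emitters end up sharing a single voice.
            foreach (var speaker in speakers)
            {
                float distance = Vector3.Distance(emitter.transform.position,
                                                  speaker.transform.position);
                if (distance < bestDistance)
                {
                    bestDistance = distance;
                    best = speaker;
                }
            }

            emitter.AssignSpeaker(best);  // hypothetical: null means "out of range"
        }
    }
}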

DOTS, DOTS, more DOTS

Without going into detail, this paradigmatic approach to audio synthesis and voice allocation lined up perfectly with a new and exciting approach to design in Unity3D called DOTS. The Data-Oriented Technology Stack is essentially multi-threading for Unity3D: processes that would normally occur on a single thread can be distributed across all of the CPU’s threads in what are called “systems”. In our case, the emitter game objects create grain entities, which have their audio data crunched in parallel by the audio DSP systems across every thread of the CPU.
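To give a flavour of what that looks like, here’s a heavily simplified, hypothetical job that synthesises one grain per index in parallel using Unity’s job system and Burst. Our actual system builds grain entities and processes them in ECS systems; this sketch only illustrates the data-oriented shape, and the sine-wave DSP is a placeholder.

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

[BurstCompile]
public struct GrainSynthJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> grainPitches;   // one parameter per grain (placeholder)
    public int samplesPerGrain;
    public int sampleRate;

    // Each grain writes only its own disjoint slice of the output buffer.
    [NativeDisableParallelForRestriction]
    public NativeArray<float> outputSamples;

    public void Execute(int grainIndex)
    {
        int start = grainIndex * samplesPerGrain;
        float pitch = grainPitches[grainIndex];

        // Placeholder DSP: a quiet sine tone stands in for the full grain chain.
        for (int i = 0; i < samplesPerGrain; i++)
            outputSamples[start + i] = math.sin(2f * math.PI * pitch * i / sampleRate) * 0.1f;
    }
}

// Scheduling (e.g. from a system or MonoBehaviour):
// var handle = new GrainSynthJob { /* ... */ }.Schedule(grainCount, 64);
// handle.Complete();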

Of course, this isn’t your dad’s code; using DOTS requires a huge conceptual shift from the standard Object-Oriented Programming (OOP) approach that most contemporary environments lean on. In a way, it’s similar to how shaders on a graphics card work, where each pixel is processed alone, unable to access the outside world or talk to its neighbour. Likewise, the multi-threaded nature of DOTS asks you to forget many OOP conveniences and re-imagine how you approach audio synthesis and DSP.

But, while DOTS isn’t the easiest approach to wrap your head around, after seeing it crunch thousands of unique audio grains for the first time without our CPU batting an eyelid, we knew what we were put on this earth to do…

Here’s a demo scene of our system hosting hundreds of emitters that are dynamically allocated to speakers, producing several thousand grains per second in real time. Each grain has its own unique sonic parameters and its own DSP chain, processed independently.

Future Optimisation: Dynamic emitter/speaker pairing based on human audio perception

The work described above was an excellent first step in building an extremely efficient and capable audio synthesis engine that embeds itself deeply within an immersive and interactive virtual environment. However, we can (and we will) optimise it further by considering the thresholds of human hearing.

At best, a human can localise the position of a sound source to within approximately 1 degree (Rouat, 2008). This is a very small arc for nearby sounds; however, the further away a sound source is, the larger this arc gets, and therefore the less accurately our ears can pinpoint its exact location. Furthermore, this 1 degree of localisation accuracy was determined in a controlled environment with the sound source directly in front of the participants, and accuracy was found to degrade to around 20 degrees at the participant’s left and right flanks (Rouat, 2008). So unless pinpoint localisation is the top priority for the project, it’s safe to assume that accuracy can be relaxed below this upper limit for the sake of computational performance.

Using simple geometry, we can calculate the distances required to successfully reproduce sonic localisation. Consider that 1 degree at 1m creates an arc of perception around 17.4mm wide; in other words, an object 1 metre away only has to move 17.4mm before your ears notice. At 10m, the same angle gives an arc of about 175mm. We can use this to our advantage when building the framework, to optimise our load on the CPU.

radius (r) = 10m
angle (Θ) = 1°
circumference (C) = 2π * r = 6.283 * 10m = 62.83m

arc (A)
A = Θ / 360 * C
A = 0.00278 * 62.83
A = 0.174m (upper limit of the perception of sound localisation @ 10 meters)

What this means is that, from a computational perspective, having multiple audio voices in Unity to output audio sources that sit within the same arc of perception is unnecessary. As mentioned above, this angle could be increased dramatically depending on the project’s aim and purpose. While the scientifically determined limit of sonic localisation sits at around 1 degree, perhaps 10 degrees is totally acceptable in a fast-moving and highly dynamic virtual environment. If so, a sound source at 10m could be processed and output through an audio voice within 1.75m of it, instead of a mere 0.175m.

radius (r) = 10m
angle (Θ) = 10°
circumference (C) = 62.83m

arc A = Θ / 360 * C
arc A = 0.02778 * 62.83
arc A = 1.74m 
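As a quick helper for thinking about this, the snippet below mirrors the arithmetic above: given the distance to a sound source and a chosen localisation tolerance, it returns the arc length within which a shared speaker could sit without the listener noticing. The class and method names are purely illustrative.

using UnityEngine;

public static class LocalisationArc
{
    // arc = (θ / 360) * 2πr
    public static float ArcLength(float distanceMetres, float toleranceDegrees)
    {
        return toleranceDegrees / 360f * 2f * Mathf.PI * distanceMetres;
    }
}

// e.g. ArcLength(10f, 1f) ≈ 0.175m, ArcLength(10f, 10f) ≈ 1.75m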

And let’s not forget that the localisation threshold widens to around 20 degrees to the far left and right of the observer, so this number could be dynamically increased depending on the rotation of the observer’s head. Of course, this is speculation at present, but it’s a promising direction for further investigation to optimise the system.

If you’ve read this far, hopefully you’re getting a grasp of the benefits of employing an emitter/speaker system, and of the kinds of projects that would make the most of this DSP approach within an interactive and immersive environment.

Next up, Brad will dig into DOTS and explain a little more about why it’s critical to the emitter/speaker paradigm.

References

Rouat, J 2008, ‘Computational auditory scene analysis: principles, algorithms, and applications (Wang, D. and Brown, G.J., eds.; 2006) [book review]’, IEEE Transactions on Neural Networks, vol. 19, no. 1, p. 199.
