Let’s talk about stereo and mono source material in spatial audio. 
You have a virtual world. You use a spatial audio plugin to realistically place sounds in that world. Your spatializer is great at placing mono sound sources in that world. Some plugins will allow you to use a stereo file, but they combine both channels down to mono before spatializing. So why can’t you use stereo files? It’s easy to imagine placing two virtual speakers in the virtual world. Can’t I just play my stereo file into those virtual speakers and have it sound the same as stereo speakers in the real world? Everyone knows stereo is better than mono!

Let’s take a quick look at why this doesn’t quite work, and some examples.

Spoiler alert: maybe, just maybe, it sometimes does work?


The left-right position of a sound in a stereo recording is determined by volume and phase differences across two channels of audio. Multi-channel surround sound formats like 5.1 or 7.1 create surround imaging using the same principles. Spatial audio using binaural spatialization is different. It simulates how each ear of a listener would hear a sound coming from a certain location. The direction, distance, and height of the sound’s location in a virtual world are all accounted for. This realistic simulation combines complex filtering, micro-delays, and attenuation over distance. Different spatializers add all kinds of special tricks on top of those basics, too.
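To make that contrast concrete, here’s a minimal Python sketch (not taken from any particular spatializer): a standard constant-power pan law, which places a sound using only level differences, next to a rough interaural time difference from Woodworth’s spherical-head approximation, one of the extra binaural cues a spatializer layers in. The head radius and speed of sound are assumed averages.

```python
import math

def constant_power_pan(pan):
    """Stereo panning: place a sound between two speakers using only
    level differences. pan = -1.0 (hard left) .. +1.0 (hard right)."""
    angle = (pan + 1.0) * math.pi / 4.0       # 0 .. pi/2
    return math.cos(angle), math.sin(angle)   # (left gain, right gain)

def woodworth_itd(azimuth_deg, head_radius=0.0875, speed_of_sound=343.0):
    """Binaural cue: rough interaural time difference in seconds, from
    Woodworth's spherical-head formula. head_radius is an assumed
    average in meters; a real spatializer adds HRTF filtering on top."""
    theta = math.radians(azimuth_deg)
    return (head_radius / speed_of_sound) * (theta + math.sin(theta))

# A centered pan splits energy equally between the channels...
l, r = constant_power_pan(0.0)
print(round(l, 3), round(r, 3))  # 0.707 0.707

# ...while a source 90 degrees to one side reaches the far ear
# roughly 0.66 ms late -- a cue level-based panning never produces.
print(round(woodworth_itd(90.0) * 1000, 2), "ms")  # 0.66 ms
```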

Spatializers work really hard to make an audio file sound like it is coming from a specific point in space. This is basically the opposite of what makes stereo or 5.1 recordings so great. Their biggest feature is being able to place sounds *not* exactly where the speaker is. You can pan that sound anywhere along an invisible line between speakers. This mismatch of goals is what makes multi-channel audio a problematic sound source in spatial audio.

So here’s the thing: nearly every spatializer guide will say to use only mono files. A technical recommendation like that might imply a strictly technical reasoning. Mathematically, we would expect all kinds of phasing issues and general weirdness from using stereo files.
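That phasing worry is easy to demonstrate with pure tones. In this hypothetical sketch, summing two identical channels reinforces the signal, while a half-cycle offset between them, the kind of inter-channel delay spatialized playback can introduce, cancels it almost completely:

```python
import math

SR = 48000  # assumed sample rate

def sine(freq, n, delay_samples=0):
    """n samples of a sine tone, optionally delayed by whole samples."""
    return [math.sin(2 * math.pi * freq * (i - delay_samples) / SR)
            for i in range(n)]

def peak(xs):
    return max(abs(x) for x in xs)

n = SR // 100  # 10 ms
left = sine(1000, n)

# Identical channels sum cleanly (double amplitude)...
in_phase = [l + r for l, r in zip(left, sine(1000, n))]

# ...but a half-cycle offset at 1 kHz cancels almost to silence.
half_cycle = SR // (2 * 1000)  # 24 samples
out_of_phase = [l + r for l, r in zip(left, sine(1000, n, half_cycle))]

print(round(peak(in_phase), 2))     # 2.0
print(round(peak(out_of_phase), 2)) # 0.0
```

Real stereo material cancels only partially and only at some frequencies, which is why the result is usually described as “general weirdness” rather than silence.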


It may be, however, that your particular stereo sound would actually spatialize without any noticeable phasing issues. You can test this by splitting your stereo file into two mono files (one for left, one for right). Drop them in your scene as two sound objects and space them apart, like virtual speakers. What you may find, though, is that the limitations are sometimes more about expectations and purpose. We like stereo audio files because the added imaging gives depth and realism to complex sounds. In their own small way, they sound more immersive than mono audio. So why can’t we just make them *extra* immersive by using stereo files in our spatialized scene?
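If you want to run that experiment yourself, splitting a stereo file into two mono files takes only a few lines of standard-library Python. This sketch assumes a 16-bit PCM WAV; the file paths are placeholders:

```python
import struct
import wave

def split_stereo(path, left_path, right_path):
    """Split a 16-bit PCM stereo WAV into two mono WAVs, one per
    channel -- e.g. to place them as two 'virtual speaker' objects
    in a spatializer. Paths here are placeholders."""
    with wave.open(path, "rb") as src:
        assert src.getnchannels() == 2 and src.getsampwidth() == 2
        framerate = src.getframerate()
        frames = src.readframes(src.getnframes())

    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    left = samples[0::2]   # frames are interleaved L R L R ...
    right = samples[1::2]

    for out_path, channel in ((left_path, left), (right_path, right)):
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(framerate)
            dst.writeframes(struct.pack("<%dh" % len(channel), *channel))
```

Most DAWs and audio editors can do the same split, of course; the point is just that the two mono files together contain everything the stereo file did.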


Well, it won’t quite work like that. First off, yes, there is the potential for strange phasing. But that’s boring. Let’s talk about the creative issues. When listening to stereo speakers, one thing you don’t see people do is get up and walk around to appreciate the stereo sound from different places in the room. Unfortunately, that’s exactly what happens when you spatialize your stereo sound in a world that people can move around in. If there’s a *sweet spot* when sitting in front of a stereo system, then is everything else a *bitter spot*?

Imagine the player moves directly between these two virtual speakers. When facing forward, they will hear basically normal stereo. But rotate left 90 degrees, and now the left channel is 100% in front and the right channel is 100% behind. The stereo image is now flattened into a straight line running through the listener, which ruins the depth and presence of the original, and sounds mono. It’s a bit like taking 2D artwork and dropping it flat into a 3D scene. From some angles it still mostly works, but it falls apart at others.

Listen to this example of a stereo recording of rain, spatialized as two mono objects for left and right. As the listener rotates towards these objects, the image collapses. The HRTF filtering is also more noticeable since two similar objects are being processed at opposite angles. You can still get a phantom center effect at the right angles, but it isn’t as convincing as spatializing a mono sound object right in the center of something. Let’s try this with an example of a medium-sized machine’s motor.


We have a great stereo recording for our machine motor. There are some unique elements in the left and right channels, but there’s also a phantom center. We first spatialize the left and right channels at the edges of the machine’s visuals (in this case, the white block). This mostly works, and has the nice effect of sounding like there are slightly different machine elements at work on the two sides of the object. The center presence is weak, though, especially if we move away from an ideal position. This means the heart of our machine doesn’t quite sound like it’s really *right there*.

The stereo channels that once painted a broad picture of sound now seem to shrink down to two disconnected mono emitters. In some spatializers you can improve this a little if there are options for spatial spread or volume (size, not loudness). But ultimately, it still feels like trying to force something to work that doesn’t.

If we place the mono version of this sound right in the middle of the object, now it feels really there. I admit, though, that I miss some of the width and variation from the stereo approach. If this machine is used as a non-interactive decoration on a wall, the stereo option might be nice. If the machine is a direct part of the gameplay, you will want it to be more present, so the mono option is the better choice.


Okay, so what would be the absolute best-case scenario for spatializing a stereo audio file as stereo? Well, if it’s a recording of something actually made up of multiple parts, and the left and right channels more or less correspond to those two parts. This is, of course, because that’s almost like having two separate mono recordings, but maybe with a little shared center material that might work to blend the sound field a bit.

As an extreme example, I like the effect of spatializing this stereo recording of truck window washers. As I move closer to the window, the sound realistically gets wider. This wouldn’t happen with a centered mono recording. You also expect something flat like a window to have the image collapse to mono at the far angles. If we want to use our stereo recordings in a 3D world, they might work best in settings that are a little flatter. That may be why the flat window washers work well, while an omnipresent thunderstorm will sound thin and ineffective.

And here is an example of a truck engine revving. It’s another large object, so a single mono recording sometimes feels insufficient. The stereo recording sounds nice in some moments, and thin in others as the angle flattens the image of a sound that we think of as large and enveloping. I think this is a case where several distinct mono recordings would be best.


My takeaway from these examples leads me to two conclusions:

1. The larger the object, the more sound objects it needs.

This sounds obvious, but is worth stating. If the object in your world is smaller than your head, a single mono sound to represent it is probably all you need. If your object is a giant spaceship, you’ll want to place multiple sounds in accurate locations for engines, vents, hums, etc. If you don’t have a ton of good recordings, or don’t have the processing budget to run lots of spatialization, you can also use things like the shared Reverb sends in the DearVR Unity asset to give more size and depth to fewer sources.

2. Stereo recordings might still be useful if your object is experienced from a limited perspective similar to the original stereo image, or if the object itself is mostly two-dimensional.

As we discovered, the more you can move around a sound in your spatial world, the more you can lose the benefits of a stereo recording. If you are only ever able to face it from one direction, though, stereo files might still be useful. Similarly, if your object is mostly 2D, like a screen, window, or flat surface, a spatialized stereo recording might work for that as well.


Ambisonics are a 3D audio format as well, so do you need to build everything in your ambisonic mix from mono sources? Thankfully, no. One key difference between ambisonics and spatialized audio is how they are affected by player movement.

With a spatialized sound, the player can move all around it, or the object itself can move freely. Ambisonics are more like a skybox, and remain at a stable distance around the listener. You can never walk outside of the ambisonics sphere, and you can’t flatten or shrink the imaging. This means stereo files are perfectly fine to use as source material in ambisonic files. You’ll need to watch for phase issues in the same way that you would if you were mixing a stereo file down to mono for traditional recordings, but the playback results will be much more predictable.
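A quick way to screen a stereo source for those phase issues is a correlation check: values near +1 mean the channels sum safely to mono, values near 0 suggest a wide, decorrelated image, and negative values warn of cancellation. Here’s a minimal sketch using a plain Pearson correlation over the whole file; real correlation meters compute this over short sliding windows:

```python
import math

def phase_correlation(left, right):
    """Pearson correlation between two channels: +1 = identical
    (mono-safe), near 0 = decorrelated/wide, negative = likely
    cancellation when the channels are summed."""
    n = len(left)
    ml = sum(left) / n
    mr = sum(right) / n
    cov = sum((l - ml) * (r - mr) for l, r in zip(left, right))
    var_l = sum((l - ml) ** 2 for l in left)
    var_r = sum((r - mr) ** 2 for r in right)
    return cov / math.sqrt(var_l * var_r)

sr = 48000
tone = [math.sin(2 * math.pi * 440 * i / sr) for i in range(sr // 100)]

print(round(phase_correlation(tone, tone), 2))                # 1.0
print(round(phase_correlation(tone, [-s for s in tone]), 2))  # -1.0
```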


In the end, spatial audio best accommodates mono source material. Stereo recordings are likely to cause issues. But like any rule, you can bend it creatively for certain situations if you understand why it’s a rule and how to take creative advantage of a limitation. Don’t be afraid to try things that shouldn’t work. Sometimes that’s how you discover a new technique.