Figuring out: Ambisonics

Here are my notes about the world of Ambisonics. This is a new area to me so, following this blog’s philosophy, I will try to learn by explaining. Take this as an introduction to the subject.

The basic idea

We usually think about audio formats in terms of channels, with mono and stereo being the most basic and widely used ones. If we open up the 2D space even more, we get surround audio like 5.1 and 7.1. The last step is to use the full 3D space, and that’s where Ambisonics comes in.

The more complexity and channels we have, the harder it is to make systems compatible with each other. To solve this, Ambisonics transcends the idea of channels and uses the concept of sound fields, which represent planes of audio in 3D space.

This allows us to keep the aural information in a “speaker arrangement agnostic format” that can be decoded into any number of speakers at the time of reproduction.

M/S Format

These planes of audio are represented in a special format called B-Format. You can think of this format as a natural extension of the M/S format so let’s start with that.

To get an M/S recording, we first use a figure-of-eight microphone facing sideways to the source (this is the “side”). This microphone will pick up the stereo information. At the same time, we use a cardioid microphone facing the source (this is the “mid”).

When we want to decode these signals into stereo, we just need to sum mid and side to obtain the left channel, and then sum mid and the side with polarity reversed to obtain the right channel. If you think about it, you realize that the “side” signal is basically a representation of the difference between left and right.

But why would we want to record things this way? Why not just record in stereo with an X/Y technique or similar? Recording in M/S has a few advantages. First, we get automatic mono compatibility, since we have the mid signal which we can use without fear of the phase cancellations that would happen if we summed the channels of an X/Y recording. Additionally, since we can decode the M/S recording into stereo after the fact, we can control how wide we want the resulting stereo signal to be by adjusting the balance between mid and side during decoding.
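Since decoding is just sums and differences, it is easy to sketch in code. Here is a minimal numpy example; the 0.5 scaling and the width parameter are my own choices for illustration, not part of any standard:

```python
import numpy as np

def ms_decode(mid: np.ndarray, side: np.ndarray, width: float = 1.0):
    """Decode a mid/side pair into left/right.

    width scales the side signal: 0.0 collapses the image to mono,
    1.0 keeps the recorded balance, values above 1.0 widen it.
    """
    s = side * width
    left = 0.5 * (mid + s)     # mid plus side
    right = 0.5 * (mid - s)    # mid plus the side with polarity reversed
    return left, right

# toy signals: one second of "mid" and "side" material at 48 kHz
sr = 48_000
t = np.arange(sr) / sr
mid = np.sin(2 * np.pi * 440 * t)
side = 0.3 * np.sin(2 * np.pi * 220 * t)
left, right = ms_decode(mid, side, width=1.5)
```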

B-Format

Ambisonics takes this concept and pushes it into the next dimension, making it 3D by using additional channels to represent height and depth. B-Format is then built with the following channels:

  • W: Contains the sound pressure information, similar to the mid signal in M/S. This is recorded with an omnidirectional microphone.

  • X: Contains the front minus back pressure gradient. Recorded by a figure of eight microphone.

  • Y: Contains the left minus right pressure gradient, similar to the side signal in M/S. Recorded by a figure of eight microphone.

  • Z: Contains the top minus bottom pressure gradient. Recorded by a figure of eight microphone.

Note: A-Format is what we call the raw audio from an ambisonic recording, that is, the individual signals from each microphone, while B-Format is used once we have combined all these signals into a single set.
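To make those four channels more concrete, here is a small sketch that encodes a mono source into first-order B-format from its direction, using the traditional convention where W is attenuated by 1/√2. Other conventions (such as AmbiX) order and scale the channels differently, so take this as illustrative rather than definitive:

```python
import numpy as np

def encode_first_order(mono: np.ndarray, azimuth_deg: float, elevation_deg: float):
    """Encode a mono signal into traditional first-order B-format (W, X, Y, Z).

    Azimuth is measured counter-clockwise from straight ahead, elevation upwards.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    w = mono / np.sqrt(2.0)               # omni pressure component (attenuated by convention)
    x = mono * np.cos(az) * np.cos(el)    # front/back figure-of-eight
    y = mono * np.sin(az) * np.cos(el)    # left/right figure-of-eight
    z = mono * np.sin(el)                 # up/down figure-of-eight
    return w, x, y, z

# e.g. place a source 90 degrees to the left, 30 degrees above the horizon:
# w, x, y, z = encode_first_order(signal, azimuth_deg=90, elevation_deg=30)
```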

Ambisonic Orders

The top row shows the W component, while the second one shows X, Y and Z. Additional rows show higher ambisonic orders for higher resolutions.

Using the B-Format described above works but comes with some drawbacks. The optimal listening area (the sweet spot) is quite small, and results won’t sound very natural outside of it. Also, diagonal information is not very accurate, since it has to be inferred from the boundary between planes.

A solution to these issues is to increase the resolution by adding more selective directional components. Instead of using traditional polar patterns, these use other, more specific ones, resulting in a signal set that contains denser aural information.

There is really no theoretical limit to how many additional microphones we can add to improve the resolution, but of course there are clear practical limits. For example, a third order Ambisonics set uses 16 tracks, so it is easy to see how hard drive space and microphone placement can quickly become a problem.
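For reference, a full-sphere Ambisonics set needs (order + 1)² channels, which is where the 16 tracks above come from:

```python
for order in range(1, 7):
    channels = (order + 1) ** 2    # full-sphere (periphonic) channel count
    print(f"order {order}: {channels} channels")
# order 1: 4 channels, order 2: 9, order 3: 16 ... order 6: 49
```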

Decoding B-Format

Regardless of the number of ambisonic orders we use, the important thing to keep in mind is that the resulting recording will not be channel dependent. We can build the sonic information at any point on a 3D sphere just by knowing the angles to that point.

This allows us to create virtual microphones in this 3D space, which we can then match with any number of speakers. This is very powerful because once we have an ambisonic recording we can play it on any speaker configuration, preserving as much of the spatial information as the reproduction system allows.
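As a sketch of what a “virtual microphone” means in practice, this is how you could point a first-order virtual mic anywhere in the sphere from the W, X, Y and Z signals (matching the encoding convention used above; the pattern parameter p is my own naming):

```python
import numpy as np

def virtual_mic(w, x, y, z, azimuth_deg, elevation_deg, p=0.5):
    """First-order virtual microphone aimed at (azimuth, elevation).

    p = 1.0 gives an omni pattern, p = 0.5 a cardioid, p = 0.0 a figure-of-eight.
    A simple speaker decode is just one virtual microphone per speaker direction.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    directional = (x * np.cos(az) * np.cos(el)
                   + y * np.sin(az) * np.cos(el)
                   + z * np.sin(el))
    return p * np.sqrt(2.0) * w + (1.0 - p) * directional

# e.g. a basic quad decode: four virtual cardioids at +/-45 and +/-135 degrees azimuth
# front_left = virtual_mic(w, x, y, z, azimuth_deg=45, elevation_deg=0)
```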

If the final user is on headphones, a binaural signal would be the result of the decoding, while the same source files can be used to decode a 3D Dolby Atmos mix for cinemas.

Nowadays, you can find a big selection of ambisonic plugins for your DAW, so you can play around with B-Format files, including encoding and decoding them to and from any multichannel format you can imagine.

Use in media

Ambisonics was created in the 70s but has never been used much in mainstream media. This is now changing with the advent of VR experiences where 3D audio makes a lot of sense since the user can move around the scene focusing on different areas of the soundscape.

In the area of cinematic experiences, Ambisonics achieves a similar result to Dolby Atmos or Auro-3D but using different methods. See my article about Atmos to learn more about this.

Google’s Resonance Audio allows you to use ambisonics in Fmod.

Regarding video games, or interactive audio in general, ambisonics is a great fit. You can implement B-Format files in middleware like Fmod or Wwise and also in game engines like Unity. This gives you the most flexibility, since the ambisonic material will be decoded in real time into whatever the user is using to reproduce audio, and this decoding will react in real time to their position and direction, which is particularly awesome for VR.

In closing

There is much more to learn about this, so I hope I get the chance to work with Ambisonics soon. I’m sure there are many details to keep in mind once you are hands-on with these formats, and I will try to document what I learn on this blog as I go.

Figuring out: Shepard Tone

The Shepard Tone is an interesting audio illusion that creates the impression of an always rising or falling pitch that never really gets anywhere. Despite feeling like it is always going up or down, it is forever stuck in an eternal auditory fractal. How is this possible? And could it be useful for sound design?

The Penrose Stairs is a nice visual equivalent. Depending on perspective, it looks like they are always going up or down but we are really just going in circles.


History & Working Principles

Roger Shepard described this idea in his 1964 paper “Circularity in Judgments of Relative Pitch”. The original concept was conceived in a musical context, with pitches jumping in discrete steps, aka notes.

He basically stated that to create an apparently always rising melody, we would need a circular or looping pattern consisting of sets of ascending notes that are faded in and out with specific timing.

So we would start with just one ascending scale of notes. This scale will get to the end of the instrument range pretty soon so the trick is to sneak in a new set of notes doing the same thing but fading them in slowly as we fade out the previous set.

If we do this with the proper timing, we get the feeling of an eternally ascending scale.

You can see that working in the following example:

As you can see, the key to getting this effect is to use volume to mask the replacement of older octaves with newer octaves that always start from a lower tone, giving the illusion of an overall ascending tone despite the average pitch staying constant.

Here is the same concept written in MIDI in Pro Tools. Volume is indicated by the velocity bars below. The lower green notes are rising in volume, while the blue higher ones are fading out. The central red notes stay at the same volume. Again, the average pitch is always the same, but the dynamic changes in the 3 scales provide the illusion of eternal ascension.

Later on, Jean-Claude Risset created a continuous version, called the Shepard-Risset glissando where the pitch glides without discrete jumps, making the overall effect more convincing and seamless.

In this case, the principles stay the same: there is always a new octave gradually fading in to replace the octave that is gradually fading out. This version can be much more useful for sound design, although to achieve it we need instruments or synths that can glide smoothly through different pitch values.
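To make the principle concrete, here is a minimal numpy sketch of a Shepard-Risset glissando: octave-spaced sine partials drift upwards together, wrap around, and a raised-cosine loudness envelope keeps the bottom and top partials quiet so the wrap is masked. All the constants are arbitrary choices:

```python
import numpy as np

SR = 44100        # sample rate in Hz
DUR = 20.0        # seconds of audio to generate
F_MIN = 20.0      # frequency of the lowest partial
N_OCT = 8         # number of simultaneous octave-spaced partials
LOOP = 5.0        # seconds it takes each partial to rise one octave

t = np.arange(int(SR * DUR)) / SR
out = np.zeros_like(t)

for k in range(N_OCT):
    # each partial drifts upward by one octave every LOOP seconds and wraps
    # around, so a new one fades in at the bottom as another leaves the top
    pos = (k + t / LOOP) % N_OCT
    freq = F_MIN * 2.0 ** pos
    # raised-cosine loudness envelope over the log-frequency axis: quiet at
    # both extremes, loudest in the middle, which masks the wrap-around
    amp = 0.5 * (1.0 - np.cos(2.0 * np.pi * pos / N_OCT))
    phase = 2.0 * np.pi * np.cumsum(freq) / SR
    out += amp * np.sin(phase)

out /= np.max(np.abs(out))
# to listen: scipy.io.wavfile.write("glissando.wav", SR, (out * 32767).astype(np.int16))
```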

Risset also tried to apply the concept to rhythm, layering different versions of a beat at proportional tempos (30, 60, 120 & 240, for example) and fading them in and out to create the illusion of an ever-accelerating rhythm. Check out these examples created by music researcher Dan Stowell; below you can see how one of them looks in the spectrogram. Notice the upwards pattern and how the different versions fade in and out.


Building our own designs

Now that we have a good idea of how these effects work, let’s see if we can get creative and build a Shepard-Risset tone that could be useful in a sound design context.

I first tried using Native Instruments’ Form since it is a sample-based synthesizer where you can use any sample as a source. I used this tutorial as a starting point.

Basically, you trigger several octaves at once and use two LFOs, one to control the pitch so it’s always rising and a second to control the oscillator level so it rises and then falls. Also, I adjusted the general envelope so sounds have a long attack and release just so they blend together as they come and go. This is the result just using a sine wave:

It basically works but the overlapping is a bit noticeable. I then tweaked the timing and started to play with different sounds:

Form only gives you 30 minutes to demo the plugin, so I decided to use the limited time to go for one of the most obvious applications of the Shepard tone: an engine ramping up or down. Here are some of the ones I came up with. Keep in mind that the advantage of generating these is that you have an infinite amount of acceleration and deceleration, which can be very handy for later editing.

All of them are quite obvious; you can tell where the sound is re-starting and new octaves are fading in. I think you could fix this by playing with the volume values (although I don’t know if an LFO is the most comfortable way of doing this) or maybe by using more octaves.

This last example is interesting because, on top of the Shepard effect, I was changing the length of the sample to enhance the feeling of acceleration: as the sample gets shorter, the engine feels like it is going faster. I tried to play around with the plugin, kind of driving in real time. This could also have interesting applications for video games.

After this, my demo expired and I felt I didn’t have enough time to improve the effect and play around with the settings. So I looked for an alternative and, after some failed experiments, I found “Endless Series”, a specialized Shepard plugin by Oli Larkin.

It offers two synthesiser modes plus four audio processing modes so you can create Shepard tones from scratch or using an audio sample as a base.

There is also a nice number of parameters you can tweak to customise the result. So let’s hear some of the tones I got from this plugin.

First, here are some examples just using the synthesizer built into the plugin. You can create a discrete or a continuous (glissando) tone (Example 1). In the case of a discrete or stepped tone, you can use several different musical scales. A chromatic scale will give you the classic Shepard feel (Example 2) but you can also play with other, more exotic ones. In Example 3 below I tried creating a dreamy, impressionist whole-tone version. It’s cool, but that last one doesn’t have much of the rising-pitch feeling.

The plugin also works nicely if you want to create engines. Here are a couple of examples.

As for the audio processing mode, there are different effects that you can apply. The simplest mode, am input, just applies the Shepard processing to the sample as far as I can tell. It works in a strange way with tonal content: the Shepard effect is not very pronounced and it adds a descending tone for some reason, which doesn’t help. Here is this mode applied to just a sine wave:

This same mode goes nuts with noisier content. Here is another example using an engine sound. As you can hear, the am input mode introduces a lot of noise and artefacts. I tried playing around with the settings and using other source material but I could not make it sound clean. I don’t know if I’m missing something.

But there are two other modes that can give better results. There is a flanger and a phaser setting. As you can hear, they sound much cleaner, although the effect is quite mild in the case of the phaser. I just wish there was a way to have a “sheparded” sound as clean as this but without the flanger effect on top.

In summary, I feel that I didn’t find the perfect “Shepard Machine”, but I’m sure there are other options out there. I was also thinking that there is probably no plugin that can do everything perfectly (sample-based and synth-based modes, musical options, etc.), so an array of different plugins may be needed for different purposes.

Use in media

Shepard tones have been used in several music and film projects, sometimes in a subtle way, other times quite explicitly.

In music, they can give a very trippy and psychedelic feel (see Pink Floyd’s “Echoes” below) or they can be used to create rising tension (used extensively in movies like “Dunkirk” or “Flight”). As some of my examples above showed, they can also be used to create fantasy or sci-fi vehicle engines. In “The Dark Knight”, Nolan wanted the batpod to feel like an unstoppable force that doesn’t even shift gears, which sounds like a perfect use case for the Shepard tone.

But probably my favourite example, and maybe this is just nostalgia, is in Super Mario 64, which features an endless staircase that you need to overcome to get to the final boss. The game gives you the illusion of an eternal ascent but you are just running on a “virtual treadmill” and getting nowhere. Analogously, the music uses a Shepard tone to achieve the same effect, an apparent ascension that is really just circular. A great example of a Shepard tone used in an interactive environment.

Figuring out: Audio Pull up/down

When working with video, an audio pull up or pull down is needed when there’s been a change in the picture’s frame rate and you need to tweak the audio to make sure it stays in sync.

This subject always seems to be surrounded by a layer of mysticism and confusion, so this is my attempt to go through the basics and hopefully get some clarity.

Audio Sampling Rate

First, we need to understand some basic digital audio concepts. Feel free to skip this if you have it fresh.

Whenever we convert an audio signal from analogue to digital, all we are doing is checking where the waveform is at certain “points” in its oscillation. These “points” are usually called samples.

In order to get a faithful signal, we need to sample our waveforms many times. The number of times we do this per second is the sampling rate, measured in Hertz.

Keep in mind that if our sampling rate is not fast enough, we won’t be able to “capture” the higher frequencies, since these fluctuate faster than we can measure. So how fast do we need to be for accurate results?

The Nyquist-Shannon sampling theorem gives us a very good answer. It basically says that we need a sampling rate of about twice the highest frequency we want to capture. Since the highest frequency humans can hear is around 20 kHz, a sampling rate of 40 kHz should suffice. With that in mind, let’s see the most commonly used sampling rates:

Sampling Rate   Use
8 kHz           Telephones, walkie-talkies
22 kHz          Low-quality digital audio
44.1 kHz        CD quality, the music standard
48 kHz          The standard for professional video
96 kHz          DVD & Blu-ray audio
192 kHz         DVD & Blu-ray audio; usually the highest sampling rate for professional use

As you can see, most professional formats use a sampling rate higher than 40 kHz to guarantee that we capture the full frequency spectrum. Something that is important to remember, and that will become relevant later on, is that a piece of audio will always have the same length as long as it is played back at the same sample rate at which it was recorded.
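A quick sanity check of that last point, assuming a one-minute recording made at 48 kHz:

```python
sr_recorded = 48_000
n_samples = sr_recorded * 60      # one minute of audio recorded at 48 kHz

print(n_samples / sr_recorded)    # 60.0 s when played back at 48 kHz
print(n_samples / 50_000)         # 57.6 s if the same samples are played at 50 kHz
print(n_samples / 44_100)         # ~65.3 s if played at 44.1 kHz
```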

For the sake of completeness, I just want to mention audio resolution (or bit depth) briefly. This is the other parameter that we need to take into consideration when converting to digital audio. It measures how many bits we use to encode the information of each of our samples. Higher values give us more dynamic range, since a bigger range of intensity values is captured. This doesn’t really affect the pull up/down process.

Frames per second in video

Let’s now jump to the realm of video. There’s a lot to be said on the subject of frame rate but I will keep it short. This value is simply how many pictures per second are put together to create our film or video. 24 frames per second (or just fps) is the standard for cinema, while TV uses 25 fps in Europe (PAL) and 29.97 fps in the US (NTSC).

Keep in mind that these frame rates are different not only on a technical level but also on a stylistic level. 24 fps “feels” cinematic and “premium”, while the higher frame rates used in TV can sometimes feel “cheap”. This is probably a cultural perception and is definitely changing. Video games, which often use high frame rates like 60 fps and beyond, are partially responsible for this taste shift. The amount of motion is also very important: higher frame rates are better at showing fast motion.

But how can these different frame rates affect audio sync? The problem usually starts when a project is filmed at a certain rate and then converted to a different one for distribution. This would happen if, for example, a movie (24 fps) is brought to European TV (25 fps), or an American TV programme (29.97 fps) is brought to India, which uses PAL (25 fps).

Let’s see how this kind of conversion is done.

Sampling Rate vs Frame Rate

Some people think that audio can be set to be recorded at a certain frame rate the same way it can be set to be recorded at a certain sampling frequency. This is not true. Audio doesn’t intrinsically have a frame rate value the way it has a bit depth and a sampling rate.

If I give you an audio file and nothing else, you could easily figure out the bit depth and sampling rate, but you would have no idea about the frame rate used on the associated video. Now, and here comes the nuanced but important point, any audio recorded at the same time as video will sync with the specific frame rate used when recording that video. They will sync because they were recorded together. They will sync because what the camera registered as a second of video was also a second of audio in the sound recorder. Of course, machines are not perfect and their clocks may measure a second slightly differently, and that’s why we connect them via timecode, but that’s another story.

This session is set at 24 fps, so each second is divided into 24 frames.

Maybe this confusion comes from the fact that when you create a new session or project in your DAW, you basically set three things: sampling rate, bit depth and frame rate. So it feels like the audio that is going to live inside it will have those three intrinsic values. But that is not the case with frame rate. In the context of the session, frame rate only tells your DAW how to divide a second. Into 24 slices? That would be 24 fps. Into 60 slices? That’s 60 fps.

In this manner, when you bring your video into your DAW, the video’s burnt-in timecode and your DAW’s timecode will be perfectly in sync, but none of this changes anything about the duration or quality of the audio within the session.

So, in summary, an audio file only has an associated frame rate in the context of the video it was recorded with or to, but this is not an intrinsic characteristic of the audio file and cannot be determined without the corresponding video.

Changing Frame Rate

A frame rate change is usually needed when the medium (cinema, TV, digital…) or the region changes. There are two basic ways of doing this. One of them does it without changing the final duration of the film, usually by re-distributing, duplicating or deleting frames to accommodate the new frame rate. I won’t go into detail on these methods, partly because they are quite complex but mostly because if the length of the final picture is not changed, we don’t need to do anything to the audio. It will be in sync anyway.

Think about this for a second. We have changed the frame rate of the video but, as long as the final length is the same, our audio is still in sync, which kind of shows you that audio has no intrinsic frame rate value. Disclaimer: this is true as long as the audio and film are kept separate. If audio and picture are on the same celluloid and you start moving frames around, obviously you are going to mess up the audio, but in our current digital age we don’t need to worry about this.

The second method is the one that concerns us. This is when the length of the picture is actually changed. This happens because it is the easiest way to fix the frame rate difference, especially if it is not very big.

Telecine. How video frame rate affects audio.

Let’s use the Telecine case as an example. Telecine is the process of transferring old-fashioned analogue film to video. This usually also implies a change in frame rate, although not always. As we saw earlier, films are traditionally shot at 24 fps. If we want to broadcast a film on European television, which uses the PAL system at 25 fps, we would need to go from 24 to 25 fps.

The easiest way to do this is to just play the original film 4% faster. The pictures will look faster and the movie will finish earlier, but the difference is tolerable. Also, if you can show the same movie in less time on TV, that gives you more time for commercials, so win-win.

What are the drawbacks? First, showing the pictures 4% faster may be tolerable but it is not ideal and can be noticeable in quick action sequences. Second, and more importantly, now our audio will be out of sync. We can always fix this by also playing the audio 4% faster (and this would traditionally be the case, since audio and picture were embedded on the same film), but in that case the pitch will be increased by about 0.68 semitones.

In the digital realm, we can achieve this by simply playing the audio at a different sample rate than the one it was recorded at. This would be the digital equivalent of just cranking the projector faster. Remember before when I said that an audio file will always have the same length if it is played at the same sample rate as it was recorded? This is when that becomes relevant. As you can see below, if we play a 48 kHz file at 50 kHz, we get the same speed-up effect that a change from 24 to 25 fps provides.

This would solve our sync problems, but as we were saying, it would increase the final pitch of the audio by about 0.68 semitones.
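Putting the numbers together, here is a small helper of my own (not a standard tool) that works out the speed ratio, the equivalent playback sample rate for a 48 kHz file and the resulting pitch shift for a given frame rate change. Note that the exact 25/24 ratio is closer to 4.2%, which is why the precise figure comes out around 0.71 semitones rather than the rounded 0.68:

```python
import math

def pullup_params(src_fps: float, dst_fps: float, sample_rate: int = 48_000):
    """Speed ratio, equivalent playback sample rate and pitch shift for a
    frame rate change that alters the length of the picture."""
    ratio = dst_fps / src_fps                # > 1 means the picture now plays faster
    playback_rate = sample_rate * ratio      # play the original file back at this rate
    semitones = 12 * math.log2(ratio)        # resulting pitch shift
    return ratio, playback_rate, semitones

print(pullup_params(24, 25))       # film to PAL: ~1.042, 50000 Hz, ~ +0.71 semitones
print(pullup_params(24, 23.976))   # film to NTSC: ~0.999, 47952 Hz, ~ -0.02 semitones
```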

That increase in pitch may sound small but it can be quite noticeable, especially in dialogue or musical sections. So how do we solve this? For many years the simple answer was: nothing. Just leave it as it is. But nowadays we are able to re-pitch the resulting audio so it matches its original sound or, alternatively, we can directly change the length of the audio file without affecting the pitch. More on these methods later, but first let’s see what happens if, instead of doing a reasonable jump from film to PAL, we need to go from film to NTSC.

Bigger frame rate jumps, bigger problems (but not for us).

If a jump from 24 to 25 is a 4% change, a jump from 24 to 29.97 would be a whopping 24.9%. That’s way too much and it would be very noticeable. Let’s not even think about the audio: everybody would sound like a chipmunk. So how is this accomplished? The method used is what is called a “2:3 pulldown”.

Now, this method is quite involved, so I’m not going to explain the whole thing here, but let’s see the basics and how it affects our audio. First, let’s start with 30 fps, as this was the original frame rate for TV in NTSC. This makes sense because the electrical grid works at 60 Hz in the States. But things were bound to get messy: after color TV was introduced and for reasons you can see here, the frame rate had to be dropped by 1/1000th to 29.97.

A 2:3 pulldown uses the proportion of frames and the interlaced nature of the resulting video to make 4 frames fit into 5. This is because a 24/30 proportion would be equal to a 4/5 proportion. Again, this is complex and goes beyond the scope of this article but if you want more details this video can help.

But wait, we don’t want to end up with 30 frames, we need 29.97, and this is why the first step is to slow down the film from 24 fps to 23.976. This difference is impossible to detect but crucial to make our calculations work. Once this is done, we can do the actual pulldown, which doesn’t change the length of the film any further; it only re-arranges the frames.

What does all this mean for us, audio people? It means that we only need to worry about that initial change from 24 to 23.976, which is just a 0.1% change. That’s small, but it will still throw your audio out of sync over the length of a movie. So we just need to adjust the speed in the same way we do for the 4% change. If you look again at the picture above, you’ll see that 0.1% is the change we need to go from film to NTSC.

As for the change in pitch, it will be very small but we can still correct it if we need with the methods I show you below. But before that, here is a table for your convenience with all the usual frame changes and the associated audio change that would be needed.

Frame Rate Change   Audio Speed Change   Pitch Correction (if needed)
Film to PAL         4% up                4% down // 96% // -0.71 semitones
Film to NTSC        0.1% down            0.1% up // 100.1% // +0.02 semitones
PAL to Film         4% down              4% up // 104% // +0.68 semitones
PAL to NTSC         4.1% down            4.1% up // 104.1% // +0.68 semitones
NTSC to Film        0.1% up              0.1% down // 99.9% // -0.02 semitones
NTSC to PAL         4.1% up              4.1% down // 95.9% // -0.89 semitones

Techniques & Plugins

There are two basic methods to do a pull up or pull down. The first involves two steps: first changing the duration of the file (which also changes its pitch, using a different sample rate as explained before) and then applying pitch correction to match the original’s tone. The way to actually do the first step depends on your DAW but in Pro Tools, for example, you’ll see that when importing audio you have the option to apply SRC (Sample Rate Conversion) to the file as pictured above.

The second method is simply doing it all at once with a plugin capable of changing the length of an audio file without affecting its pitch.

Also, keep in mind that these techniques can be applied not only to the final stereo or surround mix file but also to the whole session itself, which gives you much more flexibility to adjust your mix for the new version. This makes sense because a 4% change in speed could be enough to put two short sounds too close together, and/or the feel of the mix could be a bit different. Personally, I have only used this “whole session” technique with shorter material like commercials. Here is a nice blog post that goes into detail about how to accomplish this.

As for changing a mixed file as a whole, whether you use a one-step or a two-step method, you will probably find that it is easy to introduce glitches, clicks and pops into the mix. Sometimes you get dialogue that sounds metallic. Phase is also an issue, since the time/pitch processing is not always consistent between channels.

The thing is, time/pitch shifting is not an easy thing to accomplish. Some plugins offer different algorithms to choose from depending on the type of material you have. These are designed with music in mind, not dialogue, so “Polyphonic” is usually the best option for whole mixes. Another trick you can use is to bounce your mix into stems (music, dialogue, FX, ambiences, etc.) and then apply the shift to each of them independently, using the best plugin and algorithm for each. This can be very time consuming but will probably give you the best results.

As you can see, this whole process is kind of tricky, particularly the pitch shift step, and this is why on some occasions the audio is corrected for sync but left at the wrong pitch. Nevertheless, nowadays we have better shifting plugins to do the job. Here are some of the most commonly used, although remember that none of these works perfectly in every situation:

-Zplane Elastique: This is in my opinion the best plugin and the one I personally use. It produces the least artefacts, keeps phase coherent and works great on whole mixes, even with single step processing.
-Pro Tools Pitch Shift: This is the stock time/pitch plugin that comes with Pro Tools. It is quite fast but is prone to create artifacts.
-Pro Tools X-Form: This one is more advanced (it comes bundled with Pro Tools Ultimate) but it still suffers from some issues like giving dialogue a metallic tone or messing up the phase on stereo and surround files. Also, it is slow. Veeeery slow.
-Serato Pitch n Time: I haven’t tried this one, but I had to mention it since it is very commonly used and people swear by it.
-Izotope Time & Pitch: It can work well sometimes and offers many customizable settings that you can adjust to avoid artefacts.
-Waves Sound Shifter: I haven’t used it, but it’s another option that seems to work well for some applications.

Which one should you choose? There is no clear answer, you will need to experiment with some of them to see what works for each project. Here is a good article and video comparing some of them.

Conclusions

I hope you now have a somewhat better understanding of this messy subject. It is tricky on both a theoretical and a practical level, but I believe it is worth figuring out where things come from instead of just doing what others do without really knowing why. Here are some takeaways:

  • Sampling rate and bit depth are intrinsic to an audio file.

  • At the same time, an audio file can be associated to a certain video frame rate when they are both in sync.

  • The frame rate change process is different depending on the magnitude of the change.

  • An audio pull up or pull down is needed when there is a frame rate change on the picture that affects its length.

  • The pull up/down can be done in two steps (length change first, then pitch correction) or it can be done in a single step.

  • Time/Pitch Shift is a complicated process that can produce artefacts, metallic timbres and phase issues.

  • Mixes can be processed by stems or even as whole sessions for more flexibility.

  • Try different plugins and algorithms to improve results.

Thanks for reading!

Figuring out: Measuring Loudness

How loud is too loud?

There are many loudness standards nowadays, and many types of media and platforms, so making sure audio is at the correct level everywhere can be tricky. In this post, I’m going to talk about the history of measuring loudness and the standards that we use nowadays.

The analogue days

The first step to measure loudness is to define and understand the fundamental nature of the decibel. Luckily, I wrote a post last year about this very subject so you may want to check that before diving into loudness.

So, now that you are familiar with the dB, let’s think about how we can best use it to measure how loud audio signals are.

In the analogue days, reading audio levels always implied measuring voltage or power in a signal and comparing it to a reference value. When trying to determine how loud an audio signal is, we can just measure these values across time but the problem is that levels are usually changing constantly. So how do we best represent the overall level?

A possible approach would be to just measure the highest value. This method of measuring loudness is called Peak and is handy when we want to make sure we are not working with levels above the system’s capacity, so our signals don’t saturate. But in terms of measuring the general level of a piece of audio, this approach can be very deceiving. For example, a very quiet signal with a sudden loud transient would register as loud despite being quiet as a whole.

As you are probably thinking, a much better method would be to measure an average value across a certain time window instead of the instant reading that peak meters provide. This is usually called RMS (root mean square) metering and it is much closer to how we humans perceive loudness.
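To illustrate the difference in modern (digital) terms, here is a tiny numpy comparison of a peak reading and an averaged (RMS) reading for a quiet signal that contains a single loud click; the numbers are arbitrary:

```python
import numpy as np

def peak_db(x: np.ndarray) -> float:
    return 20 * np.log10(np.max(np.abs(x)))         # highest instantaneous value

def rms_db(x: np.ndarray) -> float:
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))  # average level over the whole window

# a very quiet signal with one loud transient in the middle
sr = 48_000
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(sr)   # one second of quiet noise
x[sr // 2] = 1.0                     # a single full-scale click

print(peak_db(x))   # ~0 dB: the peak meter screams "loud"
print(rms_db(x))    # ~-39 dB: the averaged reading is much closer to how it feels
```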

Let’s have a look at some of the meters that were created:

Real audio signal (grey) and how a VU meter would interpret it (black).

VU (Volume Unit) meters are probably the most used meters in analogue equipment. They were designed in the 1940s to measure voltage with a response time similar to how we naturally hear. The method is surprisingly simple: the needle’s own weight slows down its movement by around 300 ms on both the attack and the release, so very sudden changes are softened. The time that the meter needs to start moving is usually called the integration time. You will also hear the term “ballistics” to describe these response times.

The PPM (peak programme meter) is a different type of meter that has been widely used in the UK and Scandinavia since the 1930s. Unlike the VU meter, the PPM uses very short attack integration times (around 10 ms for Type II and 4 ms for Type I) while using relatively long times for the release (around 1.5 seconds for a 20 dB fall). Since these integration times are very short, they were often considered quasi-peak meters. The long release time helped engineers see peaks for a longer time and get a feel for the overall levels of a programme, since levels would fall slowly after a loud section.

The Dorrough Loudness Meter is also worth mentioning. It combines an RMS and a peak meter in one unit and was very common in the 90s. Combining an RMS and a peak meter in a single unit would become a trend that carries on to this day.



The dawn of Digital Audio

As digital audio started to become the new industry standard, new ways to measure audio levels needed to be adopted. But how do we define what 0 means in the digital realm? In analogue audio, the value we assign to 0 is usually some meaningful measure that helps us avoid saturating the audio chain. These values used to be measured in volts or watts and would vary depending on the context and type of gear. For example, for studio equipment in the US, 0 VU corresponds to +4 dBu (1.228 V), while Europe’s 0 VU is +6 dBu (1.55 V). Consumer equipment uses -10 dBV (0.3162 V) as its 0 VU. As you can see, the meaning of 0 VU is very context dependent.

In the case of digital audio, 0 dB is simply defined as the loudest level that can flow through the converters before clipping, that is, before the waveform is deformed and saturation is introduced. We call this definition of the decibel dBFS (decibels relative to full scale). How digital audio levels correspond to analogue levels depends on how your converters are calibrated, but usually 0 VU is equated to around -20 dBFS on studio equipment.

Fletcher-Munson curves showing frequency sensitivity for humans. How cool would it be to see the equivalent curves for other animals, like bats?


The platonic loudness standard

Since dBFS is only a scale in the digital world, we still need to find a way to measure loudness in a human-friendly way within digital audio. As we have seen, this is usually accomplished by averaging audio levels across a certain time window. On the other hand, digital audio also needs precision when measuring peaks if we want to avoid saturation when converting audio between analogue and digital and vice versa.

Something else that we need to take into consideration for our standard is the fact that we are not equally sensitive to all frequencies, as the Fletcher-Munson curves show. We are not very sensitive to low or very high frequencies, so if we want our loudness measurements to be accurate, this is something that needs to be accounted for.

So, I have laid out everything that we need our loudness standard to have. Does such a thing exist?


The ITU BS.1770 standard

This document was presented by the ITU (International Telecommunications Union) in 2006 and fits all the criteria we were looking for. The ITU BS.1770 is really a collection of technologies and protocols designed to measure loudness accurately in a digital environment. It is really a set of recommendations, we could say.

Four revisions have been released at the time of this writing plus the ITU BS.1771 which also expands on the same ideas. For simplicity, I will refer to all of these documents as simply the ITU BS.1770 or just ITU.

The loudness unit defined by the ITU is the LKFS, which stands for “Loudness, K-weighted, relative to Full Scale”. This unit combines a weighting curve (named “K”) to account for frequency sensitivity with an averaged or RMS measurement that uses a 400 ms time window. The ITU also defines a “true peak” meter as a peak meter that uses oversampling for greater accuracy.

Once the ITU released their recommendations, each region used them as the foundation for their own standards. As the ITU released new updates, each region would incorporate some of these ideas while expanding on them. Let’s see some regional standards.


EBU R128, Time Windows & Gates

This is the standard in use in Europe and it is released by the EBU (European Broadcast Union).

Before I continue, a clarification. The EBU names the loudness unit LUFS (Loudness units relative to full scale) instead of LKFS as the former complies better with scientific naming conventions. So if you see LUFS, keep in mind that this is pretty much the same as LKFS. On the other hand you will also see LU (Loudness Units). This is simply a relative unit that is used when comparing two LUFS or two LKFS values.

In the R128 standard, four different time windows are defined, based on the ITU BS.1771 recommendation. A meter needs to have all of these, plus some other features (see below), to be considered capable of operating in “EBU Mode”.

  • True-Peak: Almost instantaneous window with sub-sample accuracy.

  • Momentary: 400 ms window. Useful to get an idea of how loud a particular sound is. Plugins usually offer different scale options.

  • Short Term: 3 seconds window. Gives a good feel of how loud a particular section is.

  • Integrated or Programme: Indicates how loud the whole programme is over its entire length. Sometimes it’s also called “Long Term”.

Why so many different time windows? In my opinion, they are useful when working on a mix since they give you information at different levels of resolution. True-peak tells you whether you would saturate the converters, and it is good practice to always keep some headroom here. The momentary measurement is more or less similar to what VU meters would indicate, and gives you information about a particular short section. I personally don’t look at the momentary meter much because any mix with a decent amount of dynamic range is going to fluctuate here quite a bit. Nevertheless, it is useful to make sure that the mix is not very far away from the target levels in some specific sections.

Short term is maybe a better tool to get a solid feel for how loud a scene is. This measurement is going to fluctuate, but not as much as the momentary value. In order to get a mix within the standards, you need to make sure the short term value is usually around the target level, but you don’t need to be super accurate with this. What I try to do is find a compromise between the level that feels right and my target level and, when in doubt, I favor what feels right.

Finally, the integrated or long term value has a time window with the size of the whole show. This is the value that is going to tell you the overall level and measuring it in a faithful way is tricky as you will see below.

So, I was mentioning “target levels”. Which levels? The EBU standard recommends audio to be at -23 LUFS ±0.5 LU (±1 LU for live programmes). We are talking here about the integrated measurement, so the level for the entire show. Additionally, the maximum true peak value allowed is -1 dBTP. And that would be pretty much it, although there is one more issue as I was saying. Measuring levels throughout a long length of time in a consistent way comes with some challenges.

This is because there is usually a main element that we want to make sure is always easy to hear (usually dialogue or narration), and since audio volume is logarithmic, that main element will pretty much carry 90% of the show’s loudness weight. So we would naturally mix this element to already be at the desired loudness or slightly below. The problem comes when considering all the other elements around the dialogue. If there are too many quiet moments, that is going to make our integrated level quite low, since everything is averaged.

The solution would be to either push the level of the whole show or re-mix the level of the dialogue louder so the integrated value is correct. Either way that would probably make the dialogue too loud and we would also risk saturating the peak meter. Not ideal.

Nugen’s VisLM plugin operating in EBU mode. You can see all the common EBU features including all time windows, loudness range and a gate indicator.

In order to fix this, R128 uses the recommendations from the revised ITU BS.1770-3. Integrated loudness is calculated using a relative gate that effectively pauses the measurement when levels drop below a threshold of -10 LU relative to an un-gated measurement. There is also an absolute gate at -70 LUFS; nothing below this value is considered for the measurement. These gates help us get a more meaningful result, since only the relevant audio in the foreground is considered when measuring the integrated loudness.
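Here is a rough numpy sketch of that two-gate idea, assuming you already have the K-weighted loudness of each 400 ms block (the K pre-filter and the block segmentation themselves are left out):

```python
import numpy as np

def gated_integrated_loudness(block_loudness_lufs):
    """Two-stage gating over 400 ms block loudness values (in LUFS),
    roughly following the BS.1770-3 / R128 scheme."""
    blocks = np.asarray(block_loudness_lufs, dtype=float)

    # absolute gate: anything below -70 LUFS is ignored entirely
    blocks = blocks[blocks > -70.0]

    # ungated average (power average of the blocks, converted back to LUFS)
    ungated = 10 * np.log10(np.mean(10 ** (blocks / 10)))

    # relative gate: keep only blocks within 10 LU of the ungated average
    kept = blocks[blocks > ungated - 10.0]
    return 10 * np.log10(np.mean(10 ** (kept / 10)))

# long stretches of quiet room tone barely move the result:
print(gated_integrated_loudness([-23, -24, -22, -55, -58, -80]))  # ~ -23 LUFS
```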

The last concept I wanted to mention is loudness range or LRA. This is measured in LU and indicates how much the overall levels change throughout the programme, in a macroscopic view. You can think of this as an indication of the dynamic range of your mix: low values would indicate that the mix has a very constant level while higher values would appear when there is a larger difference between quiet and loud moments. The EBU doesn’t recommend any given target value for the loudness range since this would depend on the nature of the show but it is for sure a nice tool to have to get an idea of your overall mix dynamics.


ATSC A/85

This is the standard used in the US and is released by the ATSC (Advanced Television Systems Committee). It uses LKFS units (remember that LKFS and LUFS are virtually equivalent) and similar time windows to the European ones. The recommended integrated value is -24 LKFS, while the maximum peak value allowed is -2 dBTP.

When the first version was released in 2009, this standard recommended a different method for calculating the integrated value. As you know, the EBU system uses a relative gate in order to only consider foreground audio for its measurements, but the ATSC took a different approach. Remember when I said before that mixes usually have some main element (often dialogue) that forms the centre of the mix?

The ATSC called this main element an “anchor”. Since dialogue is usually this anchor, the system used an algorithm to detect speech and would only consider that to calculate the integrated level. I’ve done some tests with both Waves WLM and Nugen VisLM and the algorithm works pretty well; the integrated value doesn’t even budge when you are just monitoring non-dialogue content, although singing usually confuses it.

In fact, in the 2011 update, the ATSC standard started differentiating between regular programmes and commercials. Dialogue-based gating would be used for the former, while all the elements in the entire mix would be considered for the latter. This was actually one of the main goals of the ITU standard initially: to avoid commercials being excessively loud in comparison to the programmes themselves.

Nevertheless, the ATSC updated the standard again in 2013 to follow the ITU BS.1770-3 directives, and from then on all content would be measured using the same two-gate method Europe uses. Because of this, I was tempted to just avoid mentioning all this ATSC history mess, but I thought it was important to explain it so you can understand why some loudness plugins offer so many different ATSC options.

Here you can see the ATSC options on WLM. The first two would be pre-2013, using either dialogue detection or the whole mix to calculate the integrated level. The third, called “2013”, uses the gated method à la Europe.

TV Regional and National Standards

Now that we have a good idea of all the different characteristics standards use, let’s see how they compare.

Country / Region   Standard                           Units Used   Integrated Level   True Peak   Weighting   Integrated Level Method
Europe             EBU R128                           LUFS         -23 LUFS           -1 dBTP     K           Relative gate
US                 ATSC A/85 post 2013                LKFS         -24 LKFS           -2 dBTP     K           Relative gate
US                 ATSC A/85 pre 2013 (Commercials)   LKFS         -24 LKFS           -2 dBTP     K           All elements are considered
US                 ATSC A/85 pre 2013 (Programmes)    LKFS         -24 LKFS           -2 dBTP     K           Dialogue detection
Japan              TR-B32                             LUFS         -24 LUFS           -2 dBTP     K           Relative gate
Australia          OP-59                              LKFS         -24 LKFS           -2 dBTP     K           Relative gate

As you can see, currently, there are only small differences between them.

Loudness for Digital Platforms

I have tried to find the specifications for some of the most used digital platforms but I was only able to find the latest Netflix specs. Hulu, Amazon and HBO don’t specify their requirements, or at least not publicly. If you need to deliver a mix to these platforms, make sure they send you their desired specs. In any case, using the latest EBU or ATSC recommendations is probably a good starting point.

In the case of Netflix, their specs are quite curious. They ask for an integrated level of -27 LKFS and a maximum true peak of -2 dBTP. The method to measure the integrated level is dialogue detection, like the ATSC used to recommend, which in a way is a step back. Why would Netflix recommend this if the ATSC spec moved on to gate-based measurements? Netflix basically says that when using the gated method, mixes with a large dynamic range tend to leave dialogue too low, so they propose a return to the dialogue detection algorithm.

The thing is, this algorithm is old and can be inaccurate, so this decision was controversial. A new, modern and more robust algorithm could be a possible solution for these high dynamic range mixes. Also, -27 LKFS may sound too low, but it wasn’t chosen arbitrarily: it was based on the fact that this was the level where dialogue would usually end up on these mixes. If you want to know more about this, you can check this, this and this article.

Loudness for Theatrical Releases

The case of cinema is very different from broadcast for a very simple reason: you can expect a certain homogeneity in the reproduction systems that you won’t find in home setups. For this reason there is no hard loudness standard that you have to follow.

Dolby Scale   SPL (dBC)
7             85
6.5           83.33
6             81.66
5.5           80
5             78.33
4.5           76.66
4             75
3.5           65

This lack of a general standard has resulted in a loudness war similar to the one in the music mixing world. The result is lower dynamic range and many complaints about cinemas being too loud. Shouldn’t cinema mixes offer a bigger dynamic range experience than TV? How are these levels determined?

Cinema screens have a Dolby box where the projectionist sets the general level. These levels are determined by the Dolby scale and correspond to SPL measurements using a C-weighting curve and the “Dolby noise”. Remember that, in the broadcast world, the K curve is used instead, which doesn’t help when trying to translate between the two.

Nowadays more and more cinemas are automated. This means that levels are set via software or even remotely. At first, all cinemas used level 7, which is the one recommended by Dolby, but as movies were getting louder and people complained, projectionists started to use lower levels. 6, 5 and even 4.5 are used regularly. In turn, mixers started to work at those levels too, which resulted in louder mixes overall in order to get the same feel. This, again, made cinemas lower their levels even more.

You see where this is going. To give you an idea, Eelco Grimm, together with Michel Schöpping, analyzed 24 movies available in Dutch cinemas and found levels that varied wildly. The integrated level went from -38 LUFS to -20 LUFS, with the maximum short-term level varying from -29 LUFS to -8 LUFS and the maximum true-peak level varying from -7 to +3.5 dBTP. Dialogue levels varied from -41 to -25 LUFS. That’s quite a big difference; imagine if that were the case in broadcast.

The thing is that, despite these numbers being very different, we have to remember that all these movies were probably played at different levels on the Dolby scale. Eelco says in his analysis:

  • The average playback level for movies mastered at '7' is -28 LUFS (-29 to -25).

  • The average playback level for movies mastered at '6.3' is -23 LUFS (-25 to -21). They are projected 3 dB softer, so if we corrected the average to a '7' level, it would be -26 LUFS.

  • The average playback level for movies mastered at '5' is -20 LUFS (all were -20). They are projected 7 dB softer, so the corrected average would be -27 LUFS.

So, as you can see, in the end the dialogue level is equivalent to about -27 LUFS in all cases; the only difference is that the movies mixed at 7 (which is the recommended level) would have greater dynamic range, something important to be able to give a cinematic feel that TV can’t provide. The situation is quite unstable and I hope a solid solution based on the ITU recommendations is implemented at some point. If you want to know more about this whole issue and read the paper that Eelco Grimm released, check this comprehensive article.

Loudness standards for video games.

Video games are everywhere: consoles, computers, phones, tablets, etc., so there is no clear standard to use. Having said that, some companies have established some guidelines. Sony, through their ASWG-R001 document, recommends the following:

  • -23 LUFS and -1 dBTP for PlayStation 3 and 4 games.

  • -18 LUFS and -1 dBTP for PS Vita games.

  • The maximum loudness range recommended is 20 LU.

But how do you measure the integrated loudness in a game? Integrated loudness was designed for linear media, so Sony’s document recommends making measurements in 30-minute sessions that are a good representation of different sections of the game.

So, despite games being so diverse in platforms and contexts, using the EBU recommendations for consoles and PC (-23 LUFS) and a louder spec for mobile and portable games (-18 LUFS) would be a great starting point.

Conclusions and some plugins.

I hope you now have a solid foundation of knowledge on the subject. Things will keep changing, so if you read this in the future, assume some of this information is outdated. Nevertheless, you will hopefully have learned the concepts you need to work with loudness now and in the future.

If you want to measure loudness, many DAWs (including Pro Tools) don’t have a built-in meter that can read LUFS/LKFS, but there are plugins to solve this. I recommend that you try both Waves WLM and Nugen VisLM. If you can’t afford a loudness plugin, you can try Youlean, which has a free version and is a great one to start with.
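If you prefer to check files outside your DAW, the third-party pyloudnorm Python package implements the BS.1770 measurement. A quick sketch, with a placeholder file name:

```python
import soundfile as sf          # pip install soundfile pyloudnorm
import pyloudnorm as pyln

data, rate = sf.read("final_mix.wav")     # placeholder file name

meter = pyln.Meter(rate)                  # BS.1770 meter (K-weighting + gating)
integrated = meter.integrated_loudness(data)

target = -23.0                            # EBU R128 integrated target, in LUFS
print(f"Integrated loudness: {integrated:.1f} LUFS "
      f"({integrated - target:+.1f} LU relative to the R128 target)")
```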

Thanks for reading!

Figuring out: Gain Staging

What is it?

Gain staging is all about managing the audio levels of the different stages within an audio system. In other words, when you need to make something louder, good gain staging is knowing where in the signal chain it would be best to do this.

I will focus this article on the realm of mix & post-production work in Pro Tools, since this is what I do daily, but these concepts can be applied to any other audio-related situation, like recording or live sound.

Pro Tools Signal Chain

To start with, let's have a look at the signal chain in Pro Tools:

Clip Gain → Clip Effects → Inserts (pre-fader) → Fader (with Trim and VCAs acting on it) → Post-fader Sends → Sub Mix Bus → Output

Knowing and understanding this chain is very important when setting your session up for mixing. Note that other DAWs vary in their signal chain. Cubase, for example, offers pre- and post-fader inserts, while in Pro Tools every insert is always pre-fader except for the ones on the master channel.

Also, I've added a Sub Mix Bus (an aux track) at the end of the chain because this is how mixing templates are usually set up, and it is important to keep it in mind when thinking about signal flow.

So, let's dive into each of the elements of the chain and see their use and how they interact with each other.

Clip gain & Inserts

As I was saying, on Pro Tools, inserts are pre-fader. It doesn't matter how much you lower your track's volume, the audio clip is always hitting the plugins with its "original" level. This renders clip gain very handy since we can use it to control the clip levels before they hit the insert chain.

You can use clip gain to make sure you don't saturate the input of your first insert and to keep the level consistent between different clips on the same track. This last use is especially important when audio is going through a compressor, since you want roughly the same amount of signal being compressed across all the different clips on a given channel.

So what if you want a post-fader insert? As I said, you can't directly change an insert to post-fader, but there is a workaround. If you want to affect the signal after the track's volume, you can always route that track or tracks to an aux and have the inserts on that aux. In this case, these inserts would be post-fader from the audio channel's perspective, but don't forget they are still pre-fader from the aux channel's own perspective.

Signal flow within the insert chain

Since the audio signal flows from the first to the last insert, when choosing the order of these plugins it is always important to think about the goal you want to achieve. Should you EQ first? Compress first? What if you want a flanger, should it be at the end of the chain or maybe at the beginning?

I don't think there is a definitive answer and, as I was saying, the key is to think about the goal you have in mind and whichever way makes conceptual sense to your brain. EQ and compression order is a classic example of this.

The way I usually work is that I use EQ first to reduce any annoying or problematic frequencies, having also a high pass filter most of the time to remove unnecessary low end. Once this is done, I use the compressor to control the dynamic range as desired. The idea behind this approach is that the compressor is only going to work with the desired part of the signal.

I sometimes add a second EQ after the compressor for further enhancements, usually boosting frequencies if needed. Any other special effects, like a flanger or a vocoder would go last on the chain.

Please note that, if you use the new Pro Tools clip effects (which I do use), these are applied to the clip before the fader and before the inserts.

Channel Fader

After the insert chain, the signal goes through the channel fader or track volume. This is where you usually do most of the automation and levelling work. Good gain stage management makes working with the fader much easier. You want to be working close to unity, that is, close to 0.

This means that, after clip gain, clip effects and all inserts, you want the signal to be at your target level when the fader is hovering around 0. Why? This is where you have the most control, headroom and comfort. If you look closely at the fader you'll notice it has a logarithmic scale. A small movement near unity might mean 1 or 2 dB, but the same movement further down could be a 10 dB change. Mixing close to unity makes subtle and precise fader movements easy and comfortable.
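Since every stage we have covered so far (clip gain, insert make-up gain, fader, trim) just adds or subtracts decibels, it can help to think of the chain numerically. A toy sketch with made-up values:

```python
def db_to_linear(db: float) -> float:
    return 10 ** (db / 20)

# made-up gain staging for one clip on one track (all values in dB)
clip_gain = -6.0        # tame a hot clip before it reaches the inserts
insert_makeup = +3.0    # e.g. make-up gain on a compressor in the insert chain
fader = -1.5            # riding close to unity, where the fader is most precise
trim = +1.0             # a gentle trim pass on top of the existing automation

total_db = clip_gain + insert_makeup + fader + trim
print(f"net gain: {total_db:+.1f} dB (x{db_to_linear(total_db):.2f} in amplitude)")
```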

Sends

Pro Tools sends are post-fader by default, and this is the behaviour you want most of the time. Sending audio to a reverb or delay is probably the most common use for a send, since you want to keep 100% of the dry signal and just add some wet processed signal that changes in level as the dry signal changes.

Pre-fader sends are mostly useful for recording and live mixing (sending a headphone mix is a typical example) and I don't find myself using them much in post. Nevertheless, a possible use in a post-production context could be when you want to work with 100% of the wet signal regardless of how much of the dry signal is coming through. Examples of this could be special effects and/or very distant or echoey reverbs where you don't want to keep much of the original dry signal.

Channel Trim

Trim is pretty much like effectively having two volume lanes per track. Why would this be useful? I use trim when I already have an automation curve that I want to keep but I just want to make the whole thing louder or quieter in a dynamic way. Once you finish a trim pass, both curves would coalesce into one. This is the default behaviour but you can change it on Preferences > Mixing > Automation.

VCAs

VCAs are a concept that comes from analogue consoles (Voltage Controlled Amplifier) and they allow you to control the level of several tracks with a single fader. They used to do this by controlling the voltage reaching each channel, but in Pro Tools, VCAs are a special type of track that doesn't have audio, inserts, inputs or outputs. VCA tracks just have a volume lane that can be used to control the volume of any group of tracks.

So, VCAs are something that you usually use when you want to control the overall level of a section of the mix as a whole, like the dialogue or sound effects tracks. In terms of signal flow, VCAs are just changing a track level via the track's fader so you may say they just act as a third fader (the second being trim).

Why is this better than just routing the same tracks to an aux and changing the volume there? Aux tracks are also useful, as you will see in the next section, but if the goal is just level control, VCAs have a few advantages:

  • Coalescing: After every pass, you are able to coalesce your automation, changing the target tracks' levels and leaving your VCA track flat and ready for your next pass.

  • More information: When using an aux instead of a VCA track, there is no way to know if a child track is being affected by it. If you accidentally move that aux fader you may go crazy trying to figure out why your dialogue tracks are all slightly lower (true story). On the other hand, VCAs show you a blue outline (see picture below) with the actual volume lane that would result after coalescing both lanes, so you can always see how a VCA is affecting a track.

  • Post-fader workflow: Another problem with using an aux to control the volume of a group of tracks is that if you have post-fader sends on those tracks, you will still send that audio away regardless of the parent aux's level. This is because you are sending that audio away before you send it to the aux. VCAs avoid this problem by directly affecting the child track's volume and thus also affecting how much is sent post-fader.

Sub Mix buses

This is the final step of the signal chain. After all inserts, faders, trim and VCAs, the resulting audio signals can be routed directly to your output, or you may consider using a sub mix bus instead. This is an aux track that sums all the signals from a specific group of channels (like the dialogue tracks) and allows you to control and process each sub mix as a whole.

These are the type of aux tracks I was talking about in the VCA section. They may not be ideal for controlling the levels of a sub mix, but they are useful when you want to process a group of tracks with the same plugins or when you need to print different stems.

An issue you may run into when using them is finding yourself "fighting" for a sound to be loud enough. You feel that pushing the fader more and more doesn't really help and you barely hear the difference. When this happens, you've probably run out of headroom. Pushing the volume doesn't seem to help because a compressor or limiter further down the signal chain (that is, acting as a post-fader insert) is squashing the signal.

When this happens, you need to go back and give yourself more headroom by making sure you are not over-compressing, or by lowering every track's volume until you are working at a manageable level. Ideally, you should be metering your mix from the start so you know where you are in terms of loudness. If you mix to a loudness standard like EBU R128, that should give you a nice and comfortable amount of headroom.

Final Thoughts

Essentially, mixing is about making things louder or quieter to serve the story that is being told. As you can see, it is important to know where in the audio chain the best place to do this is. If you keep your chain in order, from clip gain to the sub mix buses, making sure levels are optimal every step of the way, you'll be in control and have a better idea of where to act when issues arise. Happy mixing!