Figuring out: Ambisonics

Here are my notes about the world of Ambisonics. This is a new area for me so, following this blog's philosophy, I will try to learn by explaining. Take this as an introduction to the subject.

The basic idea

We usually think about audio formats in terms of channels, mono and stereo being the most basic and widely used ones. If we open up the 2D space even more, we get surround audio like 5.1 and 7.1. Finally, the last step is to use the full 3D space, and that's where Ambisonics comes in.

The more complexity and channels we have, the harder it is to make systems compatible with each other. To solve this, Ambisonics transcends the idea of channels and uses the concept of sound fields, which represent planes of audio in 3D space.

This lets us keep the aural information in a "speaker arrangement agnostic" format that can be decoded to any number of speakers at the time of reproduction.

M/S Format

These planes of audio are represented in a special format called B-Format. You can think of it as a natural extension of the M/S format, so let's start with that.

To get an M/S recording, we first use a figure of eight microphone facing sideways to the source (this is the "side"). This microphone will pick up the stereo information. At the same time, we use a cardioid microphone facing the source (this is the "mid").

To decode these signals into stereo, we just sum mid and side to obtain the left channel, then sum mid and the polarity-reversed side to obtain the right channel. If you think about it, you'll realize that the "side" signal is basically a representation of the difference between left and right.

But why would we want to record things this way? Why not just record in stereo with an X/Y technique or similar? Recording in M/S has a few advantages. Firstly, we get automatic mono compatibility, since we have the mid signal, which we can use without fear of the phase cancellations that would happen if we summed the channels of an X/Y recording. Additionally, since we can decode the M/S recording into stereo after the fact, we can control how wide the resulting stereo signal is by simply adjusting the balance between mid and side during decoding.
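As a quick illustration, here is the decoding step in code: a minimal Python sketch of the arithmetic just described, with a hypothetical width parameter for the mid/side balance.

    def decode_ms(mid, side, width=1.0):
        # width scales the side signal: 0.0 collapses to mono,
        # 1.0 keeps the recorded balance, above 1.0 widens the image.
        left = [m + width * s for m, s in zip(mid, side)]
        right = [m - width * s for m, s in zip(mid, side)]
        return left, right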

B-Format

Ambisonics takes this concept and pushes it into the next dimension, making it 3D by using additional channels to represent height and depth. B-Format is then built with the following channels:

  • W: Contains the sound pressure information, similar to the mid signal in M/S. This is recorded with an omnidirectional microphone.

  • X: Contains the front minus back pressure gradient. Recorded by a figure of eight microphone.

  • Y: Contains the left minus right pressure gradient, similar to the side signal in M/S. Recorded by a figure of eight microphone.

  • Z: Contains the top minus bottom pressure gradient. Recorded by a figure of eight microphone.

Note: A-Format is the name for the raw audio of an ambisonic recording, that is, the individual signals from each microphone, while B-Format is used once all these signals have been combined into a single set.
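To make the channel definitions concrete, here is a minimal sketch of how a mono signal could be encoded (panned) into first-order B-Format. I'm assuming the traditional FuMa convention here, where W is attenuated by 1/√2; other conventions like ambiX use a different channel ordering and normalization:

    import math

    def encode_b_format(sample, azimuth, elevation):
        # Angles in radians; azimuth 0 = front, positive = left.
        w = sample / math.sqrt(2)                              # omnidirectional pressure
        x = sample * math.cos(azimuth) * math.cos(elevation)   # front minus back
        y = sample * math.sin(azimuth) * math.cos(elevation)   # left minus right
        z = sample * math.sin(elevation)                       # top minus bottom
        return w, x, y, z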

Ambisonic Orders

The top row shows the W component, while the second one shows X, Y and Z. Additional rows show higher ambisonic orders for higher resolutions.

Using the B-Format described above works but comes with some drawbacks. The optimal listening position (the sweet spot) is quite small, and results won't be very natural outside it. Also, diagonal information is not very accurate, since it has to be inferred from the boundary between planes.

A solution to these issues is to increase resolution by adding more selective directional components. Instead of traditional polar patterns, these use other, more specific ones, resulting in a signal set that contains denser aural information.

There is really no theoretical limit to how many additional components we can add to improve the resolution, but of course there are clear practical limits. For example, a third-order ambisonics set uses 16 tracks, so it's easy to see how hard drive space and microphone placement can quickly become a problem.
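The track count grows quadratically: an ambisonics set of order n needs (n + 1)² channels. A quick check:

    # Channels needed per ambisonic order: (n + 1) squared.
    for order in range(1, 4):
        print(order, (order + 1) ** 2)   # 1: 4, 2: 9, 3: 16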

Decoding B-Format

Regardless of the number of ambisonic orders we use, the important thing to keep in mind is that the resulting recording will not be channel dependent. We can build the sonic information at any point on a 3D sphere just by knowing the angles to that point.

This allows us to create virtual microphones in this 3D space, which we can then match to any number of speakers. This is very powerful: once we have an ambisonics recording, we can play it on any speaker configuration, preserving as much of the spatial information as the reproduction system allows.
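As a sketch of that idea (again assuming traditional B-Format), a virtual microphone is just a weighted sum of W, X, Y and Z, where the weights come from the direction we point it in and the polar pattern we choose:

    import math

    def virtual_mic(w, x, y, z, azimuth, elevation, p=0.5):
        # p sets the polar pattern: 1.0 = omni, 0.5 = cardioid,
        # 0.0 = figure of eight.
        return p * math.sqrt(2) * w + (1 - p) * (
            x * math.cos(azimuth) * math.cos(elevation)
            + y * math.sin(azimuth) * math.cos(elevation)
            + z * math.sin(elevation)
        )

Point one of these at each speaker's position and you have the core of a basic decoder.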

If the final user is on headphones, decoding can produce a binaural signal, while the same source files can be used to decode a 3D Dolby Atmos mix for cinemas.

Nowadays, you can find a big selection of ambisonic plugins for your DAW, so you can play around with B-Format files, including encoding and decoding them to and from any multichannel format you can imagine.

Use in media

Ambisonics was created in the 70s but has never seen much use in mainstream media. This is now changing with the advent of VR experiences, where 3D audio makes a lot of sense since the user can move around the scene, focusing on different areas of the soundscape.

In the area of cinematic experiences, ambisonics achieves a similar result to Dolby Atmos or Auro-3D, but using different methods. See my article about Atmos to learn more about this.

Google's Resonance Audio allows you to use ambisonics in FMOD.

As for video games, or interactive audio in general, ambisonics is a great fit. You can implement B-Format files in middleware like FMOD or Wwise, and also in game engines like Unity. This gives you the most flexibility, since the ambisonics format is decoded in real time into whatever system the user is using to reproduce audio, and the decoding reacts in real time to their position and orientation, which is particularly awesome for VR.

In closing

There is much more to learn about this, so I hope I get the chance to work with Ambisonics soon. I'm sure there are many details to keep in mind once you are working hands-on with these formats, and I will try to document what I learn on this blog as I go.

Creating an Undipped M&E

What is an M&E

M&E stands for “Music & Effects”. You may also see it written as “ME” or “MandE”. In some places it is called “International Version” or “International Track”.

As the name suggests, the M&E is a stem that includes all the music, sound effects and foley but no dialogue. This stem is then used by dubbing studios to create a version in a different language.

If we are mixing a drama and this international version is required, every sound effect and foley element that was captured with the dialogue on location (footsteps, cloth movement, doors, etc.) needs to be re-created, since it will be lost when the original dialogue is removed.

In the case of documentaries, we may keep the on-screen interview audio for the international version (which will be dubbed over in the new language) but not include the voice over.

Creating and Using M&Es

When you build a mixing session, you need to plan for the M&E, so that no audio with dialogue, or even any recognizable language (like an ambience track with chatter), ends up on it. Dedicated reverbs per stem are also required if we want to deliver the stems easily.

This is easy to achieve in any DAW with just buses and auxiliaries. Here is a very simplified diagram showing a routing example. Bear in mind that this can get much more complicated as you add more tracks and deal with surround. I've also omitted VCAs and reverbs for the sake of simplicity.

As you can see, with this routing, the M&E Print would be free of any dialogue.

What is an undipped M&E and how is it useful?

In this context, "dipping" is simply the act of lowering the music and/or FX to accommodate dialogue or narration, as you can see in the picture below.

https://blog.frame.io/2017/08/09/audio-spec-sheet/

If we deliver an M&E created from the session above, you can see how the music will go down every time there is dialogue in the mix, which may not be ideal in some cases. We would say that M&E is dipped.

On the other hand, an M&E is undipped if music and SFX levels don't change to accommodate dialogue and instead stay constant for the whole duration. Depending on the nature of the show we are working on, this may be what our client needs.

For cinema and dramatic content in general, a standard (dipped) M&E is usually sufficient, since most dialogue has lipsync, meaning the person talking can be seen in shot when delivering the lines. Why is this relevant? Because when dubbing this kind of project into another language, the timing of each line is going to be roughly the same, so the dips in volume will match.

But in the case of documentaries or some TV shows, things can be different. Here we may have narration or voice overs with no lip sync. We also need to remember that some languages are more condensed than others, so narration may take more or less time.

So, imagine you give a dipped (standard) M&E to the client and they start recording the voice over in their language. On this M&E, levels are moving up and down to make room for the narration in the original language, but these moves may not match the narration in the new one! In that case, they are going to be constantly fighting our automation to accommodate their newly recorded dialogue. Not ideal.

A better solution is to give them an M&E that doesn't dip at all: levels remain constant, so the dips can be done later, when the new language is recorded. Let's see how to do this.

How to create an undipped M&E

There are a few possible ways to do this, but this is what makes the most sense to me. Bear in mind that this basic structure may change (a lot) depending on the project's needs. I think that, when possible, it is ideal to deliver both dipped and undipped versions to the client. My routing is built to do this. Let's have a look:

As you can see, it looks similar to the previous one, but a bit more complex. The first thing to consider is that we are now differentiating between synced dialogue and VO. This is because they need to be mixed in different ways. Notice that only the VO goes to the Mix Master, while the synced dialogue also goes to the M&E. This is not always the case, but it is quite common to leave the original voice of an interview and dub on top of it in the new language. Other than that, things flow in a similar fashion.

The other big concept to keep in mind is where to write the automation. You would normally automate music on its own tracks, and this is what we do here, but only for music that plays against synced dialogue. In that case, we don't mind having dips on the M&E. But for music that plays against VO, we would have a problem if we automated the track itself, since all those changes would go straight to the M&E, which is exactly what we want to avoid!

The solution, then, is to do the automation on the M&E Master auxiliary whenever the dips are around narration and VO. Once this is done, we can take two sends from this track. One of them is a pre-fader send, which becomes our undipped M&E. The other is a post-fader send (or just the track's output), which is dipped. That way, we can easily bounce both. As you can see, I have labeled some tracks with A, B or C to indicate where the automation would be done.
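To make the pre/post-fader distinction concrete, here is a toy numeric sketch (made-up gain values, just to illustrate the signal flow):

    # Constant-level music + FX arriving at the M&E Master auxiliary.
    me_input = [1.0, 1.0, 1.0, 1.0]
    # Fader automation written on the M&E Master: a dip around the VO.
    fader = [1.0, 0.5, 0.5, 1.0]

    undipped = me_input                                # pre-fader send: the dip is never applied
    dipped = [s * g for s, g in zip(me_input, fader)]  # post-fader send/output: the dip is baked in

    print(undipped)  # [1.0, 1.0, 1.0, 1.0]
    print(dipped)    # [1.0, 0.5, 0.5, 1.0]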

The main drawback of this method is that, on occasion, you may find yourself fighting between two sets of automation (the tracks themselves and the M&E Master auxiliary), especially when the editing jumps from on-screen interviews to narration very quickly. I don't see any good solution to this other than keeping the session tidy and making clean fader moves.

Something to keep in mind is that, for some situations or for some people, a better approach could be to do absolutely all music and effects automation on the M&E Master and not act differently depending on whether the dialogue has lipsync. This could also work; it is a matter of taste and client needs.

Conclusion & Resources

So that's pretty much it. This technique is a nice one to have in your toolbox: if you have been asked to deliver an undipped M&E, now you know how to do it and why. The video below explains the same idea with some examples in a Pro Tools session. Have a listen for extra info. Thanks for reading.

Pro Tools Batch Rename & Regular Expressions

Batch renaming was introduced in Pro Tools at the end of 2017, with version 12.8.2. I haven't had much of a chance to use it since then, as most of my work has been mixing and sound design. Nevertheless, after some recent days of recording voice acting, and all the editing that comes with it, I have been looking into this feature.

So this is a quick summary of what you can do with it, with some tips and examples.

Operations

There are two batch rename windows in Pro Tools, one for clips and another for tracks. They are, for the most part, identical. You can open each of them with the following shortcuts:

  • Clips: CTRL + SHIFT + R

  • Tracks: OPTION + SHIFT + R

Both windows also have a preset manager which is great to have.

As you can see, there are four different operations: Replace, Trim, Add and Numbering. As far as I can tell, the operations are always executed from top to bottom, so keep that in mind when designing a preset. Let's see each of them in more detail:

Replace (CMD + R) allows you to search for any combination of letters and/or numbers and replace it with a different one. The "Clear Existing Name" checkbox lets you completely remove any previous name the track or clip had. This option makes sense when you want to start from scratch and use the other operations (Add and Numbering) afterwards.

For example, let's say you don't like it when Pro Tools adds that ugly "dup1" to your track names when duplicating them. You could use a formula like this:

Original names    New names

FX 1.dup1         FX 1 Copy
FX 2.dup1         FX 2 Copy
FX 3.dup1         FX 3 Copy

You may realise that this only works with the first copy of a track. Further copies of the same track will be named "…dup2, …dup3", so the replace won't catch them. There is a way to fix that with the last checkbox, "Regular Expressions". This allows you to create complex and advanced searches and is where the true power of batch renaming resides. More about it later.
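Jumping ahead a little, here is what that fix looks like in regex form (shown with Python's re module for illustration; the pattern in quotes is the part you would type into the Find field):

    import re

    names = ["FX 1.dup1", "FX 2.dup2", "FX 3.dup12"]
    # "\.dup\d+" matches a literal dot, then "dup", then one or more digits.
    print([re.sub(r"\.dup\d+", " Copy", n) for n in names])
    # -> ['FX 1 Copy', 'FX 2 Copy', 'FX 3 Copy']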

Trim (CMD + T) is useful when you want to shave off a known number of characters from the beginning or end of the name. You can even use the range option to remove characters right in the middle. This, of course, makes the most sense when you have consistent name lengths, since any difference in length will screw up the process.

So, for example, if you have the following structure and you want to remove the date, you can use the following operation:

Original names                   New names

Show_EP001_Line001_280819_v01    Show_EP001_Line001_v01
Show_EP001_Line002_280819_v03    Show_EP001_Line002_v03
Show_EP001_Line003_280819_v02    Show_EP001_Line003_v02

Add (CMD + D) lets you insert prefixes and suffixes, pretty much doing the opposite of Trim. You can also insert any text at a certain index in the middle of the name.

Building on the previous example, we can add a suffix to mark the takes that are approved. It would look like this:


Original names            New names

Show_EP001_Line001_v01    Show_EP001_Line001_v01_Approved
Show_EP001_Line002_v03    Show_EP001_Line002_v03_Approved
Show_EP001_Line003_v02    Show_EP001_Line003_v02_Approved

Finally, Numbering (CMD + N) is a very useful operation that adds a sequence of numbers or even letters at any index. You can choose the starting number or letter and the increment value. As far as I can tell, the increment can't be negative. If you want a sequence of letters, check the "Use A..Z" box; in that case, the starting number 1 corresponds to the letter "A".

If we are dealing with different layers of a sound, we could use this function to label them like so:

Original names    New names

Plasma_Blaster    Plasma_Blaster_A
Plasma_Blaster    Plasma_Blaster_B
Plasma_Blaster    Plasma_Blaster_C

As you can see, in this case we are using letters instead of numbers, and an underscore to separate them from the name. Also, in the case of clips, you can choose whether the order comes from the timeline itself or from the clip list.

Regular Expressions

Regular expressions (or regex) are a kind of unified language or syntax used in software to search, replace and validate text. As I was saying, this is where the true power of batch renaming lies. In fact, it may be a bit overkill for Pro Tools, but let's see some formulas and tips for using regular expressions in it.

This stuff gets tricky fast, so you can follow along by trying the examples in Pro Tools or at https://regex101.com/.

Defining searches

First off, you need to decide what you want to find, in order to replace it or delete it (replace it with nothing). You can of course search for a literal term like "Take" or "001", but you don't need regex for that. Regex shines when you need to find more general patterns, like any 4 digit number, or the word "Mic" followed by optional numbers. Let's see how to do all this with some commands and syntax:

[…] Anything between square brackets is a character set. You can use "-" to describe a range. For example, "[gjk]" searches for either g, j or k, while "[1-6]" means any digit from 1 to 6. We could use "Take[0-9]" to search for the word "Take" followed by any 1 digit number.

Curly brackets specify how many times we want to match a character set. For example, "[0-9]{5}" looks for any run of digits that is exactly 5 long. This can be useful to remove or replace a fixed-length set of numbers, like a date. You can also use "[0-9]{5,8}" to search for any number between 5 and 8 digits long. Additionally, "[0-9]{5,}" looks for any number that is 5 or more digits long.

There are also special instructions to search for specific sets of characters. "\d" matches any digit character, while "\w" matches any letter, digit or underscore. "\s" finds any whitespace character (normal spaces or tabs).
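Here are those searches in action (Python's re module for illustration; regex101.com will show you the same matches):

    import re

    print(re.findall(r"Take[0-9]", "Take1 Take7 TakeA"))   # ['Take1', 'Take7']
    print(re.findall(r"[0-9]{6}", "Show_280819_v01"))      # ['280819']
    print(re.findall(r"\d+", "Line003_v02"))               # ['003', '02']
    print(re.findall(r"\w+", "Mix v2!"))                   # ['Mix', 'v2']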

Modifiers

When defining searches, you can use some modifiers to add extra meaning. Here are some of the most useful (there's a quick demo after the list):

. (dot or full stop) Matches any single character. So "Take_." would match "Take_" followed by any one character.
+ (plus sign) Matches one or more of the preceding item. We could use "Take_.+" to match "Take_" followed by any number of characters.
^ (caret) When used within a character set, it means "everything but what follows". So "[^a-d]" matches any character that is not a, b, c or d.
? (question mark) Makes the preceding item optional. So, for example, "Mic\d?" matches the word Mic by itself and also with any 1 digit number after it.
* (asterisk) Also makes the preceding item optional, but allows multiple instances of it. In a way, it is a combination of + and ?. So, for example, "Mic\d*" matches "Mic" by itself, "Mic6", but also "Mic456" and, in general, the word Mic with any number of digits after it.
| (vertical bar) Expresses the boolean "or". So, for example, "Approved|Aproved" searches for either of these options and applies the same processing to both if they are found.
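And the promised demo of these modifiers, again in Python for illustration:

    import re

    print(re.findall(r"Take_.", "Take_A Take_9"))       # ['Take_A', 'Take_9']
    print(re.findall(r"Mic\d?", "Mic Mic7 Mic77"))      # ['Mic', 'Mic7', 'Mic7']
    print(re.findall(r"Mic\d*", "Mic Mic7 Mic456"))     # ['Mic', 'Mic7', 'Mic456']
    print(re.findall(r"[^a-d]", "abcde"))               # ['e']
    print(re.findall(r"Approved|Aproved", "Aproved!"))  # ['Aproved']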

Managing multiple regex in the same preset

You sometimes want to process several sections of a name and replace each with something different, regardless of their position and the content around them. To achieve this, you could create a regex preset for each section, but it is also possible to have several regex formulas in a single one. Let's see how.

In the "Find:" section, we need to use (…) (parentheses). Each section enclosed in parentheses is called a group. A group is just a set of instructions that is processed as a separate entity. So, if we want to search for "Track" and also for a 3 digit number, we could use a search like "(Track)(\d{3})". Now, it is important to be careful with what we put between the two groups, depending on our goal. With nothing in between, Pro Tools will strictly search for the word Track immediately followed by a 3 digit number. We may want this, but typically what we want is to find those terms wherever they appear in the name, in whichever order. For this, we can use a vertical bar (|) between the two groups, like so: "(Track)|(\d{3})", which tells Pro Tools: hey, search for this or for that, then replace any match.

But what if you want to replace each group with a specific, different thing? This is easily done by also using groups in the "Replace" section. You need to identify each of them with "?1", "?2" and so on. So the example on the right would search for the word "Track" anywhere in the name and replace it with "NewTrack", then search for any 3 digit number and replace it with "NewNumbers".
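The "?1", "?2" group replacements are Pro Tools' own syntax; to play with the same logic outside Pro Tools, here is a rough Python equivalent that uses a replacement function instead (hypothetical names, same idea):

    import re

    def repl(match):
        if match.group(1):       # group 1 matched: the word "Track"
            return "NewTrack"
        return "NewNumbers"      # otherwise group 2 (a 3 digit number) matched

    print(re.sub(r"(Track)|(\d{3})", repl, "Track_017"))
    # -> NewTrack_NewNumbers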

Here is a more complex example, involving 4 different groups. If you have a look at the original names, you will see this structure: "Show_EpisodeNumber_Character_LineNumber". We want to change the show and character placeholders to their proper names. We are also using a "v" character after the line number to indicate that a take is the one approved by the client; it would be nice to transform this into the string "_Approved". Finally, Pro Tools adds a dash (-) and some numbers after you edit any clip, and we want to get rid of all of that. If you have a look at our regex, you will see that we can solve all of this in one go. Also, notice how the group order is not important, since we are using vertical bars to separate them. In the third group, I'm searching for anything that comes after a dash and replacing it with nothing (i.e., deleting it), which can be very handy sometimes. So the clip names change like so:

Original names                 New names

Show_045_Character_023-01      Treasure_Island_045_Hero_023
Show_045_Character_026v-03     Treasure_Island_045_Hero_026_Approved
Show_045_Character_045v-034    Treasure_Island_045_Hero_045_Approved
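For reference, here is the whole rename sketched outside Pro Tools, with one replacement per group so you can see how each alternative maps to its own output (Python, hypothetical but runnable):

    import re

    def rename(name):
        # The vertical bars let each alternative match wherever it appears.
        def repl(m):
            if m.group(1): return "Treasure_Island"   # the show's proper name
            if m.group(2): return "Hero"              # the character's proper name
            if m.group(3): return ""                  # delete the dash and what follows
            return "_Approved"                        # the lone "v" marker
        return re.sub(r"(Show)|(Character)|(-.*)|(v)", repl, name)

    print(rename("Show_045_Character_026v-03"))
    # -> Treasure_Island_045_Hero_026_Approved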

Other regex functions that I want to learn in the future

I didn't have time to learn or figure out everything I think regular expressions could do, so here is a list of things I would like to research in the future. Maybe some of them are impossible for now. If you are also interested in achieving any of these, leave a comment or send me an email and I may have a look.

  • Command that adds the current date with a certain format.

  • Commands that add meta information like type of file, timecode stamp and such.

  • Syntax that allows you to search for a string of characters, process it in some way, and then use it in the replace section.

  • Deal with case sensitivity.

  • Capitalize or uncapitalize characters.

  • Conditional syntax (if you find some string do A, if you don't, do B).

Regex Resources:

https://regex101.com/
https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
https://www.youtube.com/playlist?list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD

Conclusion

I hope you now have a better understanding of how powerful batch renaming can be. With regular expressions, I just wanted to give you some basic principles to build upon, so you have enough knowledge to start building more complex presets that can save you a lot of time.

Figuring out: Audio Pull up/down

When working with video, an audio pull up or pull down is needed when there's been a change in the picture's frame rate and you need to tweak the audio to make sure it stays in sync.

This subject always seems to be surrounded by a layer of mysticism and confusion, so this is my attempt to go through the basics and hopefully get some clarity.

Audio Sampling Rate

First, we need to understand some basic digital audio concepts. Feel free to skip this if you have it fresh.

Whenever we convert an audio signal from analogue to digital, all we are doing is checking where the waveform is at certain "points" in its oscillation. These "points" are usually called samples.

In order to get a faithful signal, we need to sample our waveforms many times. The number of times we do this per second is the sampling rate, measured in Hertz.

Keep in mind that if our sampling rate is not fast enough, we won't be able to "capture" the higher frequencies, since they fluctuate faster than we can measure. So how fast do we need to be for accurate results?

The Nyquist-Shannon sampling theorem gives us the answer. It says that we need a sampling rate of at least twice the highest frequency we want to capture. Since the highest frequency humans can hear is around 20 kHz, a sampling rate just above 40 kHz suffices. With that in mind, let's see the most commonly used sampling rates:

Sampling Rate    Use
8 kHz            Telephones, walkie-talkies
22.05 kHz        Low quality digital audio
44.1 kHz         CD quality, the music standard
48 kHz           The standard for professional video
96 kHz           DVD & Blu-ray audio
192 kHz          DVD & Blu-ray audio; usually the highest sampling rate for professional use

As you can see, most professional formats use a sampling rate higher than 40 kHz to guarantee that we capture the full frequency spectrum. Something important to remember, which will become relevant later on, is that a piece of audio will always have the same length as long as it is played at the same sample rate it was recorded at.
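In other words, duration is just the number of samples divided by the playback rate; play the same samples back at a different rate and the duration changes. A quick sanity check:

    samples = 480_000          # 10 seconds of audio recorded at 48 kHz
    print(samples / 48000)     # 10.0 seconds at the recorded rate
    print(samples / 50000)     # 9.6 seconds if played back faster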

For the sake of completeness, I just want to mention audio resolution (or bit depth) briefly. This is the other parameter we need to take into consideration when converting to digital audio. It measures how many bits we use to encode the information of each sample. Higher values give us more dynamic range, since a bigger range of intensity values is captured. This doesn't really affect the pull up/down process.

Frames per second in video

Let's now jump to the realm of video. There's a lot to be said on the subject of frame rate, but I will keep it short. This value is simply how many pictures per second are put together to create our film or video. 24 frames per second (fps) is the standard for cinema, while TV uses 25 fps in Europe (PAL) and 29.97 fps in the US (NTSC).

Keep in mind that these frame rates differ not only on a technical level but also on a stylistic one. 24 fps "feels" cinematic and "premium", while the higher frame rates used in TV can feel "cheap". This is probably a cultural perception and is definitely changing. Video games, which often use high frame rates like 60 fps and beyond, are partially responsible for this shift in taste. The amount of motion also matters: higher frame rates are best at showing fast motion.

But how can these different frame rates affect audio sync? The problem usually starts when a project is filmed at a certain rate and then converted to a different one for distribution. This happens if, for example, a movie (24 fps) is brought to European TV (25 fps), or an American TV programme (29.97 fps) is brought to India, which uses PAL (25 fps).

Let's see how this kind of conversion is done.

Sampling Rate vs Frame Rate

Some people think that audio can be set to record at a certain frame rate the same way it can be set to record at a certain sampling frequency. This is not true. Audio doesn't intrinsically have a frame rate the way it has a bit depth and a sampling rate.

If I give you an audio file and nothing else, you could easily figure out the bit depth and sampling rate, but you would have no idea about the frame rate of the associated video. Now, and here comes the nuanced but important point: any audio recorded at the same time as video will sync with the specific frame rate used when recording that video. They sync because they were recorded together; what the camera registered as a second of video was also a second of audio in the sound recorder. Of course, machines are not perfect, and their clocks may measure a second slightly differently, which is why we connect them via timecode, but that's another story.

This session is set at 24 fps, so each second is divided into 24 frames.

Maybe this confusion comes from the fact that when you create a new session or project in your DAW, you basically set three things: sampling rate, bit depth and frame rate. So it feels like the audio inside is going to have those three intrinsic values. But that is not the case with frame rate. In the context of the session, frame rate only tells your DAW how to divide a second. Into 24 slices? That's 24 fps. Into 60 slices? That's 60 fps.

In this manner, when you bring your video into your DAW, the video's burnt-in timecode and your DAW's timecode will be perfectly in sync, but none of this changes anything about the duration or quality of the audio within the session.

So, in summary, an audio file only has an associated frame rate in the context of the video it was recorded with (or to), but this is not an intrinsic characteristic of the audio file and cannot be determined without the corresponding video.

Changing Frame Rate

A frame rate change is usually needed when the medium (cinema, TV, digital…) or the region changes. There are two basic ways of doing this. The first manages it without changing the final duration of the film, usually by re-distributing, duplicating or deleting frames to accommodate the new frame rate. I won't go into detail on these methods, partly because they are quite complex, but mostly because if the length of the final picture is not changed, we don't need to do anything to the audio. It will be in sync anyway.

Think about this for a second. We have changed the frame rate of the video but, as long as the final length is the same, our audio is still in sync, which again shows that audio has no intrinsic frame rate value. Disclaimer: this is true as long as the audio and film are kept separate. If audio and picture are on the same celluloid and you start moving frames around, you are obviously going to mess up the audio, but in our current digital age we don't need to worry about this.

The second method is the one that concerns us. Here, the length of the picture is actually changed. This is done because it is the easiest way to absorb the frame rate difference, especially if it is not very big.

Telecine. How video frame rate affects audio.

Let's use the telecine case as an example. Telecine is the process of transferring old-fashioned analogue film to video. This usually (though not always) also implies a change in frame rate. As we saw earlier, films are traditionally shot at 24 fps. If we want to broadcast such a film on European television, which uses the PAL system at 25 fps, we need to go from 24 to 25 fps.

The easiest way to do this is to just play the original film 4% faster. The picture will move faster and the movie will finish earlier, but the difference is tolerable. Also, if you can show the same movie in less TV time, you get more time for commercials, so win, win.

What are the drawbacks? First, showing the picture 4% faster may be tolerable, but it is not ideal and can be noticeable in quick action sequences. Second, and more importantly, our audio is now out of sync. We can always fix this by also playing the audio 4% faster (and this was traditionally the case, since audio and picture were embedded in the same film), but then the pitch goes up by about 0.71 semitones.

In the digital realm, we can achieve this by simply playing the audio at a different sample rate than the one it was recorded at. This is the digital equivalent of just cranking the projector faster. Remember when I said that an audio file will always have the same length if it is played at the same sample rate it was recorded at? This is where that becomes relevant. As you can see below, if we play a 48 kHz file at 50 kHz, we get exactly the speed-up that a change from 24 to 25 fps requires.

This solves our sync problem but, as we were saying, it increases the final pitch of the audio by about 0.71 semitones.
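The arithmetic is simple enough to sketch. Assuming we just relabel the playback rate, the speed factor and the resulting pitch shift fall out directly:

    import math

    def pull(sample_rate, source_fps, target_fps):
        speed = target_fps / source_fps         # e.g. 25/24 for film to PAL
        playback_rate = sample_rate * speed     # play the file at this rate to stay in sync
        semitones = 12 * math.log2(speed)       # resulting pitch shift
        return playback_rate, semitones

    print(pull(48000, 24, 25))   # (50000.0, ~ +0.71 semitones)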

That increase in pitch may sound small, but it can be quite noticeable, especially in dialogue and musical sections. So how do we solve this? For many years, the simple answer was: we don't. Just leave it as it is. But nowadays we can re-pitch the resulting audio so it matches the original sound or, alternatively, directly change the length of the audio file without affecting the pitch. More on these methods later, but first let's see what happens if, instead of the reasonable jump from film to PAL, we need to go from film to NTSC.

Bigger frame rate jumps, bigger problems (but not for us).

If a jump from 24 to 25 fps is a 4% change, a jump from 24 to 29.97 would be a whopping 24.9%. That's way too much and would be very noticeable. Let's not even think about the audio: everybody would sound like a chipmunk. So how is this accomplished? The method used is called a "2:3 pulldown".

Now, this method is quite involved, so I'm not going to explain the whole thing here, but let's see the basics and how it affects our audio. Let's start with 30 fps, as this was the original frame rate for TV in NTSC. This made sense because the electrical grid in the States works at 60 Hz. But things were bound to get messy: after color TV was introduced, and for reasons you can see here, the frame rate had to be dropped by 1/1000th to 29.97.

A 2:3 pulldown uses the proportion of frames and the interlaced nature of the resulting video to make 4 frames fit into 5. This works because a 24/30 proportion is equal to a 4/5 proportion. Again, this is complex and goes beyond the scope of this article, but if you want more details this video can help.

But wait, we don't want to end up with 30 frames; we need 29.97, and this is why the first step is to slow down the film from 24 fps to 23.976. This difference is impossible to detect but crucial to make our calculations work. Once this is done, we can do the actual pulldown, which doesn't change the length of the film any further; it only re-arranges the frames.

What does all this mean for us audio people? It means we only need to worry about that initial change from 24 to 23.976, which is just a 0.1% change. That's small, but it will still throw your audio out of sync over the length of a movie. So we just need to adjust the speed the same way we do for the 4% change. If you look again at the picture above, you'll see that 0.1% is the change we need to go from film to NTSC.

As for the change in pitch, it will be very small, but we can still correct it if needed with the methods I show below. But before that, here is a table for your convenience with the usual frame rate changes and the associated audio adjustments:

Frame Rate Change    Audio Speed Change    Pitch Correction (if needed)
Film to PAL          4% Up (x1.0417)       4% Down // 96% // -0.71 semitones
Film to NTSC         0.1% Down (x0.999)    0.1% Up // 100.1% // +0.02 semitones
PAL to Film          4% Down (x0.96)       4% Up // 104.2% // +0.71 semitones
PAL to NTSC          4.1% Down (x0.959)    4.3% Up // 104.3% // +0.72 semitones
NTSC to Film         0.1% Up (x1.001)      0.1% Down // 99.9% // -0.02 semitones
NTSC to PAL          4.3% Up (x1.043)      4.1% Down // 95.9% // -0.72 semitones
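All of these rows follow from the same two formulas, so you can re-derive the table yourself (the values above are rounded):

    import math

    # Frame rates relevant to the audio step. NTSC video is 29.97 fps,
    # but the audio-relevant change is 24 <-> 23.976, as explained above.
    rates = {"Film": 24.0, "PAL": 25.0, "NTSC": 23.976}

    for src in rates:
        for dst in rates:
            if src == dst:
                continue
            speed = rates[dst] / rates[src]
            correction = 12 * math.log2(1 / speed)   # semitones to restore pitch
            print(f"{src} to {dst}: speed x{speed:.4f}, correction {correction:+.2f} st")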

Techniques & Plugins

There are two basic methods to do a pull up or pull down. The first involves two steps: first changing the duration of the file while affecting its pitch (using a different sample rate, as explained before), and then applying pitch correction to match the original tone. How to do the first step depends on your DAW, but in Pro Tools, for example, you'll see that when importing audio you have the option to apply SRC (Sample Rate Conversion) to the file, as pictured above.

The second method is simply doing it all at once with a plugin capable of changing the length of an audio file without affecting its pitch.

Also, keep in mind that these techniques can be applied not only to the final stereo or surround mix file but also to the whole session itself, which gives you much more flexibility to adjust your mix for the new version. This matters because a 4% change in speed can be enough to put two short sounds too close together, and/or the feel of the mix can end up slightly different. Personally, I have only used this "whole session" technique with shorter material like commercials. Here is a nice blog post that goes into detail about how to accomplish this.

As for changing a mixed file as a whole, whether you use a one-step or a two-step method, you will probably find that it is easy to introduce glitches, clicks and pops into the mix. Sometimes you get dialogue that sounds metallic. Phase is also an issue, since the time/pitch processing is not always consistent between channels.

The thing is, time/pitch shifting is not an easy thing to accomplish. Some plugins offer different algorithms to choose from depending on the type of material you have. These are designed with music in mind, not dialogue, so "Polyphonic" is usually the best option for whole mixes. Another trick you can use is to bounce your mix into stems (music, dialogue, FX, ambiences, etc.) and then apply the shift to each of them independently, using the best plugin and algorithm for each. This can be very time consuming but will probably give you the best results.

As you can see, this whole process is kind of tricky, particularly the pitch shift step, and this is why on some occasions the audio is corrected for sync but left at the wrong pitch. Nevertheless, nowadays we have better shifting plugins to do the job. Here are some of the most commonly used, although remember that none of them works perfectly on every occasion:

-Zplane Elastique: This is, in my opinion, the best plugin and the one I personally use. It produces the fewest artefacts, keeps phase coherent and works great on whole mixes, even with single-step processing.
-Pro Tools Pitch Shift: This is the stock time/pitch plugin that comes with Pro Tools. It is quite fast but is prone to creating artifacts.
-Pro Tools X-Form: This one is more advanced (it comes bundled with Pro Tools Ultimate), but it still suffers from some issues, like giving dialogue a metallic tone or messing up the phase on stereo and surround material. Also, it is slow. Veeeery slow.
-Serato Pitch 'n Time: I haven't tried this one, but I had to mention it since it is very commonly used and people swear by it.
-Izotope Time & Pitch: It can work well sometimes and offers many customizable settings that you can adjust to avoid artefacts.
-Waves Sound Shifter: Haven't used it either, but it's another option that seems to work well for some applications.

Which one should you choose? There is no clear answer; you will need to experiment with some of them to see what works for each project. Here is a good article and video comparing some of them.

Conclusions

I hope you now have a somewhat better understanding of this messy subject. It is tricky on both a theoretical and a practical level, but I believe it is worth figuring out where things come from instead of just doing what others do without really knowing why. Here are some takeaways:

  • Sampling rate and bit depth are intrinsic to an audio file.

  • At the same time, an audio file can be associated with a certain video frame rate when they are both in sync.

  • The frame rate change process is different depending on the magnitude of the change.

  • An audio pull up or pull down is needed when there is a frame rate change on the picture that affects its length.

  • The pull up/down can be done in two steps (length change first, then pitch correction) or in a single step.

  • Time/pitch shifting is a complicated process that can produce artefacts, metallic timbres and phase issues.

  • Mixes can be processed by stems or even as whole sessions for more flexibility.

  • Try different plugins and algorithms to improve results.

Thanks for reading!