How audio conversion works

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the audio category.

Last Updated: 2024-04-18

This file contains some random titbits from my journey into audio editing.

1. Is clipping detected through another tool (i.e. is Live wrong?)

# -n discards the audio output (null file) - we only want sox's warnings
# add a little gain (0.1 dB) to catch files right on the edge
$ for i in *.mp3; do echo "$i" "$(sox "$i" -n --norm -R gain 0.1 2>&1)"; done
# files that clip show up with a sox warning about clipped samples

=> no results => there must be other causes of distortion aside from clipping

2. Is dithering an issue?

Quantization errors are errors that result from increasing/decreasing gain followed by quantization (rounding each sample to the nearest representable numerical value).

"It is precisely this error which manifests itself as distortion. What the ear hears as distortion is the additional content at discrete frequencies created by the regular and repeated quantization error."

sample amplitudes at successive time points
1 2 3 4 5 6 7 8

apply -20% gain (i.e. multiply each by 0.8)
0.8 1.6 2.4 3.2 4.0 4.8 5.6 6.4

round results
1 2 2 3 4 5 6 6

The problem: many of these results are off by up to 0.4 from their true scaled values.

truncate instead
0 1 2 3 4 4 5 6

Similar issue: errors of up to 0.8 this time, and the two middle values both collapse to 4.

A way around it, built into sox as the dither effect, is to add a small amount of noise to mask the quantization effects (especially if the output sample size is less than 24 bits).

Another idea: "Many audio engineers keep their audio files at 24 bits while they’re working on them, and reduce the bit depth to 16 bits when they’re finished processing the files or ready to burn to an audio CD. Advantage of this method is that when the bit depth is higher, less error is introduced by processing steps like normalization or adjustment of dynamics. "
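Putting that together as a sox command - a sketch, assuming a 24-bit working file named session-24bit.wav (placeholder name), with the dither effect adding the masking noise during the final reduction to 16 bits:

# work at 24 bits, then reduce to 16 bits only at the very end
$ sox session-24bit.wav -b 16 final-16bit.wav dither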

3. How do WAV files work?

The most common WAV audio format is uncompressed audio in the linear pulse code modulation (LPCM) format. LPCM is also the standard audio coding format for audio CDs, which store two-channel LPCM audio sampled at 44,100 Hz with 16 bits per sample. That is 32 bits per sample frame across the two channels, i.e. 44,100 × 32 = 1,411,200 bits per second (roughly 1.4 million).
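Quick sanity check of that number at the shell:

# 44,100 sample frames/s x 16 bits x 2 channels
$ echo $((44100 * 16 * 2))
1411200

i.e. the 1 411 kb/s that mediainfo reports further down.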

4. What do the lame (mp3 creator program) CLI options do?

-r

Assume the input file is raw pcm. Sampling rate and mono/stereo/jstereo must be specified on the command line. For each stereo sample, LAME expects the input data to be ordered left channel first, then right channel. By default, LAME expects them to be signed integers with a bitwidth of 16 and stored in little-endian. Without -r, LAME will perform several fseek()'s on the input file looking for WAV and AIFF headers. Might not be available on your release.

Breaking that apart, the raw-input defaults are: (1) left channel first, then right; (2) little-endian byte order; (3) signed 16-bit samples (s16).

=> somehow, when I remove this flag, it all breaks! (Which makes sense: my input is headerless raw PCM from fluidsynth, so without -r lame goes hunting for a WAV/AIFF header that isn't there.)
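For reference, a minimal raw-PCM invocation looks something like this (input.raw and output.mp3 are just placeholder names; -s gives the input sample rate in kHz, -m s forces stereo, and the signed/16-bit/little-endian defaults line up with what fluidsynth emits below):

$ lame -r -s 44.1 -m s input.raw output.mp3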

What does fluidsynth do?

fluidsynth -r 44100 -R 1 -E little -T raw -F - -O s16 soundfont_path midifile_path

-r 44100 sets the sample rate to 44100 (sample rate = how many samples are taken per second)

The sampling rate you use is directly related to the range of frequencies you can capture. With a sampling rate of 44,100 Hz, the highest frequency you can capture is 22,050 Hz (0.5 times the sampling rate, due to the Nyquist theorem). BUT: we also know that musical instruments generate harmonic frequencies much higher than 20 kHz. Even though you can't consciously hear those harmonics, their presence may have some impact on the harmonics you can hear. This might explain why someone with well-trained ears can hear the difference between a recording sampled at 44.1 kHz and the same recording sampled at 192 kHz.

-T is the output file type. The default is wav, but I'm using raw - presumably because I'm piping the output into another program rather than writing a self-contained file.

-O s16 sets the output sample format: s8, s16, s24 and s32 mean signed PCM audio of the given number of bits (floating-point formats are also available).
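So, as far as I can tell, the whole pipeline looks roughly like this (output.mp3 is a placeholder): fluidsynth writes headerless s16 little-endian PCM to stdout via -F -, and lame -r reads exactly that format from stdin.

$ fluidsynth -r 44100 -R 1 -E little -T raw -F - -O s16 soundfont_path midifile_path \
    | lame -r -s 44.1 -m s - output.mp3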

Audio buffers?

Not included in the options fed to fluidsynth above are the number of audio buffers (controlled with -c) and their size (controlled with -z).

"Latency is the period of time between when an audio signal enters a system and when the sound is output and can be heard. Digital audio systems introduce latency problems not present in analog systems. It takes time for a piece of digital equipment to process audio data, time that isn't required in fully analog systems where sound travels along wires as electric current at nearly the speed of light. An immediate source of latency in a digital audio system arises from analog-to-digital and digital-to-analog conversions. Each conversion adds latency on the order of milliseconds to your system."

"Another factor influencing latency is buffer size. The input buffer must fill up before the digitized audio data is sent along the audio stream to output. Buffer sizes vary by your driver and system, but a size of 1024 samples would not be usual, so let's use that as an estimate. At a sampling rate of 44.1 kHz , it would take about 23 ms to fill a buffer with 1024 samples. " Why 23ms? Because 1 second/44100 samples * 1024 samples

Why is an input buffer needed in the first place? "This input buffer must be large enough to hold the audio samples that are coming in while the CPU is off somewhere else doing other work."

What happens with a small buffer? "If the input buffer is too small, samples have to be dropped or overwritten because the CPU isn't there in time to process them. " You'll hear breaks in the audio and other issues.

What happens with a large buffer? "It can hold the samples that accumulate while the CPU is busy (eg. writing recorded data to HD), but the amount of time it takes to fill up the buffer is added to the latency." i.e. the latency will suck.

The whole idea behind monitor headphones is that they feed the performer the analogue audio directly, so the singer won't hear a latency-delayed echo of herself that causes her to sing offbeat etc.

How to reduce latency: "In general, the way to reduce latency caused by buffer size is to use the most efficient driver available for your system"

So total latency is the sum of digital conversion overhead and buffer overhead.

What is raw data?

It is headerless, so tools like sox cannot infer the sample rate, bit depth, or channel count - you have to supply them explicitly.
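So when feeding raw data to sox, you must spell everything out yourself. A sketch (input.raw and output.wav are placeholders) matching the s16 little-endian stereo stream used elsewhere in these notes:

# declare type, sample rate, endianness, encoding, bit depth and channel count explicitly
$ sox -t raw -r 44100 -L -e signed-integer -b 16 -c 2 input.raw output.wav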

What info can I get about a wav file?

 $ mediainfo output/all.wav

 General
 Complete name                            : output/all.wav
 Format                                   : Wave
 File size                                : 21.2 MiB
 Duration                                 : 2 min 6 s
 Overall bit rate mode                    : Constant
 Overall bit rate                         : 1 411 kb/s

 Audio
 Format                                   : PCM
 Format settings                          : Little / Signed
 Codec ID                                 : 1
 Duration                                 : 2 min 6 s
 Bit rate mode                            : Constant
 Bit rate                                 : 1 411.2 kb/s
 Channel(s)                               : 2 channels
 Sampling rate                            : 44.1 kHz
 Bit depth                                : 16 bits
 Stream size                              : 21.2 MiB (100%)

Mixing issue - need headroom in the encoding

Ultimately, I realized that my distortion was due to feeding maximally loud signals into the sox mixer. The issue is that the mixer is constrained to the number of bits I configured it with. Therefore if I mix 4 channels of 16-bit signals at max volume, their waves individually peak at the full 16-bit height. When mixed, the even higher combined height cannot be represented, so the samples clip.

I can, however, reduce the volume before mixing. But this comes with a tradeoff too - at lower volumes, there is less information, so some of the richness of the sound is lost. This is heard as fuzz and weird distortion.

Therefore the solution is to ALSO increase the bit depth of the input signals to 24 bits, at least for the duration of the audio processing (see the sketch below).
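A sketch of that with sox (a.wav through d.wav are placeholder inputs; -v scales each input before mixing, -m mixes them, and -b 24 gives the intermediate file the extra headroom):

# mix four scaled-down inputs into a 24-bit intermediate file
$ sox -m -v 0.25 a.wav -v 0.25 b.wav -v 0.25 c.wav -v 0.25 d.wav -b 24 mixed-24bit.wav
# then come back down to 16 bits with dither for the final output
$ sox mixed-24bit.wav -b 16 final.wav dither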

Resources