I want to analyze my music collection, which is all CD audio data (stereo 16-bit PCM, 44.1 kHz). What I want to do is programmatically determine whether the bass is mixed (panned) to only one channel. Ideally, I'd like to be able to run a program like this:
mono-bass-checker music.wav
and have it output something like "bass is not panned" or "bass is mixed primarily to channel 0".
I have a rudimentary start on this, which in pseudocode looks like this:
binsize = 2^N  # window (FFT block) length in samples, a power of 2
while not end of audio file:
    read binsize samples from audio file
    de-interleave channels into two separate arrays
    chan0_fft_result = fft on channel 0 array
    chan1_fft_result = fft on channel 1 array
    for each index i in (number of items in chanX_fft_result / 2):
        frequency_bin = i * 44100 / binsize
        # define bass as below 150 Hz (and above 30 Hz, since I can't hear anything lower)
        if frequency_bin > 150 or frequency_bin < 30: ignore
        magnitude = sqrt(chanX_fft_result[i].real^2 + chanX_fft_result[i].imag^2)
I'm not really sure where to go from here. Some concepts I've read about that are still too nebulous to me:
- Window function. I'm currently not using one, just naively reading from the audio file: samples 0 to 1023, 1024 to 2047, etc. (for example with binsize=1024). Is this something that would be useful to me? And if so, how does it get integrated into the program?
- Normalizing and/or scaling of the magnitude. Lots of people do this for the purpose of making pretty spectrograms, but do I need to do that in my case? I understand human hearing roughly works on a log scale, so perhaps I need to massage the magnitude result in some way to filter out what I wouldn't be able to hear anyway? Is something like A-weighting relevant here?
- binsize. I understand that a bigger binsize gets me more frequency bins... but I can't decide if that helps or hurts in this case.
I can generate a "mono bass song" using sox like this:
sox -t null /dev/null --encoding signed-integer --bits 16 --rate 44100 --channels 1 sine40hz_mono.wav synth 5.0 sine 40.0
sox -t null /dev/null --encoding signed-integer --bits 16 --rate 44100 --channels 1 sine329hz_mono.wav synth 5.0 sine 329.6
sox -M sine40hz_mono.wav sine329hz_mono.wav sine_merged.wav
In the resulting "sine_merged.wav" file, one channel is pure bass (40 Hz) and one is non-bass (329.6 Hz). When I compute the magnitude of bass frequencies for each channel of that file, I do see a significant difference. But what's curious is that the 329.6 Hz channel has non-zero sub-150 Hz magnitude. I would expect it to be zero.
Even then, with this trivial sox-generated file, I don't really know how to interpret the data I'm generating. And obviously, I don't know how I'd generalize to my actual music collection.
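One likely explanation for the non-zero bass reading in the 329.6 Hz channel is spectral leakage: a block of 2^N samples at 44.1 kHz does not hold a whole number of 329.6 Hz cycles, so cutting the block out with no window smears some of the tone's energy across all bins. The sketch below, using a hypothetical helper `bass_leakage_fraction` and a direct DFT rather than fftw3, measures roughly what fraction of a pure sine's energy ends up in the 30-150 Hz bins, with and without a Hann window.

```c
#include <math.h>
#include <stdlib.h>

#define SAMPLE_RATE 44100.0
#define TWO_PI 6.283185307179586

/* Approximate fraction of a sine's energy that lands in the 30-150 Hz
   bins of an n-point DFT. use_hann != 0 applies a periodic Hann window
   first. Returns -1.0 if allocation fails. */
double bass_leakage_fraction(double sine_hz, size_t n, int use_hann)
{
    double *x = malloc(n * sizeof *x);
    if (!x) return -1.0;

    /* Synthesize one (optionally windowed) block and total its energy. */
    double sum_sq = 0.0;
    for (size_t t = 0; t < n; t++) {
        double w = use_hann
            ? 0.5 * (1.0 - cos(TWO_PI * (double)t / (double)n))
            : 1.0;
        x[t] = w * sin(TWO_PI * sine_hz * (double)t / SAMPLE_RATE);
        sum_sq += x[t] * x[t];
    }

    /* Sum squared DFT magnitudes over the bass bins only. */
    double band = 0.0;
    for (size_t k = 1; k < n / 2; k++) {
        double freq = (double)k * SAMPLE_RATE / (double)n;
        if (freq < 30.0) continue;
        if (freq > 150.0) break;
        double re = 0.0, im = 0.0;
        for (size_t t = 0; t < n; t++) {
            double phase = TWO_PI * (double)k * (double)t / (double)n;
            re += x[t] * cos(phase);
            im -= x[t] * sin(phase);
        }
        band += re * re + im * im;
    }
    free(x);

    /* By Parseval's theorem the positive-frequency bins hold about
       n * sum_sq / 2 in total, so this is an approximate fraction. */
    return band / (0.5 * (double)n * sum_sq);
}
```

Calling this for 329.6 Hz with n=4096 should give a small but clearly non-zero fraction without a window and a far smaller one with the Hann window, while a sine that sits exactly on a bin centre (for example 30 * 44100 / 4096 Hz) should leak essentially nothing even unwindowed.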
FWIW, I'm trying to do this with libsndfile and fftw3 in C, based on help from these other posts: