Spectrogram peak detection with scipy

Introduction
Creating a spectrogram from an audio file
The peak detection algorithm
Some more examples

Figure 1: Spectrogram peak detection

Introduction

In this post I share a peak detection algorithm I came up with whilst working on the Birdclef 2023 Kaggle competition. The dataset is 16,941 recordings of Eastern African bird species, which I converted to spectrograms using a Fourier transform.

A spectrogram is a visual representation of the spectral content of a signal, typically represented as a graph of time versus frequency ¹. It is commonly used in signal processing, acoustics, and speech analysis to highlight patterns and structures in a sound wave.

Peak detection involves identifying the points of maximum amplitude in a signal or spectrum. This is a common technique used in signal processing and analysis, particularly in music and audio applications where identifying peaks can help to identify characteristics such as beats and harmonics. The process usually involves finding the local maxima, filtering out noise and selecting the most prominent peaks based on criteria such as their amplitude and frequency.

The imports for this project are:

import numpy as np
import matplotlib.pyplot as plt

from statistics import NormalDist

from sklearn.decomposition import PCA

from librosa import load as load_audio
from librosa import stft, amplitude_to_db
from librosa.display import specshow

from scipy.ndimage import label as label_features
from scipy.ndimage import maximum_position as extract_region_maximums

Creating a spectrogram from an audio file

This is quite straight forward in Python, all we need is librosa. The load function will return the samples as a numpy array. Then, stft will do a short-time Fourier transform ² and amplitude_to_db will produce the spectrogram, which is also a numpy array.

audio_file_path = "file path ..."
desired_sample_rate = 32000

# load samples into ndarray
samples, _actual_sample_rate = load_audio(
    path=audio_file_path,
    sr=desired_sample_rate
)

# do transformations
# fourier transform -> spectrogram
x = samples
x = stft(x)
x = amplitude_to_db(abs(x))

# plot
plt.figure(figsize=(14, 5))
specshow(x, sr=desired_sample_rate, x_axis="time", y_axis="hz")
plt.colorbar()

The specshow helper function will visualise the spectrogram as a heat map. The x axis is time, y is frequency, and z is intensity in decibels. It is important that the sample rate argument sr is correct, otherwise the time axis will not line-up with the original recording. Also, notice that the y axis starts at the bottom left, whereas the ndarray representation starts at the top left. The specshow function is flipping the array so that the visualisation is not upside down.

Figure 2: Spectrogram visualisation

I also experimented with principal component analysis (PCA), projecting the frequencies onto 10 principal components. Recall, PCA is a statistical technique used to reduce the complexity of high-dimensional datasets whilst retaining most of their variation ³. It transforms data into a new set of uncorrelated variables called principal components, making it easier to identify underlying patterns and structures.

Note: this is not part of the peak detection algorithm.

z = PCA(n_components=10).fit_transform(x)

Figure 3: Spectrogram visualisation after PCA

The above plot is clearly a lower resolution representation of the previous - you can clearly see the 10 components in the y axis plus the frequency spikes. However, all of the noise has been removed.

The following spectrograms are cropped so that the peaks are easier to see. They are also plotted with pyplot's matshow rather than specshow, therefore they are upside down and have a different colour scheme. However, the data is the same.

fig, ax = plt.subplots(figsize=(16, 8))
ax.matshow(x1[:500, :600])

Figure 4: Spectrogram visualisation (cropped)

The peak detection algorithm

The goal of the algorithm is to identify areas of high intensity (a peak) in the spectrogram. A peak as a high value and is surrounded by low value neighbours. The algorithm has a single threshold parameter which controls its sensitivity. In this example, I set the threshold to 2.75.

peak = (time, frequency, intensity)

The steps are as follows:

filter out zero values
calculate mean and standard deviation of intensities
assign a z score to each intensity in the spectrogram matrix
create a label matrix from the frequency intensities that are above threshold parameter
for each isolated region in the label matrix, extract the max intensity value from the original spectrogram
create a list of peak values and their locations

The implementation:

spectrogram = x

# number of standard deviations away from the mean
threshold = 2.75

# remove zero values
flattened = np.matrix.flatten(spectrogram)
filtered = flattened[flattened > np.min(flattened)]

# create a normal distribution from frequency intensities
# then map a zscore onto each intensity value
ndist = NormalDist(np.mean(filtered), np.std(filtered))
zscore = np.vectorize(lambda x: ndist.zscore(x))
zscore_matrix = zscore(spectrogram)

# create label matrix from frequency intensities that are
# above threshold
mask_matrix = zscore_matrix > threshold
labelled_matrix, num_regions = label_features(mask_matrix)
label_indices = np.arange(num_regions) + 1

# for each isolated region in the mask, identify the maximum
# value, then extract it position
peak_positions = extract_region_maximums(
    zscore_matrix, labelled_matrix, label_indices)

# finally, create list of peaks (time, frequency, intensity)
peaks = [(x, y, spectrogram[y, x]) for y, x in peak_positions]

The following visualisation shows the previous spectrogram plus its peaks overlaid as a scatter plot.

fig, ax = plt.subplots(figsize=(16, 8))

ax.scatter(
    x=[p[0] for p in peaks],
    y=[p[1] for p in peaks],
    s=1.5,
    color="red"
)

ax.matshow(x[:500, :600])

Figure 5: Spectrogram with peaks overlay

The points are not perfect - especially if there is a lot of background noise - some threshold fine-tuning is necessary. However, I have found 2.75 standard deviations to be good enough for most audio files in my dataset.