Stage 03

Preprocess — clean what you got, conservatively.

Self-supervised models were trained on dirty audio. Heavy denoising removes the grain they learned to read. Light, reversible, transparent — the rule.

The minimum recipe

Convert to mono if not already.
Bandpass 200 Hz to 8 kHz to drop low rumble and high-frequency content above the band of interest.
Peak-normalize to about −3 dBFS (don't loudness-normalize).
If the noise floor is audibly present, apply light non-stationary denoise.
Save the original. Save your processed. Save the processing parameters.
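
The last step of the recipe, saving the parameters, can be as small as a JSON file written next to the processed clip. The sketch below is one way to do it, not a prescribed format: the function name `write_sidecar`, the field names, and the choice of a SHA-256 hash to tie the record to the untouched original are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical parameter record mirroring the recipe above.
PARAMS = {
    "mono": True,
    "bandpass_hz": [200, 8000],
    "peak_dbfs": -3.0,
    "denoise": {"stationary": False, "prop_decrease": 0.6},
}

def write_sidecar(original, processed, params):
    """Write <processed>.json describing how the processed clip was made."""
    original, processed = Path(original), Path(processed)
    record = {
        "original": original.name,
        # Hash of the untouched source, so the pairing can be verified later.
        "original_sha256": hashlib.sha256(original.read_bytes()).hexdigest(),
        "processed": processed.name,
        "params": params,
    }
    sidecar = processed.with_suffix(".json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

With the record on disk, any processed clip can be regenerated from its original, or reprocessed with different settings, without guessing what was done to it.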

Why peak, not loudness

Loudness normalization rescales each clip toward a target LUFS, which means a quiet companion call and a loud territorial caw arrive at the model at the same perceived level. Context-relative loudness encodes urgency and proximity — both signal-bearing. Peak-normalizing keeps the relationship intact while making sure no clip is too quiet to embed cleanly.
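
A small numeric sketch of the peak approach, assuming the quiet and loud calls sit in the same recording: one gain is derived from the largest absolute sample and applied to everything, so the level ratio between the two events is unchanged. The helper name `peak_gain` and the synthetic tones are illustrative assumptions, not part of any library.

```python
import numpy as np

def peak_gain(y, target_dbfs=-3.0):
    """Linear gain that moves the largest absolute sample to target_dbfs."""
    peak = np.max(np.abs(y))
    if peak == 0:
        return 1.0  # silent clip: nothing to scale
    return 10 ** (target_dbfs / 20) / peak

sr = 48_000
t = np.arange(sr) / sr
quiet = 0.05 * np.sin(2 * np.pi * 1000 * t)  # stand-in for a quiet call
loud = 0.5 * np.sin(2 * np.pi * 1500 * t)    # stand-in for a loud caw
recording = np.concatenate([quiet, loud])

out = peak_gain(recording) * recording

# The loud event now peaks at -3 dBFS; the quiet one is still 20 dB below it.
ratio_before = np.max(np.abs(loud)) / np.max(np.abs(quiet))
ratio_after = np.max(np.abs(out[sr:])) / np.max(np.abs(out[:sr]))
```

Here −3 dBFS corresponds to a linear peak of 10^(−3/20), roughly 0.708.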

Denoise carefully or not at all

The noisereduce library's stationary mode assumes a uniform noise profile and will eat broadband transients like wing-rustle that may matter for context. Non-stationary mode with a low prop_decrease (around 0.6) is safer. If you can hear the call clearly without denoising, ship the call without denoising.

Resampling decisions

Most encoders want 16 or 32 kHz. Resample at embedding time, not at preprocessing time — keep the 48 kHz source and let the encoder do its own resample with its preferred filter. Otherwise your data has a resample fingerprint baked in.
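
As a rough sketch of what "resample at embedding time" looks like in a loader: the stored audio stays at 48 kHz and each encoder gets its own rate on the fly. The function name `resample_for_encoder` is hypothetical, and SciPy's polyphase resampler is only a stand-in here; where an encoder ships its own resampling step, that one should be used instead.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_for_encoder(y, sr, target_sr):
    """Resample in memory just before embedding; the file on disk is untouched."""
    if sr == target_sr:
        return y
    g = gcd(sr, target_sr)
    # Rational-rate polyphase resampling, e.g. 48 kHz -> 16 kHz is up=1, down=3.
    return resample_poly(y, target_sr // g, sr // g)

sr = 48_000
clip = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)  # 1 s stand-in clip
clip_16k = resample_for_encoder(clip, sr, 16_000)
clip_32k = resample_for_encoder(clip, sr, 32_000)
```

Because nothing resampled is ever written back, switching to an encoder with a different input rate only changes the `target_sr` argument.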

Conservative cleanup (Python): librosa + noisereduce

import librosa
import numpy as np
import noisereduce as nr
import soundfile as sf  # librosa.output.write_wav was removed in librosa 0.8

def preprocess(path, out):
    # load as mono at 48 kHz (librosa resamples if the file uses another rate)
    y, sr = librosa.load(path, sr=48_000, mono=True)
    # bandpass 200 Hz to 8 kHz via FFT mask
    Y = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), 1 / sr)
    Y[(freqs < 200) | (freqs > 8_000)] = 0
    # pass n so odd-length clips come back at their original length
    y = np.fft.irfft(Y, n=len(y))
    # light non-stationary denoise
    y = nr.reduce_noise(y=y, sr=sr, stationary=False, prop_decrease=0.6)
    # peak-normalize to -3 dBFS: 10 ** (-3 / 20) ≈ 0.708
    y = librosa.util.normalize(y) * 10 ** (-3 / 20)
    sf.write(out, y, sr)
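
A brick-wall FFT mask like the one above is simple, but zeroing bins abruptly can cause ringing around sharp transients. One gentler alternative, offered as a sketch and not as the recipe's required method, is a zero-phase Butterworth bandpass from SciPy. The function name `bandpass` and the filter order of 4 are assumptions; the 200 Hz and 8 kHz corners come from the recipe.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(y, sr, low_hz=200.0, high_hz=8_000.0, order=4):
    """Zero-phase Butterworth bandpass; output has the same length as y."""
    # Second-order sections are numerically safer than (b, a) coefficients.
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    # Forward-backward filtering cancels the phase shift of a single pass.
    return sosfiltfilt(sos, y)

def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))

sr = 48_000
t = np.arange(sr) / sr
rumble = np.sin(2 * np.pi * 50 * t)    # below the passband
call = np.sin(2 * np.pi * 2_000 * t)   # inside the passband

rumble_out = bandpass(rumble, sr)
call_out = bandpass(call, sr)
```

The rolloff is gradual, so content just outside 200 Hz and 8 kHz is attenuated but not removed outright, which fits the "light" part of the rule.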