CrowLingo

Stage 04

Embed — encoder choice is the third axis of every analysis.

The 1,024 numbers a model produces for your clip aren't a property of the clip. They're a property of the clip and the encoder. Treat both as variables.

The four encoders worth knowing

For crow audio specifically, four pretrained encoders cover the interesting cases:

  • BirdNET embeddings — fast, mature, biased toward species-discriminative features. Excellent for "is this a crow at all," less ideal for fine-grained within-crow structure.
  • Perch (Google) — broader audio coverage, finer within-species detail. The default for graded-call work in 2024–25.
  • CLAP — joint audio-text. Useful if you want to query a corpus with natural-language prompts (“low territorial caw with rasp”) but expect rougher within-call geometry than Perch.
  • NatureLM-audio — Earth Species Project's audio-language foundation model for bioacoustics, ICLR 2025. SOTA on BEANS-Zero; supports zero-shot captioning. Heavier compute. Weights: EarthSpeciesProject/NatureLM-audio.

What it means to “live in a different space”

Embeddings from different encoders are not comparable. The cosine distance between two BirdNET vectors is meaningful; between a BirdNET vector and a NatureLM-audio vector, it is noise. Even fine-tuning the same base model on different downstream tasks shifts the geometry.
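In code, the rule is easy to enforce by keeping the encoder name next to every vector and checking it before any comparison. A sketch (`safe_cosine` and the record layout are illustrative, not a library API):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity. Only meaningful when a and b come from the
    same encoder, same version, same weights."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def safe_cosine(rec_a: dict, rec_b: dict) -> float:
    """Refuse to compare vectors that live in different spaces."""
    if rec_a["encoder"] != rec_b["encoder"]:
        raise ValueError(
            f"cross-encoder comparison: {rec_a['encoder']!r} vs {rec_b['encoder']!r}"
        )
    return cosine_sim(rec_a["vec"], rec_b["vec"])
```

A raised error here is cheap; a silently meaningless similarity score propagating into an analysis is not.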

The practical rule: pick one encoder per project, document its version and weights hash, and embed everything with it. If you must mix encoders (for example, to backfill old recordings), keep them in parallel namespaces and cross-validate with a small set of double-embedded clips.
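For those double-embedded clips, one hedged way to cross-validate is to check whether the two spaces agree on relative distances, even though the vectors themselves can't be compared. A sketch (`geometry_agreement` is an illustrative helper, not an established metric):

```python
import numpy as np

def condensed_distances(X: np.ndarray) -> np.ndarray:
    """Upper-triangle pairwise Euclidean distances within one encoder's space."""
    n = len(X)
    return np.array([np.linalg.norm(X[i] - X[j])
                     for i in range(n) for j in range(i + 1, n)])

def geometry_agreement(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Pearson correlation of the two encoders' distance structure over the
    same clips. Near 1.0: the spaces order the corpus similarly; near 0.0:
    conclusions will not transfer between the namespaces."""
    return float(np.corrcoef(condensed_distances(emb_a),
                             condensed_distances(emb_b))[0, 1])
```

This compares distance *structure* rather than vectors, which is the only comparison the two namespaces legitimately share.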

Pretrained representations are the new ground truth in bioacoustics. Their biases are now everyone's biases.
Stowell (2022) · Computational bioacoustics with deep learning
NatureLM-audio embedding

from naturelm_audio import NatureLM

model = NatureLM.from_pretrained("EarthSpeciesProject/NatureLM-audio")

# single clip
emb = model.embed("./crow_clip_001.wav")   # shape (1024,)

# batch
batch_emb = model.embed_many([
    "./clip_001.wav", "./clip_002.wav", "./clip_003.wav",
])  # shape (3, 1024)
Persist with provenance

import pandas as pd, hashlib

# paths and embeddings come from the embedding step above
df = pd.DataFrame({
    "path": paths,
    "embedding": [e.tolist() for e in embeddings],
    "encoder": "naturelm-audio",
    "encoder_version": "v0.3.1",
    "weights_sha256": hashlib.sha256(model.weights_bytes).hexdigest(),
    "sample_rate": 48_000,
    "timestamp_utc": pd.Timestamp.now(tz="UTC").isoformat(),
})
df.to_parquet("embeddings.parquet")
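Reading the table back, the same namespace discipline applies: filter on the encoder column before stacking vectors into a matrix. A sketch (`embeddings_matrix` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def embeddings_matrix(df: pd.DataFrame, encoder: str):
    """Return (paths, X) for exactly one encoder namespace.
    Raises instead of silently stacking vectors from mixed spaces."""
    sub = df[df["encoder"] == encoder]
    if sub.empty:
        raise ValueError(f"no embeddings for encoder {encoder!r}")
    X = np.vstack(sub["embedding"].to_numpy())
    return sub["path"].tolist(), X
```

Any clustering or nearest-neighbour step then starts from a matrix that is guaranteed to live in a single space.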