NatureLM-audio.
The first audio-language foundation model designed for bioacoustics. Audio in, natural language out. The workflow shift you can feel.
What it is
NatureLM-audio is a multimodal foundation model from Earth Species Project, published at ICLR 2025 (Robinson, Miron, Hagiwara, Pietquin et al., arXiv:2411.07186). The architecture pairs an audio encoder with a language model: audio frames flow into a shared representation, and language flows out as captions, classifications, or descriptions.
Weights are publicly available on Hugging Face at EarthSpeciesProject/NatureLM-audio. The model card and the BEANS-Zero benchmark it was evaluated against are linked from the repo.
What it does well
- Zero-shot species detection. Ask "is there a crow in this clip?" without training a classifier.
- Zero-shot behavioral classification. Ask "is this a territorial call?" using natural-language prompts.
- Captioning. Generate a one-sentence description of what's in the audio.
- Embedding extraction. Pull the encoder representations for downstream similarity, clustering, retrieval.
- Cross-species transfer. Methods that work on crow audio frequently transfer to ravens, jays, magpies — sometimes with no retraining at all.
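The embedding-extraction use case above can be sketched with plain NumPy. The vectors here are random stand-ins; in practice they would come from the NatureLM-audio encoder, whose exact extraction API is not shown here. The similarity-search logic itself is generic.

```python
# Minimal sketch: nearest-neighbor retrieval over clip embeddings.
# Embeddings are random stand-ins for real encoder outputs.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_clips(query: np.ndarray, library: np.ndarray, k: int = 3) -> list:
    """Indices of the k library clips most similar to the query."""
    scores = [cosine_similarity(query, emb) for emb in library]
    return sorted(range(len(library)), key=lambda i: -scores[i])[:k]

rng = np.random.default_rng(0)
library = rng.normal(size=(100, 512))              # 100 clips, 512-dim embeddings
query = library[42] + 0.01 * rng.normal(size=512)  # near-duplicate of clip 42
print(nearest_clips(query, library))               # clip 42 ranks first
```

The same embeddings feed clustering and cross-species transfer: a classifier or cluster structure fit on crow embeddings can be applied unchanged to raven or jay embeddings.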
What it doesn't do
NatureLM-audio is a captioning and classification model. It does not generate audio. It does not perform real-time synthesis. It does not produce a "crow dictionary" mapping calls to human-language glosses with semantic precision — what it captions reflects training priors, not crow intent. Treat captions as starting points for investigation, not as translations.
How it changed the workflow
Before NatureLM-audio, building a useful crow-vocalization pipeline meant training a custom classifier for each question, a multi-week process every time. After: prompt engineering. You ask the model questions in English; it answers; you validate against held-out ground truth. A single afternoon of iteration replaces weeks of model building.
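The validate-against-ground-truth step can be sketched in a few lines. The model answers below are hard-coded stand-ins for illustration; the point is the shape of the loop: normalize free-text answers to booleans, then score against held-out labels.

```python
# Toy sketch: scoring prompted yes/no answers against held-out labels.
def normalize(answer: str) -> bool:
    """Map a free-text model answer to a boolean detection."""
    return answer.strip().lower().startswith("yes")

def accuracy(answers: list, labels: list) -> float:
    """Fraction of clips where the normalized answer matches the label."""
    hits = sum(normalize(a) == y for a, y in zip(answers, labels))
    return hits / len(labels)

answers = ["Yes, a crow is present.", "No.", "Yes.", "No bird audible."]
labels  = [True, False, False, True]   # held-out ground truth
print(accuracy(answers, labels))       # → 0.5
```

In a real pipeline the answer normalization deserves more care than a `startswith` check, but the afternoon-of-iteration loop is exactly this: prompt, collect answers, score, revise the prompt.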
That shift is the workflow story of bioacoustics in 2025–26. Custom classifiers still happen — for high-stakes, high-volume tasks where every percentage point of accuracy matters. But the median question is now a prompt.
We are not trying to translate animal communication. We are trying to build the most useful tool we can, and use it carefully, and publish what we find.