NatureLM-audio.
The first audio-language foundation model designed for bioacoustics. Audio in, natural language out. The workflow shift you can feel.
What it is
NatureLM-audio is a multimodal foundation model from Earth Species Project, published at ICLR 2025 (Robinson, Miron, Hagiwara, Pietquin et al., arXiv:2411.07186). The architecture pairs an audio encoder with a language model: audio frames flow into a shared representation, and language flows out as captions, classifications, or descriptions.
Weights are publicly available on Hugging Face at EarthSpeciesProject/NatureLM-audio. The model card and the BEANS-Zero benchmark it was evaluated against are linked from the repo.
What it does well
- Zero-shot species detection. Ask "is there a crow in this clip?" without training a classifier.
- Zero-shot behavioral classification. Ask "is this a territorial call?" using natural-language prompts.
- Captioning. Generate a one-sentence description of what's in the audio.
- Embedding extraction. Pull the encoder representations for downstream similarity, clustering, retrieval.
- Cross-species transfer. Methods that work on crow audio frequently transfer to ravens, jays, magpies — sometimes with no retraining at all.
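The embedding-extraction use case above can be sketched with plain NumPy. The vectors here are random stand-ins; in practice they would come from the NatureLM-audio encoder, whose exact extraction API is not shown here. The similarity-search logic itself is generic.

```python
# Minimal sketch: nearest-neighbor retrieval over clip embeddings.
# Embeddings are random stand-ins for real encoder outputs.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_clips(query: np.ndarray, library: np.ndarray, k: int = 3) -> list:
    """Indices of the k library clips most similar to the query."""
    scores = [cosine_similarity(query, emb) for emb in library]
    return sorted(range(len(library)), key=lambda i: -scores[i])[:k]

rng = np.random.default_rng(0)
library = rng.normal(size=(100, 512))              # 100 clips, 512-dim embeddings
query = library[42] + 0.01 * rng.normal(size=512)  # near-duplicate of clip 42
print(nearest_clips(query, library))               # clip 42 ranks first
```

The same embeddings feed clustering and cross-species transfer: a classifier or cluster structure fit on crow embeddings can be applied unchanged to raven or jay embeddings.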
What it doesn't do
NatureLM-audio is a captioning and classification model. It does not generate audio. It does not perform real-time synthesis. It does not produce a "crow dictionary" mapping calls to human-language glosses with semantic precision — what it captions reflects training priors, not crow intent. Treat captions as starting points for investigation, not as translations.
How it changed the workflow
Before NatureLM-audio, building a useful crow-vocalization pipeline meant training a custom classifier for each question, a multi-week process every time. After: prompt engineering. You ask the model questions in English; it answers; you validate against held-out ground truth. A single afternoon of iteration replaces weeks of model building.
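The validate-against-ground-truth step can be sketched in a few lines. The model answers below are hard-coded stand-ins for illustration; the point is the shape of the loop: normalize free-text answers to booleans, then score against held-out labels.

```python
# Toy sketch: scoring prompted yes/no answers against held-out labels.
def normalize(answer: str) -> bool:
    """Map a free-text model answer to a boolean detection."""
    return answer.strip().lower().startswith("yes")

def accuracy(answers: list, labels: list) -> float:
    """Fraction of clips where the normalized answer matches the label."""
    hits = sum(normalize(a) == y for a, y in zip(answers, labels))
    return hits / len(labels)

answers = ["Yes, a crow is present.", "No.", "Yes.", "No bird audible."]
labels  = [True, False, False, True]   # held-out ground truth
print(accuracy(answers, labels))       # → 0.5
```

In a real pipeline the answer normalization deserves more care than a `startswith` check, but the afternoon-of-iteration loop is exactly this: prompt, collect answers, score, revise the prompt.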
That shift is the workflow story of bioacoustics in 2025–26. Custom classifiers still happen — for high-stakes, high-volume tasks where every percentage point of accuracy matters. But the median question is now a prompt.
We are not trying to translate animal communication. We are trying to build the most useful tool we can, and use it carefully, and publish what we find.