01
ICASSP 2022 · Google Research

Improving Bird Classification with Unsupervised Sound Separation

Presented by
Amitesh Badkul
Paper authors: Tom Denton, Scott Wisdom, John R. Hershey
Red-vented Bulbul perched on a branch
Photo: Amitesh Badkul.
02
Why this matters

Bird sounds carry ecological context

Birdsong Apps

Birding apps can turn a walk into a species list in seconds.

Ecology indicator

Bird presence and absence reveal habitat quality, biodiversity, and ecosystem stress.

Classification failure

Real recordings contain overlap, wind, insects, and weak labels, so ordinary models miss faint birds.

03
Problem

Why ordinary bird classifiers struggle

Loud bird bias. Many training clips carry one main species label even if other birds are present.
Dawn chorus overlap. Several species sing together, so the input is a dense acoustic mixture.
Weak background supervision. Quiet background birds are often unlabeled, so the model learns to ignore them.

Listen first

Before the method, this is the intuition: field recordings are dense, noisy, and hard to separate by ear.

Demo media from Google Research, stored locally as bird-demo.mp4.
04
Training recipe

Data, augmentation, and evaluation setup

Training data (weakly labeled)

  • Xeno-Canto: weakly labeled field recordings.
  • Macaulay Library: preferred recordings in the combined pool.
  • Selection: high-rated files and background labels preferred.
  • Size: up to 250 recordings per target species, plus 250 out-of-set recordings.

Augmentations

  • Random 5-second crop (time shift) within each 6-second example.
  • Random gain: peak normalized to a random value between 0.05 and 0.75.
  • Example mixing with 50% probability.
  • Noise mixing in 75% of examples, 0-40 dB SNR.
  • Random low-pass filtering in mel space.
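The waveform augmentations above (everything except the mel-space low-pass filter) can be sketched as one function. This is a minimal illustration, not the paper's code; the sample rate and the exact crop/mixing details are assumptions.

```python
import numpy as np

SAMPLE_RATE = 32000  # assumed sample rate; the slides do not state one


def augment(example: np.ndarray, noise: np.ndarray, other: np.ndarray,
            rng: np.random.Generator) -> np.ndarray:
    """Apply the slide's waveform augmentations to a 6 s clip, returning 5 s."""
    # Random 5-second crop from the 6-second example (random time shift).
    crop_len = 5 * SAMPLE_RATE
    start = rng.integers(0, len(example) - crop_len + 1)
    x = example[start:start + crop_len].astype(np.float32)

    # Random gain: peak-normalize to a target peak in [0.05, 0.75].
    peak = np.max(np.abs(x)) + 1e-8
    x = x * (rng.uniform(0.05, 0.75) / peak)

    # Mix with another training example with 50% probability.
    if rng.random() < 0.5:
        x = x + other[:crop_len]

    # Add background noise in 75% of examples at 0-40 dB SNR.
    if rng.random() < 0.75:
        snr_db = rng.uniform(0.0, 40.0)
        sig_pow = np.mean(x ** 2)
        noise_pow = np.mean(noise[:crop_len] ** 2) + 1e-8
        scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
        x = x + scale * noise[:crop_len]
    return x
```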

Evaluation sets (Multi-label)

  • Sapsucker Woods: 40k 5-second segments, 70 species.
  • High Sierras: 4,928 labeled 5-second segments, 18 species.
  • Caples: 2,944 labeled 5-second segments, 42 species.

Metrics

  • CMAP: class-averaged mean average precision.
  • lwlrap: label-weighted label-ranking average precision.
  • d′: sensitivity index.
  • Top-1 precision: precision of the single highest-scoring prediction.
Interpretation: the training recipe tries to reduce shortcut learning from microphone conditions, distance, and background noise.
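Of the metrics above, CMAP is the least standard-sounding; under its usual definition it is just average precision computed per class and then averaged over classes that have at least one positive. A small NumPy sketch (not the paper's implementation):

```python
import numpy as np


def average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Average precision for one class: mean precision at each positive hit."""
    order = np.argsort(-y_score)          # rank examples by descending score
    hits = y_true[order]
    cum_pos = np.cumsum(hits)
    precision_at_k = cum_pos / (np.arange(len(hits)) + 1)
    return float(np.sum(precision_at_k * hits) / max(hits.sum(), 1))


def cmap(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Class-averaged mean AP over classes with at least one positive label."""
    cols = [c for c in range(y_true.shape[1]) if y_true[:, c].any()]
    return float(np.mean([average_precision(y_true[:, c], y_score[:, c])
                          for c in cols]))
```

A perfect ranking (every positive scored above every negative in its class) yields a CMAP of 1.0.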
05
Method overview

The pipeline in four moves

1

Activity detection

Pick bird-rich 6-second windows from weakly labeled recordings.

2

MixIT separation

Train an unsupervised separator on mixtures of mixtures.

3

Channel classification

Run the classifier on each separated output and on the original clip.

4

Taxonomy support

Use species, genus, family, and order heads to stabilize training.
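One way to read the taxonomy-support step is as a multi-head loss: each head (species, genus, family, order) gets its own softmax cross-entropy, and the coarser heads provide gradient signal even when the species label is noisy. A hedged sketch; the slides only name the four heads, so the equal weighting here is an assumption:

```python
import numpy as np

LEVELS = ("species", "genus", "family", "order")


def taxonomy_loss(logits: dict, targets: dict) -> float:
    """Sum of softmax cross-entropies over the four taxonomic heads.

    `logits[level]` is a 1-D logit vector for that head, `targets[level]`
    the integer class index (assumed interface, for illustration only).
    """
    total = 0.0
    for level in LEVELS:
        z = logits[level]
        p = np.exp(z - z.max())       # stable softmax
        p /= p.sum()
        total += -np.log(p[targets[level]] + 1e-12)
    return total
```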

06
Step 1

Activity detection chooses likely bird-rich windows

Activity detector workflow diagram
Input. A weakly labeled recording may be long and may contain silence or the wrong source.
Compute energy. Convert the recording to a log mel spectrogram and measure frame energy.
Find peaks. Use a wavelet peak detector to identify active frames.
Crop windows. Extract 6-second regions around the strongest peaks, up to five per recording.
Benefit
Raises the chance that training clips actually contain bird vocalization.
Limitation
High energy can still come from wind, insects, other birds, or speech.
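The workflow above can be sketched with a simple greedy selector standing in for the wavelet peak detector: repeatedly take the highest-energy frame, crop a 6-second window around it, and suppress that region so later windows don't overlap. This is an illustration of the idea, not the paper's detector:

```python
import numpy as np


def pick_windows(energy_db: np.ndarray, frames_per_sec: int,
                 win_sec: int = 6, max_windows: int = 5):
    """Select up to `max_windows` non-overlapping windows around energy peaks.

    `energy_db` is per-frame energy from a log-mel spectrogram; returns a
    list of (start_frame, end_frame) pairs of length win_sec each.
    """
    half = (win_sec * frames_per_sec) // 2
    e = energy_db.astype(float).copy()
    windows = []
    for _ in range(max_windows):
        peak = int(np.argmax(e))
        if not np.isfinite(e[peak]):      # everything already suppressed
            break
        # Center the window on the peak, clipped to the recording bounds.
        start = max(0, min(peak - half, len(e) - 2 * half))
        windows.append((start, start + 2 * half))
        e[start:start + 2 * half] = -np.inf   # suppress this region
    return windows
```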
07
Step 2

MixIT learns separation without clean source labels

MixIT mixture of mixtures separation training diagram
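The idea in the diagram: mix two already-mixed recordings, feed the mixture of mixtures to the separator, and score it by the best way to assign its output sources back to the two input mixtures. No clean source labels are needed. A minimal NumPy sketch of the objective, using plain MSE where the paper uses an SNR-based loss, and brute-force enumeration of assignments:

```python
import itertools
import numpy as np


def mixit_loss(mix1: np.ndarray, mix2: np.ndarray,
               est_sources: np.ndarray) -> float:
    """MixIT objective for M estimated sources of shape (M, T).

    The separator receives mix1 + mix2 and outputs est_sources; the loss is
    the best reconstruction of (mix1, mix2) over all 2^M binary assignments
    of sources to the two input mixtures.
    """
    m = est_sources.shape[0]
    best = np.inf
    for mask in itertools.product([0, 1], repeat=m):
        a = np.array(mask, dtype=float)[:, None]
        r1 = (est_sources * (1 - a)).sum(axis=0)  # sources assigned to mix1
        r2 = (est_sources * a).sum(axis=0)        # sources assigned to mix2
        loss = np.mean((r1 - mix1) ** 2) + np.mean((r2 - mix2) ** 2)
        best = min(best, loss)
    return best
```

If the separator recovers the true sources exactly, some assignment reconstructs both mixtures perfectly and the loss is zero.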
08
Step 3

Classify the separated channels and the original clip

Classifier combines original and separated channels into species scores
09
Results

Classification results

Method             CMAP             lwlrap           d′               Top-1

Sapsucker Woods · SSW · New York
Mix only           0.304            0.431            1.117            0.398
Separation only    0.268 (-11.84%)  0.413 (-4.18%)   1.116 (-0.09%)   0.360 (-9.55%)
Mix + separation   0.306 (+0.66%)   0.441 (+2.32%)   1.123 (+0.54%)   0.397 (-0.25%)

Caples Watershed · CAP · Lake Tahoe
Mix only           0.334            0.569            1.144            0.496
Separation only    0.327 (-2.10%)   0.581 (+2.11%)   1.154 (+0.87%)   0.506 (+2.02%)
Mix + separation   0.341 (+2.10%)   0.590 (+3.69%)   1.155 (+0.96%)   0.517 (+4.23%)

High Sierras · HSN · Dawn chorus
Mix only           0.527            0.531            1.149            0.432
Separation only    0.548 (+3.98%)   0.548 (+3.20%)   1.149 (+0.00%)   0.448 (+3.70%)
Mix + separation   0.560 (+6.26%)   0.560 (+5.46%)   1.153 (+0.35%)   0.451 (+4.40%)
10
Discussion

Thoughts / Open Questions

Could this method help with unseen species or open-set recognition?

Why is High Sierras the strongest-performing dataset despite being a dawn-chorus setting?

Can separation be integrated into classifier training, instead of only being used at inference?

Why does separation sometimes hurt the most prominent species?

Why wasn't this line of work pushed further?