01
ICASSP 2022 · Google Research

Improving Bird Classification with Unsupervised Sound Separation

Presented by
Amitesh Badkul
Paper authors: Tom Denton, Scott Wisdom, John R. Hershey
Red-vented Bulbul perched on a branch
Photo: Amitesh Badkul.
02
Why this matters

Bird sounds carry ecological context

Birdsong Apps

Birding apps can turn a walk into a species list in seconds.

Ecology indicator

Bird presence and absence reveal habitat quality, biodiversity, and ecosystem stress.

Classification failure

Real recordings contain overlap, wind, insects, and weak labels, so ordinary models miss faint birds.

03
Problem

Why ordinary bird classifiers struggle

Loud bird bias. Many training clips carry one main species label even if other birds are present.
Dawn chorus overlap. Several species sing together, so the input is a dense acoustic mixture.
Weak background supervision. Quiet background birds are often unlabeled, so the model learns to ignore them.

Listen first

Before the method, this is the intuition: field recordings are dense, noisy, and hard to separate by ear.

Demo media from Google Research, stored locally as bird-demo.mp4.
04
Training recipe

Data, augmentation, and evaluation setup

Training data (weakly labeled)

  • Xeno-Canto: weakly labeled field recordings.
  • Macaulay Library: preferred recordings in the combined pool.
  • Selection: high-rated files and background labels preferred.
  • Size: up to 250 recordings per target species, plus 250 out-of-set recordings.

Augmentations

  • Random 5-second crop (time shift) within each 6-second example.
  • Random gain: peak normalized to a random value between 0.05 and 0.75.
  • Example mixing with 50% probability.
  • Noise mixing in 75% of examples, 0-40 dB SNR.
  • Random low-pass filtering in mel space.
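The waveform augmentations above (everything except the mel-space low-pass filter) can be sketched as one function. This is a minimal illustration, not the paper's code; the sample rate and the exact crop/mixing details are assumptions.

```python
import numpy as np

SAMPLE_RATE = 32000  # assumed sample rate; the slides do not state one


def augment(example: np.ndarray, noise: np.ndarray, other: np.ndarray,
            rng: np.random.Generator) -> np.ndarray:
    """Apply the slide's waveform augmentations to a 6 s clip, returning 5 s."""
    # Random 5-second crop from the 6-second example (random time shift).
    crop_len = 5 * SAMPLE_RATE
    start = rng.integers(0, len(example) - crop_len + 1)
    x = example[start:start + crop_len].astype(np.float32)

    # Random gain: peak-normalize to a target peak in [0.05, 0.75].
    peak = np.max(np.abs(x)) + 1e-8
    x = x * (rng.uniform(0.05, 0.75) / peak)

    # Mix with another training example with 50% probability.
    if rng.random() < 0.5:
        x = x + other[:crop_len]

    # Add background noise in 75% of examples at 0-40 dB SNR.
    if rng.random() < 0.75:
        snr_db = rng.uniform(0.0, 40.0)
        sig_pow = np.mean(x ** 2)
        noise_pow = np.mean(noise[:crop_len] ** 2) + 1e-8
        scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
        x = x + scale * noise[:crop_len]
    return x
```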

Evaluation sets (Multi-label)

  • Sapsucker Woods: 40k 5-second segments, 70 species.
  • High Sierras: 4,928 labeled 5-second segments, 18 species.
  • Caples: 2,944 labeled 5-second segments, 42 species.

Metrics

  • CMAP: class-averaged mean average precision.
  • lwlrap: label-weighted label-ranking average precision.
  • d′: sensitivity index.
  • Top-1 precision: precision of the single highest-scoring prediction.
Interpretation: the training recipe tries to reduce shortcut learning from microphone conditions, distance, and background noise.
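Of the metrics above, CMAP is the least standard-sounding; under its usual definition it is just average precision computed per class and then averaged over classes that have at least one positive. A small NumPy sketch (not the paper's implementation):

```python
import numpy as np


def average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Average precision for one class: mean precision at each positive hit."""
    order = np.argsort(-y_score)          # rank examples by descending score
    hits = y_true[order]
    cum_pos = np.cumsum(hits)
    precision_at_k = cum_pos / (np.arange(len(hits)) + 1)
    return float(np.sum(precision_at_k * hits) / max(hits.sum(), 1))


def cmap(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Class-averaged mean AP over classes with at least one positive label."""
    cols = [c for c in range(y_true.shape[1]) if y_true[:, c].any()]
    return float(np.mean([average_precision(y_true[:, c], y_score[:, c])
                          for c in cols]))
```

A perfect ranking (every positive scored above every negative in its class) yields a CMAP of 1.0.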
05
Method overview

The pipeline in four moves

1

Activity detection

Pick bird-rich 6-second windows from weakly labeled recordings.

2

MixIT separation

Train an unsupervised separator on mixtures of mixtures.

3

Channel classification

Run the classifier on each separated output and on the original clip.

4

Taxonomy support

Use species, genus, family, and order heads to stabilize training.
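One way to read the taxonomy-support step is as a multi-head loss: each head (species, genus, family, order) gets its own softmax cross-entropy, and the coarser heads provide gradient signal even when the species label is noisy. A hedged sketch; the slides only name the four heads, so the equal weighting here is an assumption:

```python
import numpy as np

LEVELS = ("species", "genus", "family", "order")


def taxonomy_loss(logits: dict, targets: dict) -> float:
    """Sum of softmax cross-entropies over the four taxonomic heads.

    `logits[level]` is a 1-D logit vector for that head, `targets[level]`
    the integer class index (assumed interface, for illustration only).
    """
    total = 0.0
    for level in LEVELS:
        z = logits[level]
        p = np.exp(z - z.max())       # stable softmax
        p /= p.sum()
        total += -np.log(p[targets[level]] + 1e-12)
    return total
```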

06
Step 1

Activity detection chooses likely bird-rich windows

Activity detector workflow diagram
Input. A weakly labeled recording may be long and may contain silence or the wrong source.
Compute energy. Convert the recording to a log mel spectrogram and measure frame energy.
Find peaks. Use a wavelet peak detector to identify active frames.
Crop windows. Extract 6-second regions around the strongest peaks, up to five per recording.
Benefit
Raises the chance that training clips actually contain bird vocalization.
Limitation
High energy can still come from wind, insects, other birds, or speech.
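The workflow above can be sketched with a simple greedy selector standing in for the wavelet peak detector: repeatedly take the highest-energy frame, crop a 6-second window around it, and suppress that region so later windows don't overlap. This is an illustration of the idea, not the paper's detector:

```python
import numpy as np


def pick_windows(energy_db: np.ndarray, frames_per_sec: int,
                 win_sec: int = 6, max_windows: int = 5):
    """Select up to `max_windows` non-overlapping windows around energy peaks.

    `energy_db` is per-frame energy from a log-mel spectrogram; returns a
    list of (start_frame, end_frame) pairs of length win_sec each.
    """
    half = (win_sec * frames_per_sec) // 2
    e = energy_db.astype(float).copy()
    windows = []
    for _ in range(max_windows):
        peak = int(np.argmax(e))
        if not np.isfinite(e[peak]):      # everything already suppressed
            break
        # Center the window on the peak, clipped to the recording bounds.
        start = max(0, min(peak - half, len(e) - 2 * half))
        windows.append((start, start + 2 * half))
        e[start:start + 2 * half] = -np.inf   # suppress this region
    return windows
```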
07
Step 2

MixIT learns separation without clean source labels

MixIT mixture of mixtures separation training diagram
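The idea in the diagram: mix two already-mixed recordings, feed the mixture of mixtures to the separator, and score it by the best way to assign its output sources back to the two input mixtures. No clean source labels are needed. A minimal NumPy sketch of the objective, using plain MSE where the paper uses an SNR-based loss, and brute-force enumeration of assignments:

```python
import itertools
import numpy as np


def mixit_loss(mix1: np.ndarray, mix2: np.ndarray,
               est_sources: np.ndarray) -> float:
    """MixIT objective for M estimated sources of shape (M, T).

    The separator receives mix1 + mix2 and outputs est_sources; the loss is
    the best reconstruction of (mix1, mix2) over all 2^M binary assignments
    of sources to the two input mixtures.
    """
    m = est_sources.shape[0]
    best = np.inf
    for mask in itertools.product([0, 1], repeat=m):
        a = np.array(mask, dtype=float)[:, None]
        r1 = (est_sources * (1 - a)).sum(axis=0)  # sources assigned to mix1
        r2 = (est_sources * a).sum(axis=0)        # sources assigned to mix2
        loss = np.mean((r1 - mix1) ** 2) + np.mean((r2 - mix2) ** 2)
        best = min(best, loss)
    return best
```

If the separator recovers the true sources exactly, some assignment reconstructs both mixtures perfectly and the loss is zero.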
08
Step 3

Classify the separated channels and the original clip

Classifier combines original and separated channels into species scores
09
Results

Classification results

Method             CMAP             lwlrap           d′               Top-1

Sapsucker Woods · SSW · New York
Mix only           0.304            0.431            1.117            0.398
Separation only    0.268 (-11.84%)  0.413 (-4.18%)   1.116 (-0.09%)   0.360 (-9.55%)
Mix + separation   0.306 (+0.66%)   0.441 (+2.32%)   1.123 (+0.54%)   0.397 (-0.25%)

Caples Watershed · CAP · Lake Tahoe
Mix only           0.334            0.569            1.144            0.496
Separation only    0.327 (-2.10%)   0.581 (+2.11%)   1.154 (+0.87%)   0.506 (+2.02%)
Mix + separation   0.341 (+2.10%)   0.590 (+3.69%)   1.155 (+0.96%)   0.517 (+4.23%)

High Sierras · HSN · Dawn chorus
Mix only           0.527            0.531            1.149            0.432
Separation only    0.548 (+3.98%)   0.548 (+3.20%)   1.149 (+0.00%)   0.448 (+3.70%)
Mix + separation   0.560 (+6.26%)   0.560 (+5.46%)   1.153 (+0.35%)   0.451 (+4.40%)
10
Discussion

Thoughts / Open Questions

Could this method help with unseen species or open-set recognition?

Why is High Sierras the strongest-performing dataset despite being a dawn-chorus setting?

Can separation be integrated into classifier training, instead of only being used at inference?

Why does separation sometimes hurt the most prominent species?

Why wasn't this line of work pushed further?