
Unified Audio Source Separation (Defense Slides)




Kohei Saijo

March 31, 2026


Transcript

  1. Audio Source Separation

     ■ Humans possess an ability to understand complex acoustic environments
       • Can isolate and recognize desired sources flexibly and adaptively
     ■ Source separation: separating a mixture of sounds into individual sources
       • A fundamental technology for implementing auditory scene analysis ability in machines
  2. Unified Source Separation

     ■ Most existing models: task-specific design
     ■ Goal: a task-unifying system
       • From: a fixed number of sources / a single source granularity / only a few classes of sources
       • To: a variable number of sources / multiple granularities with explicit control / broad classes of sources
  3. Table of Contents

     ■ Unified Source Separation Model
       • Transformer-based separation model (Chapter 3.2)
       • Unified source separation model based on prompting (Chapter 6.3)
     ■ Encoder/Decoder for Unified Source Separation Model
       • Cross-attention-based encoder/decoder for spectral feature compression (Chapter 3.3)
     ■ Future Directions
       • Large-scale pre-training by unsupervised learning
       • Unified source separation with input universality
  4. Table of Contents (section divider; same contents as slide 3)
  5. Motivation

     ■ Different applications (or tasks) have different target sources of interest
       • A task-specific model needs to be deployed for each application, which is inefficient
     ■ NNs' powerful modeling capability may enable unifying all separation tasks
       • LLMs handle various tasks that were originally handled by specialist models
       • To address all separation tasks, the model needs to handle (i) arbitrary classes of sources and (ii) a variable number of sources, with (iii) explicit control of granularity

     | Task                                     | Sources of interest              |
     |------------------------------------------|----------------------------------|
     | Speech enhancement (SE)                  | Speech, Noise                    |
     | Speech separation (SS)                   | Speech × N, Noise                |
     | Environmental sound separation (USS)     | Sound effects (SFX) × N          |
     | Music source separation (MSS)            | Vocals, Bass, Drums, Other inst. |
     | Cinematic audio source separation (CASS) | Speech, SFX-mix, Music-mix       |
  6. How Can We Build a Unified Source Separation Model?

     ■ Requirements for a unified source separation model
       1. A conditional model that can change its behavior at inference
       2. A model that accepts a variable number of prompts
       3. A model that accepts multiple identical prompts

     [Figure: a separation model takes a mixture and N prompts, and outputs N separated sources]
  7. How Can We Satisfy the Requirements?

     ■ Transformer-based separation models can satisfy all the requirements
       1. The model can change its behavior by, e.g., prompting
       2. Transformers work regardless of the input sequence length
       3. Positional encoding makes identical prompts different from each other

     [Figure: a separation model takes a mixture and N prompts, and outputs N separated sources]
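The third point can be illustrated with a toy example. Below is a minimal numpy sketch (not the actual model code) showing that adding a standard sinusoidal positional encoding makes two copies of the same learnable prompt embedding distinguishable:

```python
import numpy as np

def sinusoidal_pe(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

dim = 8
prompt = np.ones(dim)                 # one learnable <Speech> embedding (toy values)
prompts = np.stack([prompt, prompt])  # the same prompt used twice, e.g. for 2 speakers
prompts_with_pe = prompts + sinusoidal_pe(num_positions=2, dim=dim)

# Identical before, distinguishable after adding the positional encoding
assert np.allclose(prompts[0], prompts[1])
assert not np.allclose(prompts_with_pe[0], prompts_with_pe[1])
```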
  8. TF-domain Dual-Path Separation Models

     ■ Basis of the current state-of-the-art separation models
       • Alternate sequence modeling along the time and frequency dimensions
       • LSTM is very strong [Wang+, 2023], but a Transformer alternative would be valuable
         • Scalability, prompting, etc.

     [Figure: mixture waveform → STFT → Conv2D + gLN → ×B blocks alternating frequency modeling and temporal modeling over a D×F×T feature → Deconv2D → iSTFT → separated signals]

     [Wang+, 2023]: Z.-Q. Wang et al., "TF-GridNet: Integrating full- and sub-band modeling for speech separation," IEEE/ACM TASLP, 2023.
  9. TF-Locoformer

     ■ TF-domain Transformer with LOcal modeling by COnvolution
       • A design inspired by Conformer [Gulati+, 2020] and Transformer++† [Gu+, 2024]

     [Gulati+, 2020]: A. Gulati et al., "Conformer: Convolution-augmented transformer for speech recognition," Interspeech, 2020.
     †: The name "Transformer++" follows [Gu+, 2024]: A. Gu et al., "Mamba: Linear-time sequence modeling with selective state spaces," First Conference on Language Modeling, 2024.
  10. Key Components of TF-Locoformer

     ■ Conv-SwiGLU FFN
       • Convolutional layers inspired by Conformer
       • SwiGLU activation inspired by Transformer++
     ■ Macaron-style architecture
       • Two FFNs, before and after MHSA
     ■ RMSGroupNorm
       • Split a D-dimensional vector into G groups and normalize each D/G-dimensional sub-vector
       • This may encourage disentanglement of each source's features

     [Figure: a D-dimensional feature split into G groups, each normalized independently]
  11. Speech Separation Experiments

     ■ Datasets: WSJ0-2mix (anechoic) and WHAMR! (noisy reverberant)
     ■ Metric: SI-SDR [dB]
     ■ Results:
       • Comparable performance to the LSTM-based model TF-GridNet
       • Better performance with a larger model

     | Model             | #Params | WSJ0-2mix | WHAMR! |
     |-------------------|---------|-----------|--------|
     | TF-GridNet (S)    | 5.5 M   | -         | 17.1   |
     | TF-GridNet (M)    | 14.4 M  | 23.5      | -      |
     | TF-Locoformer (S) | 5.0 M   | 22.0      | 17.4   |
     | TF-Locoformer (M) | 15.0 M  | 23.6      | 18.5   |
     | TF-Locoformer (L) | 22.5 M  | 24.2      | -      |
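For reference, the SI-SDR metric used above can be computed as follows; a minimal numpy sketch of the standard definition (project the estimate onto the reference, then take the energy ratio in dB):

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant SDR in dB."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal scaling of the reference
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

ref = np.array([1.0, 0.0, 1.0, 0.0])
est = ref + np.array([0.0, 0.1, 0.0, 0.1])  # estimate with a small additive error
assert abs(si_sdr(est, ref) - 20.0) < 1e-6  # 10 * log10(2 / 0.02) = 20 dB
```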
  12. Unified Source Separation based on TF-Locoformer

     ■ TF-Locoformer satisfies the requirements for building prompting-based models
       • Invariant to sequence length
       • Yet it can make identical prompts different from each other thanks to the positional encoding

     [Figure: a separation model takes a mixture and N prompts, and outputs N separated sources]
  13. Prompts to be Considered

     ■ 8 types of prompts
       • Speech: <Speech>
       • Sound effects: <SFX-mix>, <SFX>
       • Music: <Music-mix>, <Drums>, <Bass>, <Vocals>, <Other inst.>
     ■ Major tasks can be covered by changing the combination of prompts

     | Task                                     | Prompts                                  |
     |------------------------------------------|------------------------------------------|
     | Speech enhancement (SE)                  | <Speech>, <SFX-mix>                      |
     | Speech separation (SS)                   | <Speech> × N, <SFX-mix>                  |
     | Environmental sound separation (USS)     | <SFX> × N                                |
     | Music source separation (MSS)            | <Drums>, <Bass>, <Vocals>, <Other inst.> |
     | Cinematic audio source separation (CASS) | <Speech>, <SFX-mix>, <Music-mix>         |
  14. Task-aware Unified Source Separation (TUSS) [Saijo+, ICASSP 2025]

     ■ A model that satisfies the requirements by using self-attention

     [Figure: mixture → Encoder; the encoded feature and the learnable prompts (e.g., <Speech>, ..., <SFX-mix>) → Cross-prompt module → shared Conditional TSE modules → shared Decoders → one output per prompt]
  15. 1. Encoder

     ■ STFT-domain band-split module [Luo+, IEEE/ACM TASLP 2023]
       • Applies STFT to the mixture waveform
       • Further encodes the spectrogram into a feature F ∈ ℝ^{D×K×T} (D channels, K subbands, T frames)

     [Luo+, 2023]: Y. Luo and J. Yu, "Music source separation with band-split RNN," IEEE/ACM TASLP, 2023.
  16. 2. Cross-prompt Module

     ■ Mixes the prompts and the encoded feature via self-attention
       • The prompts and the encoded feature are conditioned on each other
       • Enables us to use a variable number of prompts and multiple identical prompts

     [Figure: the learnable prompt embeddings and the encoded mixture feature are processed jointly by the cross-prompt module]
  17. 3. Conditional TSE Module

     ■ Processes each pair of a prompt and the feature, one by one
       • Conditioning by element-wise product: z_n = p_n ⊙ F ∈ ℝ^{D×K×T}
       • Further applies some TF-Locoformer blocks
       • A variable number of prompts is acceptable because the TSE module is shared across all n
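The element-wise-product conditioning z_n = p_n ⊙ F can be sketched in a few lines of numpy (toy sizes; random arrays stand in for the learned prompts and the encoded feature):

```python
import numpy as np

D, K, T = 8, 6, 10                  # feature dim, subbands, frames (toy sizes)
feature = np.random.randn(D, K, T)  # shared mixture feature from the encoder
prompts = np.random.randn(3, D)     # e.g. <Speech>, <Speech>, <SFX-mix> after the
                                    # cross-prompt module (stand-in values)

# One conditioned feature per prompt: z_n = p_n ⊙ F, prompt broadcast over K and T
conditioned = [p[:, None, None] * feature for p in prompts]

# Each z_n has the same shape as F; a shared TSE module then processes each z_n
assert len(conditioned) == 3
assert all(z.shape == (D, K, T) for z in conditioned)
```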
  18. 4. Decoder

     ■ Band-wise decoding to project the separated features back to the STFT domain
       • Inverse transformation of the encoder
       • Applied independently to each source
  19. Experimental Setup

     ■ Training data: on-the-fly mixing of audio from a collection of public datasets
       • Randomly sample 2-4 prompts and the corresponding audio data, and mix them
     ■ Validation/testing data: public benchmarks for each task
       • VoiceBank-DEMAND (SE), WHAM! (SS), FUSS (USS), MUSDB-HQ (MSS), DnR (CASS)

     | Category    | Datasets                |
     |-------------|-------------------------|
     | Speech      | VCTK, WSJ, LibriVox     |
     | SFX         | FSD50K                  |
     | SFX-mix     | WHAM!, DEMAND, FSD50K   |
     | Music inst. | MUSDB-HQ, MOISESDB      |
     | Music-mix   | FMA, MUSDB-HQ, MOISESDB |
  20. Unified Source Separation Experiments

     ■ Methods:
       • Unconditional: an unconditional separation model with a fixed number of outputs
       • TUSS: the proposed prompting-based conditional model
     ■ Metric: SI-SDR [dB], except for MSS where SNR [dB] is used
     ■ Results:
       • The prompting-based unified model outperforms the unconditional model
       • Successfully handled tasks that require different numbers of sources and granularities

     | Method        | SE   | SS  | USS | MSS | CASS | Average |
     |---------------|------|-----|-----|-----|------|---------|
     | Unconditional | 14.0 | 6.8 | 8.1 | 4.9 | 6.0  | 8.0     |
     | TUSS          | 14.8 | 9.1 | 9.6 | 6.8 | 9.1  | 9.9     |
  21. Comparison with Specialist Models

     ■ Two types of models trained on different data
       • Specialist: a model trained on all the data for one task
       • TUSS: a model trained on all the available data
     ■ Results:
       • The unified model could not outperform the specialist models
       • Generally, larger models benefit more from larger data
         • What if we scale up the model?

     | Type       | SE   | SS   | USS  | MSS | CASS | Average |
     |------------|------|------|------|-----|------|---------|
     | Specialist | 15.9 | 10.3 | 10.2 | 8.3 | 9.7  | 10.9    |
     | TUSS       | 14.8 | 9.1  | 9.6  | 6.8 | 9.1  | 9.9     |
  22. Results on a Larger Model

     ■ TUSS achieved performance comparable to the specialists on several tasks
       • As expected, the larger model benefits from the large data
       • Shows its potential to serve as a foundation model for source separation

     | Model size | Type       | SE   | SS   | USS  | MSS | CASS | Average |
     |------------|------------|------|------|------|-----|------|---------|
     | Medium     | Specialist | 15.9 | 10.3 | 10.2 | 8.3 | 9.7  | 10.9    |
     | Medium     | TUSS       | 14.8 | 9.1  | 9.6  | 6.8 | 9.1  | 9.9     |
     | Large      | Specialist | 16.0 | 11.4 | 10.0 | 9.1 | 10.0 | 11.3    |
     | Large      | TUSS       | 15.1 | 10.3 | 12.2 | 7.4 | 10.1 | 11.0    |
  23. Controllability at Inference

     [Audio demo: a mixture of speech, SFX(-mix), and music-mix (vocals + other), separated with different prompt sets:]
       • <Speech>, <SFX-mix>, <Music-mix>
       • <Speech>, <SFX>, <SFX>, <Music-mix>
       • <Speech>, <SFX-mix>, <Vocals>, <Other>
  24. Controllability at Inference

     ■ TUSS trained with up to 4 prompts shows successful separation even with 5 prompts
       • <Speech>, <SFX>, <SFX>, <Vocals>, <Other>

     [Audio demo: the same mixture of speech, SFX(-mix), and music-mix (vocals + other)]
  25. Table of Contents (section divider; same contents as slide 3)
  26. Challenge in TF-Locoformer: Computational Cost

     ■ Computational cost is proportional to the number of frames T and the number of frequency bins F
       • F can be large (e.g., 1025) when the sampling rate is high
         • Music source separation, cinematic audio source separation, etc.
       • A unified source separation system needs to handle such data

     [Figure: TF-domain dual-path model; the separator alternates frequency modeling and temporal modeling over a D×F×T feature]
  27. Band-split (BS) Encoder/Decoder [Luo+, 2023]

     ■ Subband-wise encoding/decoding with K sub-encoders/sub-decoders

     [Figure: the spectrogram is split into K subbands according to a band config (e.g., mel); sub-encoder k maps its |b_k| bins to a single D-dimensional vector per frame, and the merged D×K×T feature goes to the separator]

     [Luo+, 2023]: Y. Luo and J. Yu, "Music source separation with band-split RNN," IEEE/ACM TASLP, 2023.
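A minimal numpy sketch of the band-split encoding idea (random linear layers stand in for the learned sub-encoders, and the band config is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, D = 9, 20, 16                          # toy sizes: bins, frames, feature dim
bands = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]    # hypothetical band config, K = 3
spec = rng.standard_normal((2, F, T))        # real/imag parts of the spectrogram

# One sub-encoder per band (here a random linear layer): 2*|b_k| -> D
weights = [rng.standard_normal((D, 2 * len(b))) for b in bands]

encoded = []
for b, w in zip(bands, weights):
    sub = spec[:, b, :].reshape(2 * len(b), T)  # stack real/imag bins of the band
    encoded.append(w @ sub)                     # (D, T): one vector per frame
feature = np.stack(encoded, axis=1)             # (D, K, T), K = number of bands

assert feature.shape == (D, len(bands), T)
```

Note how the frequency axis shrinks from F = 9 bins to K = 3 subband slots, which is exactly the compression the separator benefits from.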
  28. Psychoacoustically Motivated Inductive Bias

     ■ Band-splitting with psychoacoustic knowledge

     Figure from: K. N. Watcharasupat et al., "A generalized bandsplit neural network for cinematic audio source separation," IEEE OJSP, 2023.
  29. Inherent Limitations of the BS Module

     1. The encoding/decoding process (linear layer or MLP) is not input-adaptive
        • It cannot leverage input-dependent information
     2. Limited receptive field
        • The receptive field is inherently limited in order to incorporate the inductive bias
     3. Large parameter count
        • The total number of parameters in the encoder and decoder is more than twice that of the separator
  30. Compression by Sequence Modeling

     ■ Goal: design an encoder/decoder that satisfies
       1. Input-adaptive encoding/decoding leveraging input-dependent information
       2. Unlimited receptive field
       3. Small parameter count
     ■ Sequence modeling satisfies all the requirements
       • Cross-attention with a query of length F′ (Perceiver IO [Jaegle+, 2022])

     [Jaegle+, 2022]: A. Jaegle et al., "Perceiver IO: A general architecture for structured inputs & outputs," ICLR, 2022.
  31. Overall Separation Pipeline based on Perceiver IO

     ■ Encoder E
       • Encodes the spectrogram into a D-dimensional feature while compressing the frequency dimension from F to F′, via cross-attention with a learned query of length F′
     ■ Decoder D
       • Restores the original frequency resolution F via cross-attention, then predicts masks

     [Figure: mixture spectrogram → Conv2D + Norm → cross-attention (length-F′ query; keys/values from the feature) → separator → cross-attention (length-F query) → Deconv2D → masks, applied per time frame]
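The frequency-compression step can be sketched as a single cross-attention with a short learned query, in the spirit of Perceiver IO (toy sizes, single head, no learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
F, Fc, D = 65, 8, 16                    # full and compressed frequency dims
feature = rng.standard_normal((F, D))   # one time frame of the encoded spectrogram
query = rng.standard_normal((Fc, D))    # learned query of length F' << F (stand-in)

# Cross-attention: the short query attends over all F bins (keys = values = feature)
attn = softmax(query @ feature.T / np.sqrt(D))  # (F', F) attention weights
compressed = attn @ feature                     # (F', D) compressed feature

assert compressed.shape == (Fc, D)
assert np.allclose(attn.sum(axis=-1), 1.0)
```

Unlike a fixed band-split projection, the weights `attn` depend on the input feature itself, and each query row can in principle attend to any of the F bins.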
  32. Preliminary Experiment: Band-split vs. Cross-attention

     ■ Plain cross-attention performs much worse than the band-split module
       • Likely because cross-attention does not leverage any inductive bias
  33. Positional Bias to Incorporate Inductive Bias

     ■ Band-split
       • The k-th sub-encoder/sub-decoder is in charge of the k-th band
     ■ Cross-attention
       • The k-th query is in charge of the k-th band
       • It needs to learn where the k-th query should attend, but the model failed to learn this
     ■ Solution
       • Incorporate a psychoacoustically motivated inductive bias into cross-attention
       • Reformulate CA with a positional bias B ∈ ℝ^{F′×F} [Press+, 2022]:
         CrossAttention(Q, K, V) = softmax(QKᵀ + B) V

     [Press+, 2022]: O. Press et al., "Train short, test long: Attention with linear biases enables input length extrapolation," ICLR, 2022.
  34. Positional Bias as Attention Mask

     ■ CA with a positional bias B ∈ ℝ^{F′×F}:
       softmax(QKᵀ + B) V
     ■ Setting B to 0 inside the k-th band and −∞ outside enforces the k-th query to attend only to the k-th band
       • But this limits the receptive field

     [Figure: hard-mask biases for the encoder (F′×F) and decoder (F×F′) under the band-split config B = [[1–3], [4–5], [6–9]]; entries are 0 in-band and −∞ elsewhere]
  35. Design of Positional Bias

     ■ CA with a positional bias B ∈ ℝ^{F′×F}:
       softmax(QKᵀ + B) V
     ■ B is designed to encourage (not force) the k-th query to attend to the k-th band
       • A finite penalty that grows with the distance from the band, so the receptive field is not limited

     [Figure: soft biases for the encoder and decoder under the band-split config B = [[1–3], [4–5], [6–9]]; entries are 0 in-band and increasingly negative with distance from the band]
  36. Visualization of Positional Bias

     ■ Encoder's positional bias based on the Musical band config [Watcharasupat+, 2023]
       • The positional bias is shown after applying softmax for better visualization

     [Watcharasupat+, 2023]: K. N. Watcharasupat et al., "A generalized bandsplit neural network for cinematic audio source separation," IEEE OJSP, 2023.
  37. Final Form of the Proposed SFC-CA

     ■ CA-based Spectral Feature Compression (SFC-CA) with positional bias
       • The positional bias introduces an inductive bias analogous to the BS module
       • Built upon the Perceiver IO framework but specifically designed for compressing frequency information

     [Figure: the pipeline of slide 31 with positional biases added to the encoder (ℝ^{F′×F}) and decoder (ℝ^{F×F′}) cross-attention, derived from the band config]
  38. Preliminary Experiments on MSS and CASS

     ■ Encoder/decoder setup: F = 1025, F′ = 64 (Musical band config, p. 29)
     ■ MSS: SNR [dB] on MUSDB18-HQ

     | Model        | Enc/dec    | Params | Vocals | Bass | Drums | Other | Avg. |
     |--------------|------------|--------|--------|------|-------|-------|------|
     | TF-Loco. (S) | Band-split | 34.7 M | 9.0    | 8.2  | 9.9   | 5.9   | 8.3  |
     | TF-Loco. (S) | SFC-CA     | 5.8 M  | 9.6    | 8.7  | 10.8  | 6.7   | 9.0  |
     | TF-Loco. (M) | Band-split | 55.5 M | 9.6    | 8.9  | 10.4  | 6.2   | 8.8  |
     | TF-Loco. (M) | SFC-CA     | 16.0 M | 10.2   | 9.2  | 11.1  | 7.1   | 9.4  |

     ■ CASS: SNR [dB] on DnR

     | Model        | Enc/dec    | Params | Speech | Music | SFX  | Avg. |
     |--------------|------------|--------|--------|-------|------|------|
     | TF-Loco. (S) | Band-split | 34.7 M | 15.6   | 8.8   | 9.8  | 11.4 |
     | TF-Loco. (S) | SFC-CA     | 5.8 M  | 15.9   | 9.3   | 10.2 | 11.8 |
     | TF-Loco. (M) | Band-split | 55.5 M | 16.1   | 9.4   | 10.3 | 11.9 |
     | TF-Loco. (M) | SFC-CA     | 16.0 M | 16.4   | 9.7   | 10.6 | 12.2 |
  39. Integrating SFC into TUSS

     ■ Replacing the BS encoder/decoder with SFC

     [Figure: the TUSS pipeline of slide 14 with the encoder and decoders replaced by the SFC-Encoder and shared SFC-Decoders]
  40. Unified Source Separation Experiments

     ■ Model: TUSS medium
       • BSRoformer's band config: F′ = 61, without inter-band overlap
       • Musical band config: F′ = 64, with inter-band overlap
     ■ Evaluation in SI-SDR [dB] (except for MSS, where SNR [dB] is used):
       • SFC-CA performs consistently better than the band-split module

     | Enc/Dec     | Band config     | SE   | SS  | USS  | MSS | CASS | Average |
     |-------------|-----------------|------|-----|------|-----|------|---------|
     | Band-split* | BSRoformer (61) | 15.2 | 9.0 | 9.1  | 6.8 | 9.1  | 9.8     |
     | SFC-CA      | BSRoformer (61) | 15.4 | 9.1 | 10.6 | 6.8 | 9.2  | 10.2    |
     | Band-split  | Musical (64)    | 14.8 | 8.6 | 8.3  | 6.8 | 8.8  | 9.5     |
     | SFC-CA      | Musical (64)    | 15.5 | 9.2 | 10.4 | 7.2 | 9.3  | 10.3    |

     *: This result differs from the previous slides because the batch size is smaller (8 → 4).
  41. Table of Contents (section divider; same contents as slide 3)
  42. Data Scarcity in Source Separation

     ■ Source separation faces data scarcity
       • A mixture and its reference sources cannot be recorded at the same time
       • Collecting reference sources for diverse classes of sounds is challenging
         • Environmental sounds, music, etc.
     ■ To address data scarcity, unsupervised learning is promising
       • Ideally, the method should work on monaural data with an unknown number of sources
  43. Self-Remixing

     ■ Unsupervised source separation by iterating separation and remixing
       • The model is trained to reconstruct the initial mixtures from pseudo-mixtures
       • Enables unsupervised learning from unlabeled monaural mixtures
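The remixing step can be sketched in numpy: sources separated from a batch of mixtures are shuffled across the batch to form pseudo-mixtures, which a second separation pass is then trained to undo (random arrays stand in for the model's separated outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, T = 4, 2, 100                       # batch, sources per mixture, samples
sources = rng.standard_normal((B, N, T))  # pretend-separated sources (stand-in
                                          # for the separation model's outputs)
mixtures = sources.sum(axis=1)            # the original (unlabeled) mixtures

# Remixing: shuffle each source slot independently across the batch to form
# pseudo-mixtures; the model is then trained so that separating the
# pseudo-mixtures and un-shuffling reconstructs the original mixtures.
perms = np.stack([rng.permutation(B) for _ in range(N)], axis=1)  # (B, N)
pseudo = np.stack(
    [sources[perms[:, n], n] for n in range(N)], axis=1
).sum(axis=1)                                                     # (B, T)

# The pseudo-mixtures are a re-pairing of the same sources, so the total
# signal over the batch is conserved
assert np.allclose(pseudo.sum(axis=0), mixtures.sum(axis=0))
```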
  44. Large-scale Unsupervised Pre-training for TUSS

     ■ We can pre-train the backbone separation model with Self-Remixing
       • Training TUSS in an unsupervised manner is not easy, as granularity cannot be controlled without labels
       • But we can first pre-train an unconditional model and use it to initialize TUSS

     [Figure: unsupervised pre-training of a separation model on large-scale unlabeled data → the pre-trained model initializes the TUSS framework → supervised fine-tuning on labeled data]
  45. Unified Source Separation with Input Universality

     ■ Output universality
       • Variable number of sources
       • Multiple granularities with explicit control
       • Broad classes of sources
     ■ Input universality
       • Variable number of microphones
       • Multimodal prompts
       • Various types of distortions
     ■ Combining both leads to a universal model
  46. Summary

     ■ Goal: unified source separation
       • Separating a variable number and arbitrary classes of sources, with controllability
     ■ Approaches
       • TF-Locoformer: a Transformer-based separation model that forms the basis of TUSS
       • Task-aware Unified Source Separation (TUSS): a framework that controls the number of sources and their granularities by prompting
       • Spectral Feature Compression (SFC): an encoder/decoder that handles TF-domain features efficiently
     ■ Future directions
       • Large-scale unsupervised pre-training; TUSS with input universality