
Unified Audio Source Separation (Defense Slides)




Kohei Saijo

March 31, 2026


Transcript

  1. Audio Source Separation

     ■ Humans possess an ability to understand complex acoustic environments
       • Can isolate and recognize desired sources flexibly and adaptively
     ■ Source separation: separating a mixture of sounds into individual sources
       • A fundamental technology for implementing auditory scene analysis ability in machines
  2. Unified Source Separation

     ■ Most existing models: task-specific design
     ■ Goal: a task-unifying system
       • From: a fixed number of sources / a single source granularity / only a few classes of sources
       • To: a variable number of sources / multiple granularities with explicit control / broad classes of sources
  3. Table of Contents

     ■ Unified Source Separation Model
       • Transformer-based separation model (Chapter 3.2)
       • Unified source separation model based on prompting (Chapter 6.3)
     ■ Encoder/Decoder for Unified Source Separation Model
       • Cross-attention-based encoder/decoder for spectral feature compression (Chapter 3.3)
     ■ Future Directions
       • Large-scale pre-training by unsupervised learning
       • Unified source separation with input universality
  4. Table of Contents (section divider; same contents as slide 3)
  5. Motivation

     ■ Different applications (or tasks) have different target sources of interest
       • A task-specific model needs to be deployed for each application, which is inefficient
     ■ NNs' powerful modeling capability may enable unifying all separation tasks
       • LLMs handle various tasks that were originally handled by specialist models
       • To address all separation tasks, the model needs to handle (i) arbitrary classes of sources and (ii) a variable number of sources, with (iii) explicit control of granularity

     | Task                                     | Sources of interest              |
     |------------------------------------------|----------------------------------|
     | Speech enhancement (SE)                  | Speech, Noise                    |
     | Speech separation (SS)                   | Speech × N, Noise                |
     | Environmental sound separation (USS)     | Sound effects (SFX) × N          |
     | Music source separation (MSS)            | Vocals, Bass, Drums, Other inst. |
     | Cinematic audio source separation (CASS) | Speech, SFX-mix, Music-mix       |
  6. How Can We Build a Unified Source Separation Model?

     ■ Requirements for a unified source separation model
       1. A conditional model that can change its behavior at inference
       2. A model that accepts a variable number of prompts
       3. A model that accepts multiple identical prompts

     [Figure: a separation model takes a mixture and N prompts, and outputs N separated sources]
  7. How Can We Satisfy the Requirements?

     ■ Transformer-based separation models can satisfy all the requirements
       1. The model can change its behavior by, e.g., prompting
       2. Transformers work regardless of the input sequence length
       3. Positional encoding makes identical prompts different from each other

     [Figure: a separation model takes a mixture and N prompts, and outputs N separated sources]
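The third point can be illustrated with a toy example. Below is a minimal numpy sketch (not the actual model code) showing that adding a standard sinusoidal positional encoding makes two copies of the same learnable prompt embedding distinguishable:

```python
import numpy as np

def sinusoidal_pe(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

dim = 8
prompt = np.ones(dim)                 # one learnable <Speech> embedding (toy values)
prompts = np.stack([prompt, prompt])  # the same prompt used twice, e.g. for 2 speakers
prompts_with_pe = prompts + sinusoidal_pe(num_positions=2, dim=dim)

# Identical before, distinguishable after adding the positional encoding
assert np.allclose(prompts[0], prompts[1])
assert not np.allclose(prompts_with_pe[0], prompts_with_pe[1])
```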
  8. TF-domain Dual-Path Separation Models

     ■ Basis of the current state-of-the-art separation models
       • Alternate sequence modeling along the time and frequency dimensions
       • LSTM is very strong [Wang+, 2023], but a Transformer alternative would be valuable
         • Scalability, prompting, etc.

     [Figure: mixture waveform → STFT → Conv2D + gLN → ×B blocks alternating frequency modeling and temporal modeling over a D×F×T feature → Deconv2D → iSTFT → separated signals]

     [Wang+, 2023]: Z.-Q. Wang et al., "TF-GridNet: Integrating full- and sub-band modeling for speech separation," IEEE/ACM TASLP, 2023.
  9. TF-Locoformer

     ■ TF-domain Transformer with LOcal modeling by COnvolution
       • A design inspired by Conformer [Gulati+, 2020] and Transformer++† [Gu+, 2024]

     [Gulati+, 2020]: A. Gulati et al., "Conformer: Convolution-augmented transformer for speech recognition," Interspeech, 2020.
     †: The name "Transformer++" follows [Gu+, 2024]: A. Gu et al., "Mamba: Linear-time sequence modeling with selective state spaces," First Conference on Language Modeling, 2024.
  10. Key Components of TF-Locoformer

     ■ Conv-SwiGLU FFN
       • Convolutional layers inspired by Conformer
       • SwiGLU activation inspired by Transformer++
     ■ Macaron-style architecture
       • Two FFNs, before and after MHSA
     ■ RMSGroupNorm
       • Split a D-dimensional vector into G groups and normalize each D/G-dimensional sub-vector
       • This may encourage disentanglement of each source's features

     [Figure: a D-dimensional feature split into G groups, each normalized independently]
  11. Speech Separation Experiments

     ■ Datasets: WSJ0-2mix (anechoic) and WHAMR! (noisy reverberant)
     ■ Metric: SI-SDR [dB]
     ■ Results:
       • Comparable performance to the LSTM-based model TF-GridNet
       • Better performance with a larger model

     | Model             | #Params | WSJ0-2mix | WHAMR! |
     |-------------------|---------|-----------|--------|
     | TF-GridNet (S)    | 5.5 M   | -         | 17.1   |
     | TF-GridNet (M)    | 14.4 M  | 23.5      | -      |
     | TF-Locoformer (S) | 5.0 M   | 22.0      | 17.4   |
     | TF-Locoformer (M) | 15.0 M  | 23.6      | 18.5   |
     | TF-Locoformer (L) | 22.5 M  | 24.2      | -      |
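For reference, the SI-SDR metric used above can be computed as follows; a minimal numpy sketch of the standard definition (project the estimate onto the reference, then take the energy ratio in dB):

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant SDR in dB."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal scaling of the reference
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

ref = np.array([1.0, 0.0, 1.0, 0.0])
est = ref + np.array([0.0, 0.1, 0.0, 0.1])  # estimate with a small additive error
assert abs(si_sdr(est, ref) - 20.0) < 1e-6  # 10 * log10(2 / 0.02) = 20 dB
```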
  12. Unified Source Separation based on TF-Locoformer

     ■ TF-Locoformer satisfies the requirements for building prompting-based models
       • Invariant to sequence length
       • Yet it can make identical prompts different from each other thanks to the positional encoding

     [Figure: a separation model takes a mixture and N prompts, and outputs N separated sources]
  13. Prompts to be Considered

     ■ 8 types of prompts
       • Speech: <Speech>
       • Sound effects: <SFX-mix>, <SFX>
       • Music: <Music-mix>, <Drums>, <Bass>, <Vocals>, <Other inst.>
     ■ Major tasks can be covered by changing the combination of prompts

     | Task                                     | Prompts                                  |
     |------------------------------------------|------------------------------------------|
     | Speech enhancement (SE)                  | <Speech>, <SFX-mix>                      |
     | Speech separation (SS)                   | <Speech> × N, <SFX-mix>                  |
     | Environmental sound separation (USS)     | <SFX> × N                                |
     | Music source separation (MSS)            | <Drums>, <Bass>, <Vocals>, <Other inst.> |
     | Cinematic audio source separation (CASS) | <Speech>, <SFX-mix>, <Music-mix>         |
  14. Task-aware Unified Source Separation (TUSS) [Saijo+, ICASSP 2025]

     ■ A model that satisfies the requirements by using self-attention

     [Figure: mixture → Encoder; the encoded feature and the learnable prompts (e.g., <Speech>, ..., <SFX-mix>) → Cross-prompt module → shared Conditional TSE modules → shared Decoders → one output per prompt]
  15. 1. Encoder

     ■ STFT-domain band-split module [Luo+, IEEE/ACM TASLP 2023]
       • Applies STFT to the mixture waveform
       • Further encodes the spectrogram into a feature F ∈ ℝ^{D×K×T} (D channels, K subbands, T frames)

     [Luo+, 2023]: Y. Luo and J. Yu, "Music source separation with band-split RNN," IEEE/ACM TASLP, 2023.
  16. 2. Cross-prompt Module

     ■ Mixes the prompts and the encoded feature via self-attention
       • The prompts and the encoded feature are conditioned on each other
       • Enables us to use a variable number of prompts and multiple identical prompts

     [Figure: the learnable prompt embeddings and the encoded mixture feature are processed jointly by the cross-prompt module]
  17. 3. Conditional TSE Module

     ■ Processes each pair of a prompt and the feature, one by one
       • Conditioning by element-wise product: z_n = p_n ⊙ F ∈ ℝ^{D×K×T}
       • Further applies some TF-Locoformer blocks
       • A variable number of prompts is acceptable because the TSE module is shared across all n
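The element-wise-product conditioning z_n = p_n ⊙ F can be sketched in a few lines of numpy (toy sizes; random arrays stand in for the learned prompts and the encoded feature):

```python
import numpy as np

D, K, T = 8, 6, 10                  # feature dim, subbands, frames (toy sizes)
feature = np.random.randn(D, K, T)  # shared mixture feature from the encoder
prompts = np.random.randn(3, D)     # e.g. <Speech>, <Speech>, <SFX-mix> after the
                                    # cross-prompt module (stand-in values)

# One conditioned feature per prompt: z_n = p_n ⊙ F, prompt broadcast over K and T
conditioned = [p[:, None, None] * feature for p in prompts]

# Each z_n has the same shape as F; a shared TSE module then processes each z_n
assert len(conditioned) == 3
assert all(z.shape == (D, K, T) for z in conditioned)
```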
  18. 4. Decoder

     ■ Band-wise decoding to project the separated features back to the STFT domain
       • Inverse transformation of the encoder
       • Applied independently to each source
  19. Experimental Setup

     ■ Training data: on-the-fly mixing of audio from a collection of public datasets
       • Randomly sample 2-4 prompts and the corresponding audio data, and mix them
     ■ Validation/testing data: public benchmarks for each task
       • VoiceBank-DEMAND (SE), WHAM! (SS), FUSS (USS), MUSDB-HQ (MSS), DnR (CASS)

     | Category    | Datasets                |
     |-------------|-------------------------|
     | Speech      | VCTK, WSJ, LibriVox     |
     | SFX         | FSD50K                  |
     | SFX-mix     | WHAM!, DEMAND, FSD50K   |
     | Music inst. | MUSDB-HQ, MOISESDB      |
     | Music-mix   | FMA, MUSDB-HQ, MOISESDB |
  20. Unified Source Separation Experiments

     ■ Methods:
       • Unconditional: an unconditional separation model with a fixed number of outputs
       • TUSS: the proposed prompting-based conditional model
     ■ Metric: SI-SDR [dB], except for MSS where SNR [dB] is used
     ■ Results:
       • The prompting-based unified model outperforms the unconditional model
       • Successfully handled tasks that require different numbers of sources and granularities

     | Method        | SE   | SS  | USS | MSS | CASS | Average |
     |---------------|------|-----|-----|-----|------|---------|
     | Unconditional | 14.0 | 6.8 | 8.1 | 4.9 | 6.0  | 8.0     |
     | TUSS          | 14.8 | 9.1 | 9.6 | 6.8 | 9.1  | 9.9     |
  21. Comparison with Specialist Models

     ■ Two types of models trained on different data
       • Specialist: a model trained on all the data for one task
       • TUSS: a model trained on all the available data
     ■ Results:
       • The unified model could not outperform the specialist models
       • Generally, larger models benefit more from larger data
         • What if we scale up the model?

     | Type       | SE   | SS   | USS  | MSS | CASS | Average |
     |------------|------|------|------|-----|------|---------|
     | Specialist | 15.9 | 10.3 | 10.2 | 8.3 | 9.7  | 10.9    |
     | TUSS       | 14.8 | 9.1  | 9.6  | 6.8 | 9.1  | 9.9     |
  22. Results on a Larger Model

     ■ TUSS achieved performance comparable to the specialists on several tasks
       • As expected, the larger model benefits from the large data
       • Shows its potential to serve as a foundation model for source separation

     | Model size | Type       | SE   | SS   | USS  | MSS | CASS | Average |
     |------------|------------|------|------|------|-----|------|---------|
     | Medium     | Specialist | 15.9 | 10.3 | 10.2 | 8.3 | 9.7  | 10.9    |
     | Medium     | TUSS       | 14.8 | 9.1  | 9.6  | 6.8 | 9.1  | 9.9     |
     | Large      | Specialist | 16.0 | 11.4 | 10.0 | 9.1 | 10.0 | 11.3    |
     | Large      | TUSS       | 15.1 | 10.3 | 12.2 | 7.4 | 10.1 | 11.0    |
  23. Controllability at Inference

     [Audio demo: a mixture of speech, SFX(-mix), and music-mix (vocals + other), separated with different prompt sets:]
       • <Speech>, <SFX-mix>, <Music-mix>
       • <Speech>, <SFX>, <SFX>, <Music-mix>
       • <Speech>, <SFX-mix>, <Vocals>, <Other>
  24. Controllability at Inference

     ■ TUSS trained with up to 4 prompts shows successful separation even with 5 prompts
       • <Speech>, <SFX>, <SFX>, <Vocals>, <Other>

     [Audio demo: the same mixture of speech, SFX(-mix), and music-mix (vocals + other)]
  25. Table of Contents (section divider; same contents as slide 3)
  26. Challenge in TF-Locoformer: Computational Cost

     ■ Computational cost is proportional to the number of frames T and the number of frequency bins F
       • F can be large (e.g., 1025) when the sampling rate is high
         • Music source separation, cinematic audio source separation, etc.
       • A unified source separation system needs to handle such data

     [Figure: TF-domain dual-path model; the separator alternates frequency modeling and temporal modeling over a D×F×T feature]
  27. Band-split (BS) Encoder/Decoder [Luo+, 2023]

     ■ Subband-wise encoding/decoding with K sub-encoders/sub-decoders

     [Figure: the spectrogram is split into K subbands according to a band config (e.g., mel); sub-encoder k maps its |b_k| bins to a single D-dimensional vector per frame, and the merged D×K×T feature goes to the separator]

     [Luo+, 2023]: Y. Luo and J. Yu, "Music source separation with band-split RNN," IEEE/ACM TASLP, 2023.
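A minimal numpy sketch of the band-split encoding idea (random linear layers stand in for the learned sub-encoders, and the band config is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, D = 9, 20, 16                          # toy sizes: bins, frames, feature dim
bands = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]    # hypothetical band config, K = 3
spec = rng.standard_normal((2, F, T))        # real/imag parts of the spectrogram

# One sub-encoder per band (here a random linear layer): 2*|b_k| -> D
weights = [rng.standard_normal((D, 2 * len(b))) for b in bands]

encoded = []
for b, w in zip(bands, weights):
    sub = spec[:, b, :].reshape(2 * len(b), T)  # stack real/imag bins of the band
    encoded.append(w @ sub)                     # (D, T): one vector per frame
feature = np.stack(encoded, axis=1)             # (D, K, T), K = number of bands

assert feature.shape == (D, len(bands), T)
```

Note how the frequency axis shrinks from F = 9 bins to K = 3 subband slots, which is exactly the compression the separator benefits from.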
  28. Psychoacoustically Motivated Inductive Bias

     ■ Band-splitting with psychoacoustic knowledge

     Figure from: K. N. Watcharasupat et al., "A generalized bandsplit neural network for cinematic audio source separation," IEEE OJSP, 2023.
  29. Inherent Limitations of the BS Module

     1. The encoding/decoding process (linear layer or MLP) is not input-adaptive
        • It cannot leverage input-dependent information
     2. Limited receptive field
        • The receptive field is inherently limited in order to incorporate the inductive bias
     3. Large parameter count
        • The total number of parameters in the encoder and decoder is more than twice that of the separator
  30. Compression by Sequence Modeling

     ■ Goal: design an encoder/decoder that satisfies
       1. Input-adaptive encoding/decoding leveraging input-dependent information
       2. Unlimited receptive field
       3. Small parameter count
     ■ Sequence modeling satisfies all the requirements
       • Cross-attention with a query of length F′ (Perceiver IO [Jaegle+, 2022])

     [Jaegle+, 2022]: A. Jaegle et al., "Perceiver IO: A general architecture for structured inputs & outputs," ICLR, 2022.
  31. Overall Separation Pipeline based on Perceiver IO

     ■ Encoder E
       • Encodes the spectrogram into a D-dimensional feature while compressing the frequency dimension from F to F′, via cross-attention with a learned query of length F′
     ■ Decoder D
       • Restores the original frequency resolution F via cross-attention, then predicts masks

     [Figure: mixture spectrogram → Conv2D + Norm → cross-attention (length-F′ query; keys/values from the feature) → separator → cross-attention (length-F query) → Deconv2D → masks, applied per time frame]
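The frequency-compression step can be sketched as a single cross-attention with a short learned query, in the spirit of Perceiver IO (toy sizes, single head, no learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
F, Fc, D = 65, 8, 16                    # full and compressed frequency dims
feature = rng.standard_normal((F, D))   # one time frame of the encoded spectrogram
query = rng.standard_normal((Fc, D))    # learned query of length F' << F (stand-in)

# Cross-attention: the short query attends over all F bins (keys = values = feature)
attn = softmax(query @ feature.T / np.sqrt(D))  # (F', F) attention weights
compressed = attn @ feature                     # (F', D) compressed feature

assert compressed.shape == (Fc, D)
assert np.allclose(attn.sum(axis=-1), 1.0)
```

Unlike a fixed band-split projection, the weights `attn` depend on the input feature itself, and each query row can in principle attend to any of the F bins.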
  32. Preliminary Experiment: Band-split vs. Cross-attention

     ■ Plain cross-attention performs much worse than the band-split module
       • Likely because cross-attention does not leverage any inductive bias
  33. Positional Bias to Incorporate Inductive Bias

     ■ Band-split
       • The k-th sub-encoder/sub-decoder is in charge of the k-th band
     ■ Cross-attention
       • The k-th query is in charge of the k-th band
       • It needs to learn where the k-th query should attend, but the model failed to learn this
     ■ Solution
       • Incorporate a psychoacoustically motivated inductive bias into cross-attention
       • Reformulate CA with a positional bias B ∈ ℝ^{F′×F} [Press+, 2022]:
         CrossAttention(Q, K, V) = softmax(QKᵀ + B) V

     [Press+, 2022]: O. Press et al., "Train short, test long: Attention with linear biases enables input length extrapolation," ICLR, 2022.
  34. Positional Bias as Attention Mask

     ■ CA with a positional bias B ∈ ℝ^{F′×F}:
       softmax(QKᵀ + B) V
     ■ Setting B to 0 inside the k-th band and −∞ outside enforces the k-th query to attend only to the k-th band
       • But this limits the receptive field

     [Figure: hard-mask biases for the encoder (F′×F) and decoder (F×F′) under the band-split config B = [[1–3], [4–5], [6–9]]; entries are 0 in-band and −∞ elsewhere]
  35. Design of Positional Bias

     ■ CA with a positional bias B ∈ ℝ^{F′×F}:
       softmax(QKᵀ + B) V
     ■ B is designed to encourage (not force) the k-th query to attend to the k-th band
       • A finite penalty that grows with the distance from the band, so the receptive field is not limited

     [Figure: soft biases for the encoder and decoder under the band-split config B = [[1–3], [4–5], [6–9]]; entries are 0 in-band and increasingly negative with distance from the band]
  36. Visualization of Positional Bias

     ■ Encoder's positional bias based on the Musical band config [Watcharasupat+, 2023]
       • The positional bias is shown after applying softmax for better visualization

     [Watcharasupat+, 2023]: K. N. Watcharasupat et al., "A generalized bandsplit neural network for cinematic audio source separation," IEEE OJSP, 2023.
  37. Final Form of the Proposed SFC-CA

     ■ CA-based Spectral Feature Compression (SFC-CA) with positional bias
       • The positional bias introduces an inductive bias analogous to the BS module
       • Built upon the Perceiver IO framework but specifically designed for compressing frequency information

     [Figure: the pipeline of slide 31 with positional biases added to the encoder (ℝ^{F′×F}) and decoder (ℝ^{F×F′}) cross-attention, derived from the band config]
  38. Preliminary Experiments on MSS and CASS

     ■ Encoder/decoder setup: F = 1025, F′ = 64 (Musical band config, p. 29)
     ■ MSS: SNR [dB] on MUSDB18-HQ

     | Model        | Enc/dec    | Params | Vocals | Bass | Drums | Other | Avg. |
     |--------------|------------|--------|--------|------|-------|-------|------|
     | TF-Loco. (S) | Band-split | 34.7 M | 9.0    | 8.2  | 9.9   | 5.9   | 8.3  |
     | TF-Loco. (S) | SFC-CA     | 5.8 M  | 9.6    | 8.7  | 10.8  | 6.7   | 9.0  |
     | TF-Loco. (M) | Band-split | 55.5 M | 9.6    | 8.9  | 10.4  | 6.2   | 8.8  |
     | TF-Loco. (M) | SFC-CA     | 16.0 M | 10.2   | 9.2  | 11.1  | 7.1   | 9.4  |

     ■ CASS: SNR [dB] on DnR

     | Model        | Enc/dec    | Params | Speech | Music | SFX  | Avg. |
     |--------------|------------|--------|--------|-------|------|------|
     | TF-Loco. (S) | Band-split | 34.7 M | 15.6   | 8.8   | 9.8  | 11.4 |
     | TF-Loco. (S) | SFC-CA     | 5.8 M  | 15.9   | 9.3   | 10.2 | 11.8 |
     | TF-Loco. (M) | Band-split | 55.5 M | 16.1   | 9.4   | 10.3 | 11.9 |
     | TF-Loco. (M) | SFC-CA     | 16.0 M | 16.4   | 9.7   | 10.6 | 12.2 |
  39. Integrating SFC into TUSS

     ■ Replacing the BS encoder/decoder with SFC

     [Figure: the TUSS pipeline of slide 14 with the encoder and decoders replaced by the SFC-Encoder and shared SFC-Decoders]
  40. Unified Source Separation Experiments

     ■ Model: TUSS medium
       • BSRoformer's band config: F′ = 61, without inter-band overlap
       • Musical band config: F′ = 64, with inter-band overlap
     ■ Evaluation in SI-SDR [dB] (except for MSS, where SNR [dB] is used):
       • SFC-CA performs consistently better than the band-split module

     | Enc/Dec     | Band config     | SE   | SS  | USS  | MSS | CASS | Average |
     |-------------|-----------------|------|-----|------|-----|------|---------|
     | Band-split* | BSRoformer (61) | 15.2 | 9.0 | 9.1  | 6.8 | 9.1  | 9.8     |
     | SFC-CA      | BSRoformer (61) | 15.4 | 9.1 | 10.6 | 6.8 | 9.2  | 10.2    |
     | Band-split  | Musical (64)    | 14.8 | 8.6 | 8.3  | 6.8 | 8.8  | 9.5     |
     | SFC-CA      | Musical (64)    | 15.5 | 9.2 | 10.4 | 7.2 | 9.3  | 10.3    |

     *: This result differs from the previous slides because the batch size is smaller (8 → 4).
  41. Table of Contents (section divider; same contents as slide 3)
  42. Data Scarcity in Source Separation

     ■ Source separation faces data scarcity
       • A mixture and its reference sources cannot be recorded at the same time
       • Collecting reference sources for diverse classes of sounds is challenging
         • Environmental sounds, music, etc.
     ■ To address data scarcity, unsupervised learning is promising
       • Ideally, the method should work on monaural data with an unknown number of sources
  43. Self-Remixing

     ■ Unsupervised source separation by iterating separation and remixing
       • The model is trained to reconstruct the initial mixtures from pseudo-mixtures
       • Enables unsupervised learning from unlabeled monaural mixtures
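The remixing step can be sketched in numpy: sources separated from a batch of mixtures are shuffled across the batch to form pseudo-mixtures, which a second separation pass is then trained to undo (random arrays stand in for the model's separated outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, T = 4, 2, 100                       # batch, sources per mixture, samples
sources = rng.standard_normal((B, N, T))  # pretend-separated sources (stand-in
                                          # for the separation model's outputs)
mixtures = sources.sum(axis=1)            # the original (unlabeled) mixtures

# Remixing: shuffle each source slot independently across the batch to form
# pseudo-mixtures; the model is then trained so that separating the
# pseudo-mixtures and un-shuffling reconstructs the original mixtures.
perms = np.stack([rng.permutation(B) for _ in range(N)], axis=1)  # (B, N)
pseudo = np.stack(
    [sources[perms[:, n], n] for n in range(N)], axis=1
).sum(axis=1)                                                     # (B, T)

# The pseudo-mixtures are a re-pairing of the same sources, so the total
# signal over the batch is conserved
assert np.allclose(pseudo.sum(axis=0), mixtures.sum(axis=0))
```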
  44. Large-scale Unsupervised Pre-training for TUSS

     ■ We can pre-train the backbone separation model with Self-Remixing
       • Training TUSS in an unsupervised manner is not easy, as granularity cannot be controlled without labels
       • But we can first pre-train an unconditional model and use it to initialize TUSS

     [Figure: unsupervised pre-training of a separation model on large-scale unlabeled data → the pre-trained model initializes the TUSS framework → supervised fine-tuning on labeled data]
  45. Unified Source Separation with Input Universality

     ■ Output universality
       • Variable number of sources
       • Multiple granularities with explicit control
       • Broad classes of sources
     ■ Input universality
       • Variable number of microphones
       • Multimodal prompts
       • Various types of distortions
     ■ Combining both leads to a universal model
  46. Summary

     ■ Goal: unified source separation
       • Separating a variable number and arbitrary classes of sources, with controllability
     ■ Approaches
       • TF-Locoformer: a Transformer-based separation model that forms the basis of TUSS
       • Task-aware Unified Source Separation (TUSS): a framework that controls the number of sources and their granularities by prompting
       • Spectral Feature Compression (SFC): an encoder/decoder that handles TF-domain features efficiently
     ■ Future directions
       • Large-scale unsupervised pre-training; TUSS with input universality