
Applications of Divergence Measures for Domain Adaptation in NLP

wing.nus
July 31, 2022

Abstract:
Machine learning models that work under different conditions are important for their deployment. The absence of robustness to varying conditions has adverse effects in the real world: autonomous cars crashing, adverse legal decisions made against minorities, and systems that work effectively only for the world's major languages. The lack of robustness arises because machine learning models trained under one input distribution are not guaranteed to work under a different distribution. An important aspect of making them work under different distributions is measuring how different the two distributions are. The mathematical tool of domain divergence quantifies this difference. Applying divergence measures in novel ways is an important avenue for making machine learning models more useful.

In this talk, we explore the different applications of divergence measures, with a special interest in adapting NLP models to new inputs that arise naturally. We first identify the different divergence measures used within Natural Language Processing (NLP) and provide a taxonomy. Further, we identify applications of divergences and make contributions along them: 1) Making Decisions in the Wild -- helping practitioners predict the performance drop of a model under a new distribution; 2) Learning Representations -- aligning source and target domain representations for novel applications; 3) Inspecting the Internals of a Model -- understanding the inherent robustness of models under new distributions.

This talk presents a brief overview of two of these applications. For the first, we performed a large-scale correlational study of different divergence measures against the decrease in model performance. We compare whether divergence measures based on traditional word-level distributions are more reliable than those based on contextual word representations from pretrained language models, and based on our study, we make recommendations for the divergence measures that best predict performance drop. In the second application, we employ machine learning models that reduce the divergence between two domains, enabling us to generate sentences across domains. Further, we enhance the model to produce sentences that satisfy certain linguistic constraints, with downstream applications to domain adaptation in Natural Language Processing.

We also present ongoing work that applies divergence methods in a parameter-efficient manner for domain adaptation in NLP. Our method follows a two-step process: it first extracts domain-invariant representations by reducing divergence measures between the two domains and then reduces the task-specific loss on labeled data in the source domain; both steps are achieved using adapters. We conclude with some future directions for domain adaptation in NLP.

Bio:

Abhinav Ramesh Kashyap is a fourth-year Ph.D. student advised by Prof Min-Yen Kan. His research is on Natural Language Processing; specifically, he focuses on making NLP models robust under different domains, also called Domain Adaptation. He is also interested in Scholarly Document Processing, which extracts information from scholarly articles.

Seminar page: https://wing-nus.github.io/nlp-seminar/speaker-abhinav
YouTube Video recording: https://youtu.be/ycdG5bozFT0

Transcript

  1. Applications of Divergence Measures for Domain Adaptation in NLP Abhinav

    Ramesh Kashyap School of Computing, National University of Singapore
  2. 11 Train ~ PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". Test ~ PS(x, y): "No Problems. Smooth running", "Sunglasses are great", "Perfect fit", "Horrible boots".
  3. 12 Train ~ PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". Test ~ PS(x, y): "No Problems. Smooth running", "Sunglasses are great", "Perfect fit", "Horrible boots".
  4. 13 Train ~ PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". Test ~ PT(x, y): "Absolute bad!!", "Awesomeee sunglasses!!", "Its absolute whack", "No problem boots y'all".
  5. 14 PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". PT(x, y): "Awesomeee sunglasses!!", "Its absolute whack", "No problem boots y'all", "Absolute Crap!!". DOMAIN SHIFT
  6. 15 PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". PT(x, y): "Crap!!", "Awesomeee sunglasses!!", "Its absolute whack", "No problem boots y'all". DOMAIN SHIFT -- HOW MUCH IS THE DOMAIN SHIFT? DIVERGENCE MEASURES: ℋ-Divergence, CORAL, CMD, Wasserstein, PAD.
  7. 16 LEARNING TRANSFERABLE FEATURES [architecture diagram: Conv1-Conv4 layers (frozen, then fine-tuned) feeding a classifier (CLF), with a divergence term between 𝒮 and 𝒯 at the higher layers]. Long et al., 2015 -- Learning Transferable Features with Deep Adaptation Networks, ICML.
  8. 17 DATA SELECTION: How likely is a sentence t ∈ 𝒯 according to a LM trained on 𝒮, compared to a LM trained on 𝒯? (Labelled data: 𝒮; unlabelled data: 𝒯.) This is the most popular approach in Machine Translation.
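A minimal sketch of this data-selection idea (a Moore-Lewis-style cross-entropy difference), assuming two causal language models stand in for "LM trained on 𝒮" and "LM trained on 𝒯"; the model names, candidate sentences, and selection rule below are placeholders, not the setup from the talk.

```python
# Sketch: score candidate sentences by how much more likely they are under an
# in-domain (target) LM than under a general/source LM; keep the lowest scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_nll(model, tokenizer, sentence):
    """Mean per-token negative log-likelihood of a sentence under a causal LM."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

# Placeholder checkpoints; in practice these would be LMs trained/adapted on S and T.
tok = AutoTokenizer.from_pretrained("gpt2")
lm_src = AutoModelForCausalLM.from_pretrained("gpt2")
lm_trg = AutoModelForCausalLM.from_pretrained("gpt2-medium")

def cross_entropy_difference(sentence):
    # Lower score => the sentence looks more like the target domain than the source.
    return sentence_nll(lm_trg, tok, sentence) - sentence_nll(lm_src, tok, sentence)

candidates = ["Awesomeee sunglasses!!", "No Problems. Everything ran smoothly"]
selected = sorted(candidates, key=cross_entropy_difference)[:1]
```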
  9. 21 Application of Divergence Measures: (1) Making Decisions in the Wild -- Which Divergence Measure Best Predicts Performance Drops? (Kashyap, A.R. et al., NAACL'21); (2) Learning Representations -- How Can We Generate Sentences Across Domains? (Kashyap, A.R. et al., ACL'22); (3) Inspecting Internals of the Model -- How Robust are Representations from Large-Scale Language Models? (Kashyap, A.R. et al., EACL'21); (4) How Can We Make Domain Adaptation More Efficient? (Under Review, EMNLP'22).
  10. 23 TAXONOMY: Geometric (P-norm: Euclidean, Manhattan; Cosine; F), Information-Theoretic (KL Div, JS Div, Power, Renyi, CE), Higher Order (Optimal Transport: Wasserstein; Moment Matching: MMD, CMD; Domain Discriminator: CORAL, PAD).
  11. 28 TAXONOMY: Geometric -- calculates the distance between vectors in a metric space. Information-Theoretic -- principles from information theory to quantify the difference in distributions. Higher Order -- statistical measures that consider different-order moments of random variables (Moment Matching: MMD, CMD; Domain Discriminator: CORAL, PAD); e.g., how different are the means (1st order) of P and Q, how different are the covariances (2nd order) of P and Q.
  12. 29 TAXONOMY: Geometric -- calculates the distance between vectors in a metric space (P-norm: Euclidean, Manhattan; Cosine; F). Information-Theoretic -- principles from information theory to quantify the difference in distributions (KL Div, JS Div, Power, Renyi, CE). Higher Order -- statistical measures that consider different-order moments of random variables (Optimal Transport: Wasserstein; Moment Matching: MMD, CMD; Domain Discriminator: CORAL, PAD).
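For concreteness, here is an illustrative sketch of one measure from each branch of the taxonomy, computed over two sets of sentence representations or term counts (Geometric: cosine distance; Information-theoretic: JS divergence; Higher-order: a CMD-style moment match). The helper names and the choice of five moments are assumptions for illustration, not the exact configurations from the papers.

```python
# One illustrative measure per taxonomy branch.
import numpy as np
from scipy.spatial.distance import jensenshannon

def cosine_distance(p_mean, q_mean):
    """Geometric: cosine distance between the mean representations of two domains."""
    cos = np.dot(p_mean, q_mean) / (np.linalg.norm(p_mean) * np.linalg.norm(q_mean))
    return 1.0 - cos

def js_divergence(p_counts, q_counts):
    """Information-theoretic: JS divergence between unigram (term) distributions."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return jensenshannon(p, q) ** 2  # scipy returns the JS *distance* (sqrt of the divergence)

def cmd(source_feats, target_feats, n_moments=5):
    """Higher-order: CMD-style matching of the first k moments of the feature distributions."""
    sm, tm = source_feats.mean(0), target_feats.mean(0)
    d = np.linalg.norm(sm - tm)                 # 1st moment: means
    cs, ct = source_feats - sm, target_feats - tm
    for k in range(2, n_moments + 1):           # higher central moments
        d += np.linalg.norm((cs ** k).mean(0) - (ct ** k).mean(0))
    return d
```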
  13. 30 Application of Divergence Measures: (1) Making Decisions in the Wild -- Which Divergence Measure Best Predicts Performance Drops? (Kashyap, A.R. et al., NAACL'21); (2) Learning Representations -- How Can We Generate Sentences Across Domains? (Kashyap, A.R. et al., ACL'22); (3) Inspecting Internals of the Model -- How Robust are Representations from Large-Scale Language Models? (Kashyap, A.R. et al., EACL'21); (4) How Can We Make Domain Adaptation More Efficient?
  14. 33 DECISION IN THE WILD: PT(x, y) -- let's annotate more data, and repeat. Expensive and time consuming.
  15. 34 DECISION IN THE WILD: PT(x, y) -- let's annotate more data, and repeat. Expensive, time consuming, and can continue indefinitely.
  16. 35 DECISION IN THE WILD: Given PS(X, Y) and data from another domain PT(x, y), instead of annotating more data, can we predict the drop in performance of a model? Divergence measures are applied to predict the drop in performance: Information-Theoretic -- Renyi Div for POS (Van Asch and Daelemans, 2012), CE for dependency parsing (Ravi et al., 2008); Higher-Order -- PAD and others (Elsahar et al., 2020).
  17. Applications -- Decisions in the Wild (36). DECISION IN THE WILD: Given PS(X, Y) and data from another domain PT(x, y), instead of annotating more data, can we predict the drop in performance our model experiences? Divergence measures are applied to predict the drop in performance: Information-Theoretic -- Renyi Div for POS (Van Asch and Daelemans, 2012), CE for dependency parsing (Ravi et al., 2008); Higher-Order -- PAD and others (Elsahar et al., 2020). There are so many divergence measures. Which divergence measure best predicts the drop in performance? Does using Contextual Word Representations to calculate them have advantages?
  18. 37 EMPIRICAL STUDY. Datasets: POS -- 5 corpora from the English Web Treebank; NER -- 8 different corpora; Sentiment Analysis -- Amazon Review Dataset with 5 categories. Divergence measures: Geometric -- Cos; Information-Theoretic -- KL-Div, JS-Div, Renyi-Div; Higher-Order -- PAD, Wasserstein, MMD (with different kernels), CORAL. Method: fine-tune a DistilBERT model on the source domain 𝒮; Performance Drop = Accuracy on test data of 𝒮 - Accuracy on test data of 𝒯; correlate the performance drop with the divergence between domains.
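A minimal sketch of the correlation step described on this slide, assuming the per-domain-pair accuracies and divergence values have already been computed; the record values and the use of Pearson correlation here are illustrative placeholders.

```python
# Correlate performance drop with divergence across source->target domain pairs.
from scipy.stats import pearsonr

# One entry per (source, target) pair -- hypothetical numbers for illustration only.
records = [
    {"pair": ("books", "electronics"), "acc_src_test": 0.91, "acc_trg_test": 0.84, "divergence": 0.12},
    {"pair": ("books", "kitchen"),     "acc_src_test": 0.91, "acc_trg_test": 0.79, "divergence": 0.21},
    {"pair": ("dvd", "electronics"),   "acc_src_test": 0.89, "acc_trg_test": 0.80, "divergence": 0.17},
]

drops = [r["acc_src_test"] - r["acc_trg_test"] for r in records]
divs = [r["divergence"] for r in records]

corr, p_value = pearsonr(divs, drops)
print(f"correlation between divergence and performance drop: {corr:.3f} (p={p_value:.3f})")
```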
  19. 38 RESULTS. Better correlation = better indicator of performance drop.
    Divergence/Task   POS     NER     SA
    Cos               0.018   0.223   -0.012
    KL-Div            0.394   0.384   0.715
    JS-Div            0.407   0.484   0.709
    Renyi Div         0.392   0.382   0.716
    PAD               0.477   0.426   0.538
    Wasserstein       0.378   0.463   0.448
    MMD-RQ            0.248   0.495   0.614
    MMD-Gaussian      0.402   0.221   0.543
    MMD-Energy        0.244   0.447   0.521
    MMD-Laplacian     0.389   0.273   0.623
    CORAL             0.349   0.484   0.267
    No single measure is the best indicator of drop in performance.
  20. 39 RESULTS (same correlation table as the previous slide; better correlation = better indicator of performance drop). No single measure is the best indicator of drop in performance. PAD is a reliable metric across different tasks. JS-Div consistently provides good correlations across different tasks.
  21. Empirical Analysis (40) (same correlation table; better correlation = better indicator of performance drop). No single measure is the best indicator of drop in performance. PAD is a reliable metric across different tasks. JS-Div consistently provides good correlations across different tasks. Compared to measures calculated from Contextual Word Representations, simple measures using frequency-based distributions are still reliable indicators of performance drops.
  22. 41 WHY ARE SOME MEASURES BETTER THAN OTHERS? When there is clear separation, there is a good indication that divergence correlates well with performance drop (Corr = 0.716). (Same correlation table as above.)
  23. 42 CONCLUSION / SUGGESTIONS. Divergence measures are a primary tool in domain adaptation. They can be categorised as Geometric, Information-Theoretic, and Higher-Order. Which measure should you use for predicting the drop in performance of the model? PAD is a reliable indicator across all tasks; JS-Div is easy to compute and based on simple word distributions -- use it. Abhinav Ramesh Kashyap (abhinavkashyap.io), Devamanyu Hazarika (devamanyu.com), Min-Yen Kan (www.comp.nus.edu.sg/~kanmy), Roger Zimmermann (www.comp.nus.edu.sg/~rogerz).
  24. 43 Application of Divergence Measures: (1) Making Decisions in the Wild -- Which Divergence Measure Best Predicts Performance Drops? (Kashyap, A.R. et al., NAACL'21); (2) Learning Representations -- How Can We Generate Sentences Across Domains? (Kashyap, A.R. et al., ACL'22); (3) Inspecting Internals of the Model -- How Robust are Representations from Large-Scale Language Models? (Kashyap, A.R. et al., EACL'21); (4) How Can We Make Domain Adaptation More Efficient?
  25. So Different Yet So Alike! Constrained Unsupervised Text Style Transfer

    Abhinav Ramesh Kashyap*, Devamanyu Hazarika*, Min-Yen Kan, Roger Zimmermann, Soujanya Poria
  26. 46 Source Domain 𝒮 Just open the door! Could you

    please open the door for me Target Domain 𝒯
  27. 48 Definition of Text Style Transfer [1]: given a sentence x with attribute a (e.g., "Just open the door!", Informal), generate x′ with attribute a′ (e.g., "Could you please open the door", Formal), i.e., model p(x′ | x, a). Data-Driven Definition of Style: link an attribute to the corpus (Style 1 = 𝒮, Style 2 = 𝒯); the attribute can be topic, content, or meta information. [1]: Di Jin et al., Deep Learning for Text Style Transfer, Computational Linguistics Journal, 2022.
  28. 49 Supervised Method: requires parallel data ("Just open the door!" -> "Could you please open the door"); sequence-to-sequence neural network models; parallel data is hard to obtain and not scalable. Unsupervised Method: does not require parallel data; uses the data-driven definition of style; encoder-decoder with a latent space z; manipulate the latent space to disentangle content and style* (*major approaches disentangle style and content; there are other methods that do not disentangle).
  29. 50 DIALOG (Restaurant -> Movie): "I would like to book a restaurant for 2 at 7:00pm" -> "I would like to book a movie for 2 at 7:00pm". TWITTER (Social Media -> Newspaper): "Trump is ousted in 2022 elections" -> "Trump loses 2022 elections".
  30. 51 Intro. DIALOG (Restaurant -> Movie): "I would like to book a restaurant for 2 at 7:00pm" -> "I would like to book a movie for 2 at 7:00pm". TWITTER (Social Media -> Newspaper): "Trump is ousted in 2022 elections" -> "Trump loses 2022 elections". MAINTAINING CONSTRAINTS IS IMPORTANT BUT IGNORED.
  31. 52 "I really loved Murakami's book" (personal pronoun, proper noun) -- Text Style Transfer: "Loved the movie" vs. Text Style Transfer + Constraints: "I absolutely enjoyed Spielberg's direction".
  32. 53 Intro. Constraints need to be maintained after transfer: the personal pronoun "I" is maintained; the number of proper nouns is maintained (Murakami -> Spielberg). "I really loved Murakami's book" (personal pronoun, proper noun) -- Text Style Transfer: "Loved the movie" vs. Text Style Transfer + Constraints: "I absolutely enjoyed Spielberg's direction".
  33. 54 CONTRARAE (Intro). Constraints need to be maintained after transfer: the personal pronoun "I" is maintained; the number of proper nouns is maintained (Murakami -> Spielberg). Plain Text Style Transfer ("Loved the movie") gives a smaller length, no personal pronoun, and no proper noun; Text Style Transfer + Constraints ("I absolutely enjoyed Spielberg's direction") gives similar length, similar personal pronouns, and domain-appropriate proper nouns. [Architecture diagram: tied encoders encθ / encψ, decoders decϕ / decη, criticξ, latents z and z̃, inputs xsrc, xtgt and outputs x̂src, x̂trg, with losses ℒae, ℒcri, ℒadv, ℒcon, ℒclf.] We introduce a GAN-based seq2seq network that explicitly enforces such constraints, with two cooperative losses (the discriminator and the generator reduce the same loss). Contrastive Loss -- brings sentences with similar constraints closer together and pushes sentences with different constraints far away.
  34. 55 CONTRARAE (Intro). Constraints need to be maintained after transfer (as in the previous slide). We introduce a GAN-based seq2seq network that explicitly enforces such constraints, with two cooperative losses (the discriminator and the generator reduce the same loss): Contrastive Loss -- brings sentences with similar constraints closer together and pushes sentences with different constraints far away; Classifier Loss -- a discriminative classifier identifies the constraints from the latent space.
  35. 56 ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE) [components: encθ, decϕ, gψ, criticξ, z, z̃, x, s ∼ 𝒩]. The aim is to generate natural sentences: learn a representation space over a prior distribution (𝒩) that mimics the real distribution. Encoder -- encodes (real) sentences x: encθ : 𝒳 -> 𝒵, z ∼ Pz.
  36. 57 ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE). Encoder -- encodes (real) sentences x: encθ : 𝒳 -> 𝒵, z ∼ Pz. Decoder -- reconstructs sentences from the latent: pϕ(x | z).
  37. 58 ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE). Encoder -- encθ : 𝒳 -> 𝒵, z ∼ Pz. Decoder -- pϕ(x | z). Generator -- maps noise samples to a latent space: gψ : 𝒩(0, 1) -> 𝒵̃.
  38. 59 ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE). Encoder -- encθ : 𝒳 -> 𝒵, z ∼ Pz. Decoder -- pϕ(x | z). Generator -- gψ : 𝒩(0, 1) -> 𝒵̃. Critic -- distinguishes the real vs. generated representations: min_ψ max_ξ 𝔼_{z∼Pz}[crcξ(z)] - 𝔼_{z̃∼Pz̃}[crcξ(z̃)].
  39. 60 ARAE losses. Reconstruction loss: ℒae(θ, ϕ) = 𝔼_{z∼Pz}[-log pϕ(x | z)] -- loss to reconstruct sentences; encourages copying behaviour; maintains semantic similarity.
  40. 61 ARAE losses. Reconstruction loss ℒae(θ, ϕ) as before. Critic loss: ℒcrc(ξ) = -𝔼_{z∼Pz}[crcξ(z)] + 𝔼_{z̄∼Pz̄}[crcξ(z̄)] -- the critic should succeed in distinguishing real from fake.
  41. 62 ARAE losses. Reconstruction loss ℒae(θ, ϕ) and critic loss ℒcrc(ξ) as before. Adversarial loss: ℒadv(θ, ψ) = 𝔼_{z∼Pz}[crcξ(z)] - 𝔼_{z̄∼Pz̄}[crcξ(z̄)] -- the generator and the encoder should fool the critic.
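A condensed PyTorch sketch of how these three ARAE losses could be wired together for one training step. The module signatures (in particular `decoder(z, x)`), the absence of weight clipping or a gradient penalty, and the optimizer schedule are simplifications, not the original ARAE recipe.

```python
# Sketch of the three ARAE losses: reconstruction, critic, and adversarial.
import torch
import torch.nn.functional as F

def arae_losses(encoder, decoder, generator, critic, x, noise):
    z = encoder(x)                          # real latent: enc_theta(x)
    z_tilde = generator(noise)              # generated latent: g_psi(s), s ~ N(0, 1)

    # L_ae: reconstruct the sentence from the real latent (teacher forcing assumed in decoder)
    logits = decoder(z, x)                  # (batch, seq_len, vocab)
    loss_ae = F.cross_entropy(logits.reshape(-1, logits.size(-1)), x.reshape(-1))

    # L_crc: the critic should separate real latents from generated ones
    loss_critic = -critic(z.detach()).mean() + critic(z_tilde.detach()).mean()

    # L_adv: the encoder and generator try to fool the critic (slide's sign convention)
    loss_adv = critic(z).mean() - critic(z_tilde).mean()

    return loss_ae, loss_critic, loss_adv
```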
  42. 63 seq2seq ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE) [architecture diagram: encoders encθ and encψ, latents z and z̃, criticξ, decoders decϕ and decη; labels: Target, 𝒮].
  43. 64 Method. seq2seq ARAE [same architecture diagram; labels: Target, 𝒯].
  44. 65 Method. seq2seq ARAE [same architecture diagram; labels: Target, 𝒮].
  45. 66 Method. seq2seq ARAE [same architecture diagram; labels: Target, 𝒯].
  46. 67 Method. seq2seq ARAE [same architecture diagram; labels: Target, 𝒮, 𝒯].
  47. 69 CONTRASTIVE LOSS ℒcon [architecture diagram: encθ -> z, criticξ, encψ -> z̃, decϕ, decη; Target 𝒯, Source 𝒮; tied; inputs xsrc, xtgt and outputs x̂src, x̂trg; losses ℒae, ℒcri, ℒadv]. ℒcon(θ, ψ, ξ) = -(1/|P|) Σ_{j=1}^{P} log( e^{zi·zj} / Σ_{k=1}^{B∖{i}} e^{zi·zk} ). Given a sentence s ∈ Src, mine P sentences each from Src and Trg; all other sentences in the batch are negatives; z are representations from the encoders or the last layer of the critic. We add it to both the encoder and the critic. Similar ideas in Kang et al., 2020 -- ContraGAN: Contrastive Learning for Conditional Image Generation, NeurIPS.
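An illustrative PyTorch version of this contrastive term. The batch layout (which indices count as positives with similar constraints) and the temperature-free dot-product similarity follow the formula on the slide rather than a particular released implementation.

```python
# Contrastive loss over latent codes: pull together sentences with similar
# constraints (positives), push apart all other sentences in the batch.
import torch

def contrastive_loss(z, positive_mask):
    """
    z: (B, d) latent codes from the encoder or the critic's last layer.
    positive_mask: (B, B) bool, True where sentence j has constraints similar to sentence i.
    """
    sim = z @ z.t()                                          # (B, B) dot-product similarities
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude z_i . z_i from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)          # avoid -inf * 0 on the diagonal
    pos = (positive_mask & ~self_mask).float()
    # Average log-probability of the positives for each anchor, then negate.
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1.0)
    return loss.mean()
```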
  48. 70 Classifier loss ℒclf [same seq2seq ARAE architecture diagram with losses ℒae, ℒcri, ℒadv, ℒcon, ℒclf]. ℒclf(θ, ϕ, ξ, δ) = -Σ_{c=1}^{|𝒞|} log( σ(lc)^{yc} (1 - σ(lc))^{1-yc} ), where |𝒞| is the number of constraints per sentence, lc the logit for class c, and σ(·) the sigmoid function. It might be hard to mine positive and negative instances, so we instead encourage the encoders and the critic to reduce a classification loss. Similar ideas in ACGAN (Odena et al., 2017 -- Conditional Image Synthesis with Auxiliary Classifier GANs, ICML).
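A short sketch of this constraint-classifier term, which amounts to a multi-label binary cross-entropy over the |𝒞| constraints; the constraint count, latent dimension, and linear head below are placeholders.

```python
# Multi-label constraint classifier on the latent space (one sigmoid per constraint).
import torch
import torch.nn as nn

NUM_CONSTRAINTS = 3          # e.g. length bucket, has-personal-pronoun, #proper-nouns bucket (illustrative)
LATENT_DIM = 128             # placeholder

constraint_head = nn.Linear(LATENT_DIM, NUM_CONSTRAINTS)
# Mean over batch and constraints of -[ y_c log sigma(l_c) + (1 - y_c) log(1 - sigma(l_c)) ].
bce = nn.BCEWithLogitsLoss()

def classifier_loss(z, constraint_labels):
    """z: (B, LATENT_DIM) latents; constraint_labels: (B, NUM_CONSTRAINTS) in {0, 1}."""
    logits = constraint_head(z)
    return bce(logits, constraint_labels.float())
```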
  49. 71 DATASETS: YELP -- business reviews labelled as either positive or negative; IMDB -- movie reviews labelled as either positive or negative; POLITICAL -- Facebook posts labelled with either a Republican or Democratic slant. METRICS: ACC -- how well the sentence adheres to the target domain; FL -- how fluent the sentence is; SIM -- how semantically similar the sentence is to the source domain; AGG -- joint metric at the instance level.
  50. 72 OVERALL RESULTS.
    Model (Sampling)                          | Yelp: ACC FL SIM AGG | IMDB: ACC FL SIM AGG | POLITICAL: ACC FL SIM AGG
    DRG (Greedy)                              | 67.4 54.5 43.6 16.7  | 56.5 44.3 54.1 14.4  | 61.3 35.7 38.7 8.8
    ARAE (Greedy)                             | 93.1 67.9 31.2 19.8  | 95.0 76.3 26.4 19.9  | 63.0 72.1 17.3 11.0
    ARAE_seq2seq +CLF +CONTRA (Greedy)        | 89.3 69.2 32.9 20.6  | 97.8 84.0 33.5 28.1  | 99.0 56.8 41.8 24.9
    ARAE_seq2seq +CLF +CONTRA (nucleus p=0.6) | 89.4 68.6 32.8 20.4  | 97.1 82.6 33.6 27.4  | 99.0 56.0 41.6 24.4
    Compared to DRG (Li et al.) and ARAE (Zhao et al.), our method has a better aggregate score on 3 different datasets. Regularizing the latent space brings advantages to the overall quality of generated sentences. Li et al., Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer, NAACL. Zhao et al., Adversarially Regularized Autoencoders, ICML.
  51. 73 REMOVING LOSS ON GENERATOR AND CRITIC (ARAE_seq2seq + CLF).
    Model                 ACC   FL    SIM   AGG
    ARAE_seq2seq + CLF    95.0  83.2  34.2  27.5
    -generator            96.2  87.2  31.3  26.7
    -critic               94.9  84.4  30.8  25.5
    Adding the CLF loss improves the overall AGG score; it mostly improves the SIM score.
  52. 74 REMOVING LOSS ON GENERATOR AND CRITIC.
    Model                    ACC   FL    SIM   AGG
    ARAE_seq2seq + CLF       95.0  83.2  34.2  27.5
    -generator               96.2  87.2  31.3  26.7
    -critic                  94.9  84.4  30.8  25.5
    Adding the CLF loss improves the overall AGG score; it mostly improves the SIM score.
    Model                    ACC   FL    SIM   AGG
    ARAE_seq2seq + CONTRA    96.1  80.6  36.0  28.6
    -generator               93.5  78.8  34.0  26.0
    -critic                  90.1  67.8  39.5  24.9
    Adding the CONTRA loss improves the overall AGG score; it improves the ACC and FL scores.
  53. 75 Intro. [Recurring DIALOG / TWITTER domain-transfer example.] CLF and CONTRA losses are complementary, and both are necessary on the CRITIC and the GENERATOR to improve the AGG score.
  54. 76 MAINTAINING CONSTRAINTS (LENGTH, DESCRIPTIVENESS, # DOMAIN-SPECIFIC ATTRS) [chart comparing DRG, ARAE, ARAE_seq2seq, ARAE_seq2seq +CLF, ARAE_seq2seq +CONTRA, ARAE_seq2seq +CLF +CONTRA]. Adding cooperative losses helps in maintaining constraints. LENGTH is an easier constraint to maintain for most methods; syntactic attributes like DESCRIPTIVENESS and # DOMAIN-SPECIFIC ATTRS are harder to maintain. Improvements in the AGG score do not mean constraints are maintained.
  55. 77 CONCLUSION. [Recurring DIALOG / TWITTER domain-transfer example.] Unsupervised style transfer methods do not define what constraints are maintained. We introduced two cooperative losses to ARAE to maintain constraints. We improve the general quality of transferring sentences between domains and, in addition, maintain the constraints between the domains in a better manner. Abhinav Ramesh Kashyap (abhinavkashyap.io), Devamanyu Hazarika (devamanyu.com), Min-Yen Kan (www.comp.nus.edu.sg/~kanmy), Roger Zimmermann (www.comp.nus.edu.sg/~rogerz), Soujanya Poria (sporia.info/).
  56. 78 Application of Divergence Measures: (1) Making Decisions in the Wild -- Which Divergence Measure Best Predicts Performance Drops? (Kashyap, A.R. et al., NAACL'21); (2) Learning Representations -- How Can We Generate Sentences Across Domains? (Kashyap, A.R. et al., ACL'22); (3) Inspecting Internals of the Model -- How Robust are Representations from Large-Scale Language Models? (Kashyap, A.R. et al., EACL'21); (4) How Can We Make Domain Adaptation More Efficient? (In progress.)
  57. 80 LEARNING DOMAIN INVARIANT REPRESENTATIONS. Labelled data (xs, ys)_{s=1}^{ns} ∼ 𝒟_𝒮; unlabelled data (xt)_{t=1}^{nt} ∼ 𝒟_𝒯. A feature extractor feeds a classifier trained on 𝒮. Domain-invariant representations have the same feature distribution irrespective of whether the data comes from the source or the target domain: D_𝒮(X) = D_𝒯(X).
  58. 81 LEARNING DOMAIN INVARIANT REPRESENTATIONS. With domain-invariant representations, the classifier also works on 𝒯 (Ben-David et al., 2010 -- A theory of learning from different domains).
  59. 82 LEARNING DOMAIN INVARIANT REPRESENTATIONS. Distribution alignment: the feature extractor is trained with the classifier on 𝒮 and an alignment objective so that D_𝒮(X) = D_𝒯(X).
  60. 83 LEARNING DOMAIN INVARIANT REPRESENTATIONS. [Same diagram: feature extractor, distribution alignment, classifier on 𝒮.]
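A hedged sketch of the generic objective behind these slides: a task loss on labelled source data plus a distribution-alignment penalty between source and target features. MMD with a single RBF bandwidth is used here as one concrete alignment choice; the feature extractor, weighting `lam`, and batching are placeholders.

```python
# One training step for learning domain-invariant representations:
# task loss on source labels + divergence between source and target features.
import torch
import torch.nn.functional as F

def gaussian_mmd(x, y, sigma=1.0):
    """A simple biased MMD estimate with an RBF kernel (single bandwidth, for illustration)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def uda_step(feature_extractor, classifier, x_src, y_src, x_trg, lam=0.1):
    h_src = feature_extractor(x_src)          # features of a labelled source batch
    h_trg = feature_extractor(x_trg)          # features of an unlabelled target batch

    task_loss = F.cross_entropy(classifier(h_src), y_src)   # supervised loss on S only
    align_loss = gaussian_mmd(h_src, h_trg)                  # push D_S(h) towards D_T(h)
    return task_loss + lam * align_loss
```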
  61. 84 Intro. [Recurring DIALOG / TWITTER domain-transfer example.] BUT ALL THE PARAMETERS OF THE FEATURE EXTRACTOR ARE UPDATED. CAN WE MAKE IT MORE EFFICIENT?
  62. 85 LEARNING DOMAIN INVARIANT REPRESENTATIONS (with adapters). Domain adapter (down- and up-projection with a residual): dom_l = W_up · f(W_down · h_l) + r_l. Per-layer divergence (reduce divergence between 𝒟_𝒮 and 𝒟_𝒯): Δ_l = div(h_l^src, h_l^trg). ℒ_div = Σ_{l=1}^{L} Δ_l.
  63. 86 STACKING DOMAIN AND TASK ADAPTERS. Task adapter stacked on the domain adapter: task_l = W_up · f(W_down · dom_l) + r_l. Task classifier: ℒ_task = softmax_ce(W_task · h_L).
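A sketch of the adapter pieces described on these two slides: a bottleneck adapter with a residual, a per-layer divergence term for the domain step, and a stacked task adapter plus classifier for the task step. The dimensions, the divergence function, how hidden states are obtained from the backbone, and the freezing schedule are assumptions, not the exact implementation of the work under review.

```python
# Bottleneck adapters for the two-step recipe: (1) domain adapter trained with a
# divergence loss between source/target hidden states, (2) task adapter stacked on top.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """dom_l / task_l = W_up . f(W_down . h_l) + residual r_l"""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return self.up(F.relu(self.down(h))) + h   # residual connection

domain_adapter = Adapter()
task_adapter = Adapter()
task_head = nn.Linear(768, 2)                      # W_task (binary task as a placeholder)

def domain_step(h_src_layers, h_trg_layers, divergence):
    # L_div = sum over layers of div(h_l^src, h_l^trg), computed after the domain adapter.
    return sum(divergence(domain_adapter(hs), domain_adapter(ht))
               for hs, ht in zip(h_src_layers, h_trg_layers))

def task_step(h_src_last, y_src):
    # Stack the (assumed frozen) domain adapter with the task adapter, then classify.
    h = task_adapter(domain_adapter(h_src_last))
    return F.cross_entropy(task_head(h), y_src)    # softmax cross-entropy with W_task
```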
  64. 87 HIGHLIGHTS. [Recurring DIALOG / TWITTER domain-transfer example.] We perform better than just learning a task adapter. We perform close to fully fine-tuned UDA methods, at only a fraction of the cost. Our task adapters are reusable between different domains. Including adapters only in certain layers brings further savings.
  65. 89 Future Work. CONTINUOUS ADAPTATION TO NEW DOMAINS -- make domain adaptation work for a stream of domains; re-use / compose information learnt from previous domains to learn on new domains. DOMAIN GENERALIZATION -- assume no access to target-domain data for domain adaptation.
  66. 91 Why are some measures better at predicting the drop in performance of the model? Do they capture the underlying domains in a better manner? Are different datasets different domains? We initially assume dataset-is-domain and calculate the Silhouette Coefficient: if the datasets are domains, there will be clear clusters in representation space; a positive score indicates good separation of clusters, while a negative score indicates that the majority of points in a cluster should have belonged to the other cluster.
    Divergence/Task   POS          NER          SA
    Cos               -1.78x10^-1  -2.49x10^-1  -2.01x10^-1
    KL-Div            -            -            -
    JS-Div            -8.5x10^-2   -6.4x10^-2   2.04x10^-2
    Renyi Div         -            -            -
    PAD               -            -            -
    Wasserstein       -2.11x10^-1  -2.36x10^-1  -1.70x10^-1
    MMD-RQ            -4.11x10^-2  -3.04x10^-2  -1.70x10^-2
    MMD-Gaussian      4.25x10^-5   2.37x10^-3   -8.45x10^-5
    MMD-Energy        -9.84x10^-2  -1.14x10^-1  -8.48x10^-2
    MMD-Laplacian     -1.67x10^-3  4.26x10^-4   -1.08x10^-3
    CORAL             -2.34x10^-1  -2.78x10^-1  -1.41x10^-1
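A brief sketch of this silhouette check, assuming sentence representations for two datasets are available as arrays; sklearn's silhouette_score is used as the concrete implementation, and the representation source is a placeholder.

```python
# Treat each dataset as a cluster and ask whether the representations actually separate.
import numpy as np
from sklearn.metrics import silhouette_score

def dataset_is_domain_score(reps_a, reps_b):
    """reps_a, reps_b: (n, d) arrays of sentence representations from two datasets."""
    X = np.vstack([reps_a, reps_b])
    labels = np.array([0] * len(reps_a) + [1] * len(reps_b))
    # > 0: the two datasets form separable clusters (behave like distinct domains);
    # < 0: most points sit closer to the other dataset's cluster.
    return silhouette_score(X, labels)
```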
  67. 92 (Same silhouette table as the previous slide.) Most of the scores are negative: a dataset is not a domain. Data-driven methods to define domains -- Aharoni and Goldberg.