
Applications of Divergence Measures for Domain Adaptation in NLP

wing.nus
July 31, 2022

Abstract:
Machine learning models that work under different conditions are important for their deployment. The absence of robustness to varying conditions has adverse effects in the real world: autonomous cars crashing, adverse legal decisions made against minorities, and systems that work effectively only for the world's major languages. The lack of robustness arises because machine learning models trained under one input distribution are not guaranteed to work under a different distribution. An important aspect of making them work under different distributions is measuring how different the two distributions are. The mathematical tool of domain divergence quantifies this difference. Applying divergence measures in novel ways is an important avenue for making machine learning models more useful.

In this talk, we explore the different applications of divergence measures, with a special interest in adapting NLP models to new inputs that arise naturally. We first identify the different divergence measures used within Natural Language Processing (NLP) and provide a taxonomy. Further, we identify applications of divergences and make contributions along them: 1) Making Decisions in the Wild -- helping practitioners predict the performance drop of a model under a new distribution; 2) Learning Representations -- aligning source and target domain representations for novel applications; 3) Inspecting the Internals of a Model -- understanding the inherent robustness of models under new distributions.

This talk presents a brief overview of two of these applications. For the first, we performed a large-scale correlational study of different divergence measures against the decrease in model performance. We compare whether divergence measures based on traditional word-level distributions are more reliable than those based on contextual word representations from pretrained language models, and based on our study, we make recommendations for the divergence measures that best predict performance drop. In the second application, we employ machine learning models that reduce the divergence between two domains, enabling us to generate sentences across domains. Further, we enhance the model to produce sentences that satisfy certain linguistic constraints, with downstream applications to domain adaptation in Natural Language Processing.

We also present ongoing work that applies divergence methods in a parameter-efficient manner for domain adaptation in NLP. Our method follows a two-step process: it first extracts domain-invariant representations by reducing divergence measures between the two domains and then reduces the task-specific loss on labeled data in the source domain; both steps are achieved using adapters. We conclude with some future directions for domain adaptation in NLP.

Bio:

Abhinav Ramesh Kashyap is a fourth-year Ph.D. student advised by Prof Min-Yen Kan. His research is on Natural Language Processing; specifically, he focuses on making NLP models robust under different domains, also called Domain Adaptation. He is also interested in Scholarly Document Processing, which extracts information from scholarly articles.

Seminar page: https://wing-nus.github.io/nlp-seminar/speaker-abhinav
YouTube Video recording: https://youtu.be/ycdG5bozFT0

Transcript

  1. Applications of Divergence Measures for Domain Adaptation in NLP Abhinav

    Ramesh Kashyap School of Computing, National University of Singapore
  2. 11 Train ~ PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". Test ~ PS(x, y): "No Problems. Smooth running", "Sunglasses are great", "Perfect fit", "Horrible boots".
  3. 12 Train ~ PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". Test ~ PS(x, y): "No Problems. Smooth running", "Sunglasses are great", "Perfect fit", "Horrible boots".
  4. 13 Train ~ PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". Test ~ PT(x, y): "Absolute bad!!", "Awesomeee sunglasses!!", "Its absolute whack", "No problem boots y'all".
  5. 14 PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". PT(x, y): "Awesomeee sunglasses!!", "Its absolute whack", "No problem boots y'all", "Absolute Crap!!". DOMAIN SHIFT
  6. 15 PS(x, y): "No Problems. Everything ran smoothly", "These are perfect for my height", "Bad boots. Cannot wear outside", "Great sunglasses. I like the quality". PT(x, y): "Crap!!", "Awesomeee sunglasses!!", "Its absolute whack", "No problem boots y'all". DOMAIN SHIFT -- HOW MUCH IS THE DOMAIN SHIFT? DIVERGENCE MEASURES: ℋ-Divergence, CORAL, CMD, Wasserstein, PAD.
  7. 16 LEARNING TRANSFERABLE FEATURES [architecture diagram: Conv1-Conv4 layers (frozen, then fine-tuned) feeding a classifier (CLF), with a divergence term between 𝒮 and 𝒯 at the higher layers]. Long et al., 2015 -- Learning Transferable Features with Deep Adaptation Networks, ICML.
  8. 17 DATA SELECTION: How likely is a sentence t ∈ 𝒯 according to a LM trained on 𝒮, compared to a LM trained on 𝒯? (Labelled data: 𝒮; unlabelled data: 𝒯.) This is the most popular approach in Machine Translation.
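A minimal sketch of this data-selection idea (a Moore-Lewis-style cross-entropy difference), assuming two causal language models stand in for "LM trained on 𝒮" and "LM trained on 𝒯"; the model names, candidate sentences, and selection rule below are placeholders, not the setup from the talk.

```python
# Sketch: score candidate sentences by how much more likely they are under an
# in-domain (target) LM than under a general/source LM; keep the lowest scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_nll(model, tokenizer, sentence):
    """Mean per-token negative log-likelihood of a sentence under a causal LM."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

# Placeholder checkpoints; in practice these would be LMs trained/adapted on S and T.
tok = AutoTokenizer.from_pretrained("gpt2")
lm_src = AutoModelForCausalLM.from_pretrained("gpt2")
lm_trg = AutoModelForCausalLM.from_pretrained("gpt2-medium")

def cross_entropy_difference(sentence):
    # Lower score => the sentence looks more like the target domain than the source.
    return sentence_nll(lm_trg, tok, sentence) - sentence_nll(lm_src, tok, sentence)

candidates = ["Awesomeee sunglasses!!", "No Problems. Everything ran smoothly"]
selected = sorted(candidates, key=cross_entropy_difference)[:1]
```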
  9. 21 Application of Divergence Measures: (1) Making Decisions in the Wild -- Which Divergence Measure Best Predicts Performance Drops? (Kashyap, A.R. et al., NAACL'21); (2) Learning Representations -- How Can We Generate Sentences Across Domains? (Kashyap, A.R. et al., ACL'22); (3) Inspecting Internals of the Model -- How Robust are Representations from Large-Scale Language Models? (Kashyap, A.R. et al., EACL'21); (4) How Can We Make Domain Adaptation More Efficient? (Under Review, EMNLP'22).
  10. 23 TAXONOMY: Geometric (P-norm: Euclidean, Manhattan; Cosine; F), Information-Theoretic (KL Div, JS Div, Power, Renyi, CE), Higher Order (Optimal Transport: Wasserstein; Moment Matching: MMD, CMD; Domain Discriminator: CORAL, PAD).
  11. 28 TAXONOMY: Geometric -- calculates the distance between vectors in a metric space. Information-Theoretic -- principles from information theory to quantify the difference in distributions. Higher Order -- statistical measures that consider different-order moments of random variables (Moment Matching: MMD, CMD; Domain Discriminator: CORAL, PAD); e.g., how different are the means (1st order) of P and Q, how different are the covariances (2nd order) of P and Q.
  12. 29 TAXONOMY: Geometric -- calculates the distance between vectors in a metric space (P-norm: Euclidean, Manhattan; Cosine; F). Information-Theoretic -- principles from information theory to quantify the difference in distributions (KL Div, JS Div, Power, Renyi, CE). Higher Order -- statistical measures that consider different-order moments of random variables (Optimal Transport: Wasserstein; Moment Matching: MMD, CMD; Domain Discriminator: CORAL, PAD).
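For concreteness, here is an illustrative sketch of one measure from each branch of the taxonomy, computed over two sets of sentence representations or term counts (Geometric: cosine distance; Information-theoretic: JS divergence; Higher-order: a CMD-style moment match). The helper names and the choice of five moments are assumptions for illustration, not the exact configurations from the papers.

```python
# One illustrative measure per taxonomy branch.
import numpy as np
from scipy.spatial.distance import jensenshannon

def cosine_distance(p_mean, q_mean):
    """Geometric: cosine distance between the mean representations of two domains."""
    cos = np.dot(p_mean, q_mean) / (np.linalg.norm(p_mean) * np.linalg.norm(q_mean))
    return 1.0 - cos

def js_divergence(p_counts, q_counts):
    """Information-theoretic: JS divergence between unigram (term) distributions."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return jensenshannon(p, q) ** 2  # scipy returns the JS *distance* (sqrt of the divergence)

def cmd(source_feats, target_feats, n_moments=5):
    """Higher-order: CMD-style matching of the first k moments of the feature distributions."""
    sm, tm = source_feats.mean(0), target_feats.mean(0)
    d = np.linalg.norm(sm - tm)                 # 1st moment: means
    cs, ct = source_feats - sm, target_feats - tm
    for k in range(2, n_moments + 1):           # higher central moments
        d += np.linalg.norm((cs ** k).mean(0) - (ct ** k).mean(0))
    return d
```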
  13. 30 Application of Divergence Measures: (1) Making Decisions in the Wild -- Which Divergence Measure Best Predicts Performance Drops? (Kashyap, A.R. et al., NAACL'21); (2) Learning Representations -- How Can We Generate Sentences Across Domains? (Kashyap, A.R. et al., ACL'22); (3) Inspecting Internals of the Model -- How Robust are Representations from Large-Scale Language Models? (Kashyap, A.R. et al., EACL'21); (4) How Can We Make Domain Adaptation More Efficient?
  14. 33 DECISION IN THE WILD: PT(x, y) -- let's annotate more data, and repeat. Expensive and time consuming.
  15. 34 DECISION IN THE WILD: PT(x, y) -- let's annotate more data, and repeat. Expensive, time consuming, and can continue indefinitely.
  16. 35 DECISION IN THE WILD: Given PS(X, Y) and data from another domain PT(x, y), instead of annotating more data, can we predict the drop in performance of a model? Divergence measures are applied to predict the drop in performance: Information-Theoretic -- Renyi Div for POS (Van Asch and Daelemans, 2012), CE for dependency parsing (Ravi et al., 2008); Higher-Order -- PAD and others (Elsahar et al., 2020).
  17. Applications -- Decisions in the Wild (36). DECISION IN THE WILD: Given PS(X, Y) and data from another domain PT(x, y), instead of annotating more data, can we predict the drop in performance our model experiences? Divergence measures are applied to predict the drop in performance: Information-Theoretic -- Renyi Div for POS (Van Asch and Daelemans, 2012), CE for dependency parsing (Ravi et al., 2008); Higher-Order -- PAD and others (Elsahar et al., 2020). There are so many divergence measures. Which divergence measure best predicts the drop in performance? Does using Contextual Word Representations to calculate them have advantages?
  18. 37 EMPIRICAL STUDY. Datasets: POS -- 5 corpora from the English Web Treebank; NER -- 8 different corpora; Sentiment Analysis -- Amazon Review Dataset with 5 categories. Divergence measures: Geometric -- Cos; Information-Theoretic -- KL-Div, JS-Div, Renyi-Div; Higher-Order -- PAD, Wasserstein, MMD (with different kernels), CORAL. Method: fine-tune a DistilBERT model on the source domain 𝒮; Performance Drop = Accuracy on test data of 𝒮 - Accuracy on test data of 𝒯; correlate the performance drop with the divergence between domains.
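A minimal sketch of the correlation step described on this slide, assuming the per-domain-pair accuracies and divergence values have already been computed; the record values and the use of Pearson correlation here are illustrative placeholders.

```python
# Correlate performance drop with divergence across source->target domain pairs.
from scipy.stats import pearsonr

# One entry per (source, target) pair -- hypothetical numbers for illustration only.
records = [
    {"pair": ("books", "electronics"), "acc_src_test": 0.91, "acc_trg_test": 0.84, "divergence": 0.12},
    {"pair": ("books", "kitchen"),     "acc_src_test": 0.91, "acc_trg_test": 0.79, "divergence": 0.21},
    {"pair": ("dvd", "electronics"),   "acc_src_test": 0.89, "acc_trg_test": 0.80, "divergence": 0.17},
]

drops = [r["acc_src_test"] - r["acc_trg_test"] for r in records]
divs = [r["divergence"] for r in records]

corr, p_value = pearsonr(divs, drops)
print(f"correlation between divergence and performance drop: {corr:.3f} (p={p_value:.3f})")
```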
  19. 38 RESULTS. Better correlation = better indicator of performance drop.
    Divergence/Task   POS     NER     SA
    Cos               0.018   0.223   -0.012
    KL-Div            0.394   0.384   0.715
    JS-Div            0.407   0.484   0.709
    Renyi Div         0.392   0.382   0.716
    PAD               0.477   0.426   0.538
    Wasserstein       0.378   0.463   0.448
    MMD-RQ            0.248   0.495   0.614
    MMD-Gaussian      0.402   0.221   0.543
    MMD-Energy        0.244   0.447   0.521
    MMD-Laplacian     0.389   0.273   0.623
    CORAL             0.349   0.484   0.267
    No single measure is the best indicator of drop in performance.
  20. 39 RESULTS (same correlation table as the previous slide; better correlation = better indicator of performance drop). No single measure is the best indicator of drop in performance. PAD is a reliable metric across different tasks. JS-Div consistently provides good correlations across different tasks.
  21. Empirical Analysis (40) (same correlation table; better correlation = better indicator of performance drop). No single measure is the best indicator of drop in performance. PAD is a reliable metric across different tasks. JS-Div consistently provides good correlations across different tasks. Compared to measures calculated from Contextual Word Representations, simple measures using frequency-based distributions are still reliable indicators of performance drops.
  22. 41 WHY ARE SOME MEASURES BETTER THAN OTHERS? When there is clear separation, there is a good indication that divergence correlates well with performance drop (Corr = 0.716). (Same correlation table as above.)
  23. 42 CONCLUSION / SUGGESTIONS. Divergence measures are a primary tool in domain adaptation. They can be categorised as Geometric, Information-Theoretic, and Higher-Order. Which measure should you use for predicting the drop in performance of the model? PAD is a reliable indicator across all tasks; JS-Div is easy to compute and based on simple word distributions -- use it. Abhinav Ramesh Kashyap (abhinavkashyap.io), Devamanyu Hazarika (devamanyu.com), Min-Yen Kan (www.comp.nus.edu.sg/~kanmy), Roger Zimmermann (www.comp.nus.edu.sg/~rogerz).
  24. 43 Application of Divergence Measures: (1) Making Decisions in the Wild -- Which Divergence Measure Best Predicts Performance Drops? (Kashyap, A.R. et al., NAACL'21); (2) Learning Representations -- How Can We Generate Sentences Across Domains? (Kashyap, A.R. et al., ACL'22); (3) Inspecting Internals of the Model -- How Robust are Representations from Large-Scale Language Models? (Kashyap, A.R. et al., EACL'21); (4) How Can We Make Domain Adaptation More Efficient?
  25. So Different Yet So Alike! Constrained Unsupervised Text Style Transfer

    Abhinav Ramesh Kashyap*, Devamanyu Hazarika*, Min-Yen Kan, Roger Zimmermann, Soujanya Poria
  26. 46 Source Domain 𝒮 Just open the door! Could you

    please open the door for me Target Domain 𝒯
  27. 48 Definition of Text Style Transfer [1]: given a sentence x with attribute a (e.g., "Just open the door!", Informal), generate x′ with attribute a′ (e.g., "Could you please open the door", Formal), i.e., model p(x′ | x, a). Data-Driven Definition of Style: link an attribute to the corpus (Style 1 = 𝒮, Style 2 = 𝒯); the attribute can be topic, content, or meta information. [1]: Di Jin et al., Deep Learning for Text Style Transfer, Computational Linguistics Journal, 2022.
  28. 49 Supervised Method: requires parallel data ("Just open the door!" -> "Could you please open the door"); sequence-to-sequence neural network models; parallel data is hard to obtain and not scalable. Unsupervised Method: does not require parallel data; uses the data-driven definition of style; encoder-decoder with a latent space z; manipulate the latent space to disentangle content and style* (*major approaches disentangle style and content; there are other methods that do not disentangle).
  29. 50 DIALOG (Restaurant -> Movie): "I would like to book a restaurant for 2 at 7:00pm" -> "I would like to book a movie for 2 at 7:00pm". TWITTER (Social Media -> Newspaper): "Trump is ousted in 2022 elections" -> "Trump loses 2022 elections".
  30. 51 Intro. DIALOG (Restaurant -> Movie): "I would like to book a restaurant for 2 at 7:00pm" -> "I would like to book a movie for 2 at 7:00pm". TWITTER (Social Media -> Newspaper): "Trump is ousted in 2022 elections" -> "Trump loses 2022 elections". MAINTAINING CONSTRAINTS IS IMPORTANT BUT IGNORED.
  31. 52 "I really loved Murakami's book" (personal pronoun, proper noun) -- Text Style Transfer: "Loved the movie" vs. Text Style Transfer + Constraints: "I absolutely enjoyed Spielberg's direction".
  32. 53 Intro. Constraints need to be maintained after transfer: the personal pronoun "I" is maintained; the number of proper nouns is maintained (Murakami -> Spielberg). "I really loved Murakami's book" (personal pronoun, proper noun) -- Text Style Transfer: "Loved the movie" vs. Text Style Transfer + Constraints: "I absolutely enjoyed Spielberg's direction".
  33. 54 CONTRARAE (Intro). Constraints need to be maintained after transfer: the personal pronoun "I" is maintained; the number of proper nouns is maintained (Murakami -> Spielberg). Plain Text Style Transfer ("Loved the movie") gives a smaller length, no personal pronoun, and no proper noun; Text Style Transfer + Constraints ("I absolutely enjoyed Spielberg's direction") gives similar length, similar personal pronouns, and domain-appropriate proper nouns. [Architecture diagram: tied encoders encθ / encψ, decoders decϕ / decη, criticξ, latents z and z̃, inputs xsrc, xtgt and outputs x̂src, x̂trg, with losses ℒae, ℒcri, ℒadv, ℒcon, ℒclf.] We introduce a GAN-based seq2seq network that explicitly enforces such constraints, with two cooperative losses (the discriminator and the generator reduce the same loss). Contrastive Loss -- brings sentences with similar constraints closer together and pushes sentences with different constraints far away.
  34. 55 CONTRARAE (Intro). Constraints need to be maintained after transfer (as in the previous slide). We introduce a GAN-based seq2seq network that explicitly enforces such constraints, with two cooperative losses (the discriminator and the generator reduce the same loss): Contrastive Loss -- brings sentences with similar constraints closer together and pushes sentences with different constraints far away; Classifier Loss -- a discriminative classifier identifies the constraints from the latent space.
  35. 56 ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE) [components: encθ, decϕ, gψ, criticξ, z, z̃, x, s ∼ 𝒩]. The aim is to generate natural sentences: learn a representation space over a prior distribution (𝒩) that mimics the real distribution. Encoder -- encodes (real) sentences x: encθ : 𝒳 -> 𝒵, z ∼ Pz.
  36. 57 ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE). Encoder -- encodes (real) sentences x: encθ : 𝒳 -> 𝒵, z ∼ Pz. Decoder -- reconstructs sentences from the latent: pϕ(x | z).
  37. 58 ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE). Encoder -- encθ : 𝒳 -> 𝒵, z ∼ Pz. Decoder -- pϕ(x | z). Generator -- maps noise samples to a latent space: gψ : 𝒩(0, 1) -> 𝒵̃.
  38. 59 ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE). Encoder -- encθ : 𝒳 -> 𝒵, z ∼ Pz. Decoder -- pϕ(x | z). Generator -- gψ : 𝒩(0, 1) -> 𝒵̃. Critic -- distinguishes the real vs. generated representations: min_ψ max_ξ 𝔼_{z∼Pz}[crcξ(z)] - 𝔼_{z̃∼Pz̃}[crcξ(z̃)].
  39. 60 ARAE losses. Reconstruction loss: ℒae(θ, ϕ) = 𝔼_{z∼Pz}[-log pϕ(x | z)] -- loss to reconstruct sentences; encourages copying behaviour; maintains semantic similarity.
  40. 61 ARAE losses. Reconstruction loss ℒae(θ, ϕ) as before. Critic loss: ℒcrc(ξ) = -𝔼_{z∼Pz}[crcξ(z)] + 𝔼_{z̄∼Pz̄}[crcξ(z̄)] -- the critic should succeed in distinguishing real from fake.
  41. 62 ARAE losses. Reconstruction loss ℒae(θ, ϕ) and critic loss ℒcrc(ξ) as before. Adversarial loss: ℒadv(θ, ψ) = 𝔼_{z∼Pz}[crcξ(z)] - 𝔼_{z̄∼Pz̄}[crcξ(z̄)] -- the generator and the encoder should fool the critic.
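A condensed PyTorch sketch of how these three ARAE losses could be wired together for one training step. The module signatures (in particular `decoder(z, x)`), the absence of weight clipping or a gradient penalty, and the optimizer schedule are simplifications, not the original ARAE recipe.

```python
# Sketch of the three ARAE losses: reconstruction, critic, and adversarial.
import torch
import torch.nn.functional as F

def arae_losses(encoder, decoder, generator, critic, x, noise):
    z = encoder(x)                          # real latent: enc_theta(x)
    z_tilde = generator(noise)              # generated latent: g_psi(s), s ~ N(0, 1)

    # L_ae: reconstruct the sentence from the real latent (teacher forcing assumed in decoder)
    logits = decoder(z, x)                  # (batch, seq_len, vocab)
    loss_ae = F.cross_entropy(logits.reshape(-1, logits.size(-1)), x.reshape(-1))

    # L_crc: the critic should separate real latents from generated ones
    loss_critic = -critic(z.detach()).mean() + critic(z_tilde.detach()).mean()

    # L_adv: the encoder and generator try to fool the critic (slide's sign convention)
    loss_adv = critic(z).mean() - critic(z_tilde).mean()

    return loss_ae, loss_critic, loss_adv
```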
  42. 63 seq2seq ADVERSARIALLY REGULARIZED AUTOENCODER (ARAE) [architecture diagram: encoders encθ and encψ, latents z and z̃, criticξ, decoders decϕ and decη; labels: Target, 𝒮].
  43. 64 Method. seq2seq ARAE [same architecture diagram; labels: Target, 𝒯].
  44. 65 Method. seq2seq ARAE [same architecture diagram; labels: Target, 𝒮].
  45. 66 Method. seq2seq ARAE [same architecture diagram; labels: Target, 𝒯].
  46. 67 Method. seq2seq ARAE [same architecture diagram; labels: Target, 𝒮, 𝒯].
  47. 69 CONTRASTIVE LOSS ℒcon [architecture diagram: encθ -> z, criticξ, encψ -> z̃, decϕ, decη; Target 𝒯, Source 𝒮; tied; inputs xsrc, xtgt and outputs x̂src, x̂trg; losses ℒae, ℒcri, ℒadv]. ℒcon(θ, ψ, ξ) = -(1/|P|) Σ_{j=1}^{P} log( e^{zi·zj} / Σ_{k=1}^{B∖{i}} e^{zi·zk} ). Given a sentence s ∈ Src, mine P sentences each from Src and Trg; all other sentences in the batch are negatives; z are representations from the encoders or the last layer of the critic. We add it to both the encoder and the critic. Similar ideas in Kang et al., 2020 -- ContraGAN: Contrastive Learning for Conditional Image Generation, NeurIPS.
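An illustrative PyTorch version of this contrastive term. The batch layout (which indices count as positives with similar constraints) and the temperature-free dot-product similarity follow the formula on the slide rather than a particular released implementation.

```python
# Contrastive loss over latent codes: pull together sentences with similar
# constraints (positives), push apart all other sentences in the batch.
import torch

def contrastive_loss(z, positive_mask):
    """
    z: (B, d) latent codes from the encoder or the critic's last layer.
    positive_mask: (B, B) bool, True where sentence j has constraints similar to sentence i.
    """
    sim = z @ z.t()                                          # (B, B) dot-product similarities
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude z_i . z_i from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)          # avoid -inf * 0 on the diagonal
    pos = (positive_mask & ~self_mask).float()
    # Average log-probability of the positives for each anchor, then negate.
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1.0)
    return loss.mean()
```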
  48. 70 Classifier loss ℒclf [same seq2seq ARAE architecture diagram with losses ℒae, ℒcri, ℒadv, ℒcon, ℒclf]. ℒclf(θ, ϕ, ξ, δ) = -Σ_{c=1}^{|𝒞|} log( σ(lc)^{yc} (1 - σ(lc))^{1-yc} ), where |𝒞| is the number of constraints per sentence, lc the logit for class c, and σ(·) the sigmoid function. It might be hard to mine positive and negative instances, so we instead encourage the encoders and the critic to reduce a classification loss. Similar ideas in ACGAN (Odena et al., 2017 -- Conditional Image Synthesis with Auxiliary Classifier GANs, ICML).
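A short sketch of this constraint-classifier term, which amounts to a multi-label binary cross-entropy over the |𝒞| constraints; the constraint count, latent dimension, and linear head below are placeholders.

```python
# Multi-label constraint classifier on the latent space (one sigmoid per constraint).
import torch
import torch.nn as nn

NUM_CONSTRAINTS = 3          # e.g. length bucket, has-personal-pronoun, #proper-nouns bucket (illustrative)
LATENT_DIM = 128             # placeholder

constraint_head = nn.Linear(LATENT_DIM, NUM_CONSTRAINTS)
# Mean over batch and constraints of -[ y_c log sigma(l_c) + (1 - y_c) log(1 - sigma(l_c)) ].
bce = nn.BCEWithLogitsLoss()

def classifier_loss(z, constraint_labels):
    """z: (B, LATENT_DIM) latents; constraint_labels: (B, NUM_CONSTRAINTS) in {0, 1}."""
    logits = constraint_head(z)
    return bce(logits, constraint_labels.float())
```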
  49. 71 DATASETS: YELP -- business reviews labelled as either positive or negative; IMDB -- movie reviews labelled as either positive or negative; POLITICAL -- Facebook posts labelled with either a Republican or Democratic slant. METRICS: ACC -- how well the sentence adheres to the target domain; FL -- how fluent the sentence is; SIM -- how semantically similar the sentence is to the source domain; AGG -- joint metric at the instance level.
  50. 72 OVERALL RESULTS.
    Model (Sampling)                          | Yelp: ACC FL SIM AGG | IMDB: ACC FL SIM AGG | POLITICAL: ACC FL SIM AGG
    DRG (Greedy)                              | 67.4 54.5 43.6 16.7  | 56.5 44.3 54.1 14.4  | 61.3 35.7 38.7 8.8
    ARAE (Greedy)                             | 93.1 67.9 31.2 19.8  | 95.0 76.3 26.4 19.9  | 63.0 72.1 17.3 11.0
    ARAE_seq2seq +CLF +CONTRA (Greedy)        | 89.3 69.2 32.9 20.6  | 97.8 84.0 33.5 28.1  | 99.0 56.8 41.8 24.9
    ARAE_seq2seq +CLF +CONTRA (nucleus p=0.6) | 89.4 68.6 32.8 20.4  | 97.1 82.6 33.6 27.4  | 99.0 56.0 41.6 24.4
    Compared to DRG (Li et al.) and ARAE (Zhao et al.), our method has a better aggregate score on 3 different datasets. Regularizing the latent space brings advantages to the overall quality of generated sentences. Li et al., Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer, NAACL. Zhao et al., Adversarially Regularized Autoencoders, ICML.
  51. 73 REMOVING LOSS ON GENERATOR AND CRITIC (ARAE_seq2seq + CLF).
    Model                 ACC   FL    SIM   AGG
    ARAE_seq2seq + CLF    95.0  83.2  34.2  27.5
    -generator            96.2  87.2  31.3  26.7
    -critic               94.9  84.4  30.8  25.5
    Adding the CLF loss improves the overall AGG score; it mostly improves the SIM score.
  52. 74 REMOVING LOSS ON GENERATOR AND CRITIC.
    Model                    ACC   FL    SIM   AGG
    ARAE_seq2seq + CLF       95.0  83.2  34.2  27.5
    -generator               96.2  87.2  31.3  26.7
    -critic                  94.9  84.4  30.8  25.5
    Adding the CLF loss improves the overall AGG score; it mostly improves the SIM score.
    Model                    ACC   FL    SIM   AGG
    ARAE_seq2seq + CONTRA    96.1  80.6  36.0  28.6
    -generator               93.5  78.8  34.0  26.0
    -critic                  90.1  67.8  39.5  24.9
    Adding the CONTRA loss improves the overall AGG score; it improves the ACC and FL scores.
  53. 75 Intro. [Recurring DIALOG / TWITTER domain-transfer example.] CLF and CONTRA losses are complementary, and both are necessary on the CRITIC and the GENERATOR to improve the AGG score.
  54. 76 MAINTAINING CONSTRAINTS (LENGTH, DESCRIPTIVENESS, # DOMAIN-SPECIFIC ATTRS) [chart comparing DRG, ARAE, ARAE_seq2seq, ARAE_seq2seq +CLF, ARAE_seq2seq +CONTRA, ARAE_seq2seq +CLF +CONTRA]. Adding cooperative losses helps in maintaining constraints. LENGTH is an easier constraint to maintain for most methods; syntactic attributes like DESCRIPTIVENESS and # DOMAIN-SPECIFIC ATTRS are harder to maintain. Improvements in the AGG score do not mean constraints are maintained.
  55. 77 CONCLUSION. [Recurring DIALOG / TWITTER domain-transfer example.] Unsupervised style transfer methods do not define what constraints are maintained. We introduced two cooperative losses to ARAE to maintain constraints. We improve the general quality of transferring sentences between domains and, in addition, maintain the constraints between the domains in a better manner. Abhinav Ramesh Kashyap (abhinavkashyap.io), Devamanyu Hazarika (devamanyu.com), Min-Yen Kan (www.comp.nus.edu.sg/~kanmy), Roger Zimmermann (www.comp.nus.edu.sg/~rogerz), Soujanya Poria (sporia.info/).
  56. 78 Application of Divergence Measures: (1) Making Decisions in the Wild -- Which Divergence Measure Best Predicts Performance Drops? (Kashyap, A.R. et al., NAACL'21); (2) Learning Representations -- How Can We Generate Sentences Across Domains? (Kashyap, A.R. et al., ACL'22); (3) Inspecting Internals of the Model -- How Robust are Representations from Large-Scale Language Models? (Kashyap, A.R. et al., EACL'21); (4) How Can We Make Domain Adaptation More Efficient? (In progress.)
  57. 80 LEARNING DOMAIN INVARIANT REPRESENTATIONS. Labelled data (xs, ys)_{s=1}^{ns} ∼ 𝒟_𝒮; unlabelled data (xt)_{t=1}^{nt} ∼ 𝒟_𝒯. A feature extractor feeds a classifier trained on 𝒮. Domain-invariant representations have the same feature distribution irrespective of whether the data comes from the source or the target domain: D_𝒮(X) = D_𝒯(X).
  58. 81 LEARNING DOMAIN INVARIANT REPRESENTATIONS. With domain-invariant representations, the classifier also works on 𝒯 (Ben-David et al., 2010 -- A theory of learning from different domains).
  59. 82 LEARNING DOMAIN INVARIANT REPRESENTATIONS. Distribution alignment: the feature extractor is trained with the classifier on 𝒮 and an alignment objective so that D_𝒮(X) = D_𝒯(X).
  60. 83 LEARNING DOMAIN INVARIANT REPRESENTATIONS. [Same diagram: feature extractor, distribution alignment, classifier on 𝒮.]
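A hedged sketch of the generic objective behind these slides: a task loss on labelled source data plus a distribution-alignment penalty between source and target features. MMD with a single RBF bandwidth is used here as one concrete alignment choice; the feature extractor, weighting `lam`, and batching are placeholders.

```python
# One training step for learning domain-invariant representations:
# task loss on source labels + divergence between source and target features.
import torch
import torch.nn.functional as F

def gaussian_mmd(x, y, sigma=1.0):
    """A simple biased MMD estimate with an RBF kernel (single bandwidth, for illustration)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def uda_step(feature_extractor, classifier, x_src, y_src, x_trg, lam=0.1):
    h_src = feature_extractor(x_src)          # features of a labelled source batch
    h_trg = feature_extractor(x_trg)          # features of an unlabelled target batch

    task_loss = F.cross_entropy(classifier(h_src), y_src)   # supervised loss on S only
    align_loss = gaussian_mmd(h_src, h_trg)                  # push D_S(h) towards D_T(h)
    return task_loss + lam * align_loss
```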
  61. 84 Intro. [Recurring DIALOG / TWITTER domain-transfer example.] BUT ALL THE PARAMETERS OF THE FEATURE EXTRACTOR ARE UPDATED. CAN WE MAKE IT MORE EFFICIENT?
  62. 85 LEARNING DOMAIN INVARIANT REPRESENTATIONS (with adapters). Domain adapter (down- and up-projection with a residual): dom_l = W_up · f(W_down · h_l) + r_l. Per-layer divergence (reduce divergence between 𝒟_𝒮 and 𝒟_𝒯): Δ_l = div(h_l^src, h_l^trg). ℒ_div = Σ_{l=1}^{L} Δ_l.
  63. 86 STACKING DOMAIN AND TASK ADAPTERS. Task adapter stacked on the domain adapter: task_l = W_up · f(W_down · dom_l) + r_l. Task classifier: ℒ_task = softmax_ce(W_task · h_L).
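A sketch of the adapter pieces described on these two slides: a bottleneck adapter with a residual, a per-layer divergence term for the domain step, and a stacked task adapter plus classifier for the task step. The dimensions, the divergence function, how hidden states are obtained from the backbone, and the freezing schedule are assumptions, not the exact implementation of the work under review.

```python
# Bottleneck adapters for the two-step recipe: (1) domain adapter trained with a
# divergence loss between source/target hidden states, (2) task adapter stacked on top.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """dom_l / task_l = W_up . f(W_down . h_l) + residual r_l"""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return self.up(F.relu(self.down(h))) + h   # residual connection

domain_adapter = Adapter()
task_adapter = Adapter()
task_head = nn.Linear(768, 2)                      # W_task (binary task as a placeholder)

def domain_step(h_src_layers, h_trg_layers, divergence):
    # L_div = sum over layers of div(h_l^src, h_l^trg), computed after the domain adapter.
    return sum(divergence(domain_adapter(hs), domain_adapter(ht))
               for hs, ht in zip(h_src_layers, h_trg_layers))

def task_step(h_src_last, y_src):
    # Stack the (assumed frozen) domain adapter with the task adapter, then classify.
    h = task_adapter(domain_adapter(h_src_last))
    return F.cross_entropy(task_head(h), y_src)    # softmax cross-entropy with W_task
```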
  64. 87 HIGHLIGHTS. [Recurring DIALOG / TWITTER domain-transfer example.] We perform better than just learning a task adapter. We perform close to fully fine-tuned UDA methods, at only a fraction of the cost. Our task adapters are reusable between different domains. Including adapters only in certain layers brings further savings.
  65. 89 Future Work. CONTINUOUS ADAPTATION TO NEW DOMAINS -- make domain adaptation work for a stream of domains; re-use / compose information learnt from previous domains to learn on new domains. DOMAIN GENERALIZATION -- assume no access to target-domain data for domain adaptation.
  66. 91 Why are some measures better at predicting the drop in performance of the model? Do they capture the underlying domains in a better manner? Are different datasets different domains? We initially assume dataset-is-domain and calculate the Silhouette Coefficient: if the datasets are domains, there will be clear clusters in representation space; a positive score indicates good separation of clusters, while a negative score indicates that the majority of points in a cluster should have belonged to the other cluster.
    Divergence/Task   POS          NER          SA
    Cos               -1.78x10^-1  -2.49x10^-1  -2.01x10^-1
    KL-Div            -            -            -
    JS-Div            -8.5x10^-2   -6.4x10^-2   2.04x10^-2
    Renyi Div         -            -            -
    PAD               -            -            -
    Wasserstein       -2.11x10^-1  -2.36x10^-1  -1.70x10^-1
    MMD-RQ            -4.11x10^-2  -3.04x10^-2  -1.70x10^-2
    MMD-Gaussian      4.25x10^-5   2.37x10^-3   -8.45x10^-5
    MMD-Energy        -9.84x10^-2  -1.14x10^-1  -8.48x10^-2
    MMD-Laplacian     -1.67x10^-3  4.26x10^-4   -1.08x10^-3
    CORAL             -2.34x10^-1  -2.78x10^-1  -1.41x10^-1
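A brief sketch of this silhouette check, assuming sentence representations for two datasets are available as arrays; sklearn's silhouette_score is used as the concrete implementation, and the representation source is a placeholder.

```python
# Treat each dataset as a cluster and ask whether the representations actually separate.
import numpy as np
from sklearn.metrics import silhouette_score

def dataset_is_domain_score(reps_a, reps_b):
    """reps_a, reps_b: (n, d) arrays of sentence representations from two datasets."""
    X = np.vstack([reps_a, reps_b])
    labels = np.array([0] * len(reps_a) + [1] * len(reps_b))
    # > 0: the two datasets form separable clusters (behave like distinct domains);
    # < 0: most points sit closer to the other dataset's cluster.
    return silhouette_score(X, labels)
```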
  67. 92 (Same silhouette table as the previous slide.) Most of the scores are negative: a dataset is not a domain. Data-driven methods to define domains -- Aharoni and Goldberg.