information, so this work uses a transformer-based approach to overcome this. Two separate transformer encoders are used, one for style and one for content. The decoder uses a transformer to stylize the content.

Related Works
Gatys (2016) used a CNN to extract features from content and style images and generated stylized images through iterative optimization. Huang (2017) proposed adaptive instance normalization (AdaIN), which stylizes an image by replacing the mean and variance of the content image features with those of the style exemplar. Chen (2021) proposed the Internal-External Style Transfer algorithm, optimized with two types of contrastive loss to improve result quality.

Proposed Methodology
The content image $I_c$ and the style image $I_s$ are split into patches, and a linear layer produces embeddings $\varepsilon$ of size $L \times C$, where $L = \frac{H \times W}{m \times m}$ for patch size $m = 8$ and $C$ is the embedding dimension. The attention score between the $i$-th and $j$-th patches is calculated as

$$A_{i,j} = \left((\varepsilon_i + P_i) W_q\right)^T \left((\varepsilon_j + P_j) W_k\right),$$

where $P$ is the positional encoding and $W_q$ and $W_k$ are the query and key weights, respectively. The positional embeddings are formulated as

$$P_{CA}(x, y) = \sum_{k=0}^{s} \sum_{l=0}^{s} a_{kl} \, P_L(x_k, y_l), \qquad P_L = \mathcal{F}_{pos}\left(\mathrm{AvgPool}_{n \times n}(\varepsilon)\right),$$

where $\mathcal{F}_{pos}$ is a $1 \times 1$ convolution and $s$ is the number of neighboring patches.

StyTr2: Image Style Transfer with Transformers. Copyright © 2022 VIVEN Inc. All Rights Reserved

Thus, for the content image, $Z_c = \{\varepsilon_{c1} + P_{CA1}, \varepsilon_{c2} + P_{CA2}, \dots, \varepsilon_{cL} + P_{CAL}\}$, and the query, key and value are calculated as $Q = Z_c W_q$, $K = Z_c W_k$ and $V = Z_c W_v$. The multi-head attention is calculated as

$$\mathcal{F}_{MSA}(Q, K, V) = \mathrm{Concat}\left(\mathrm{Attention}_1(Q, K, V), \dots, \mathrm{Attention}_n(Q, K, V)\right) W_0,$$

with layer normalization applied after each block, followed by a skip connection. The style image is encoded in the same way, except that no positional encoding is added. Thus, the content encoder can be represented as

$$Y'_c = \mathcal{F}_{MSA}(Q, K, V) + Q, \qquad Y_c = \mathcal{F}_{FFN}(Y'_c) + Y'_c.$$
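The content encoder equations above can be traced with a small pure-Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the softmax with $\sqrt{d}$ scaling inside each head, the single linear layer plus ReLU standing in for $\mathcal{F}_{FFN}$, the random weights, and all function names are hypothetical choices, and layer normalization is left out for brevity. The input `z_c` stands for the patch embeddings with the positional encoding already added.

```python
import math
import random


def num_patches(height, width, patch=8):
    """L = (H * W) / (m * m) for non-overlapping m x m patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height * width) // (patch * patch)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum, used for the skip connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """One attention head (assumed: scaled dot product with softmax)."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def split_heads(x, heads):
    """Slice the channel dimension into equal parts, one per head."""
    width = len(x[0]) // heads
    return [[row[h * width:(h + 1) * width] for row in x] for h in range(heads)]


def msa(q, k, v, w_o, heads):
    """Concat(Attention_1, ..., Attention_n) followed by the output weights."""
    outs = [attention(qh, kh, vh) for qh, kh, vh in
            zip(split_heads(q, heads), split_heads(k, heads), split_heads(v, heads))]
    concat = [sum((head[i] for head in outs), []) for i in range(len(q))]
    return matmul(concat, w_o)


def encoder_layer(z, w_q, w_k, w_v, w_o, w_ffn, heads=2):
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    y_prime = add(msa(q, k, v, w_o, heads), q)  # Y'_c = F_MSA(Q, K, V) + Q
    # Assumed FFN: one linear layer followed by ReLU.
    ffn = [[max(0.0, t) for t in row] for row in matmul(y_prime, w_ffn)]
    return add(ffn, y_prime)                    # Y_c = F_FFN(Y'_c) + Y'_c


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


L, C = num_patches(32, 32), 4  # a 32 x 32 image gives 16 patches of 8 x 8
z_c = rand_matrix(L, C)        # stands in for the embeddings plus P_CA
weights = [rand_matrix(C, C) for _ in range(5)]  # W_q, W_k, W_v, W_0, FFN
y_c = encoder_layer(z_c, *weights)
print(L, len(y_c), len(y_c[0]))  # 16 16 4
```

The output keeps the $L \times C$ shape of the input sequence, which is what allows encoder layers to be stacked and the result to be passed on to the decoder.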
The decoder consists of two multi-head attention layers and one residual layer. For the content input $Y''_c = \{Y_{c1} + P_{CA1}, Y_{c2} + P_{CA2}, \dots, Y_{cL} + P_{CAL}\}$ and the style image $Y_s = \{Y_{s1}, Y_{s2}, \dots, Y_{sL}\}$, the query, key and value are calculated as $Q = Y''_c W_q$, $K = Y_s W_k$ and $V = Y_s W_v$. It can be formulated as $X'' = \mathcal{F}_{MSA}(Q, K, V) + Q$, 𝑋′ = ℱ𝑀𝑆𝐴 (𝑋′′ +