because the degree of dependence between them is determined by the strength of the bias in the feedback data, and blindly optimizing it may lead to bad results. Inspired by information theory, we can then derive the required objective function to be minimized from the above analysis,

$$\mathcal{L}_{\mathrm{DIB}} := \min \; \underbrace{\beta\, I(z;x)}_{1} - \underbrace{I(z;\tilde{y})}_{2} + \underbrace{\alpha\, I(z;r)}_{3} - \underbrace{\gamma\, I(r;\tilde{y})}_{4}, \tag{3}$$

where term 1 is a compression term that describes the mutual information between the variables x and the unbiased embedding z; term 2 is an accuracy term that describes the performance of the unbiased embedding z; term 3 is a de-confounder penalty term, which describes the degree of dependence between the biased embedding r and the unbiased embedding z; and term 4 is similar to term 2, and is used for the potential gain from the biased embedding r. Note that α, β and γ are the weight parameters. Since terms 1 and 2 in Eq.(3) are similar to a standard information bottleneck, we call the proposed method a debiased information bottleneck, or DIB for short. By optimizing L_DIB, we expect to obtain the desired biased and unbiased components, and can then prune the confounding bias.

5 A TRACTABLE OPTIMIZATION

Based on the chain rule of mutual information, we have the following equation for the de-confounder penalty term,

$$I(z;r) = I(z;\tilde{y}) - I(z;\tilde{y}\,|\,r) + I(z;r\,|\,\tilde{y}). \tag{4}$$

We further inspect the term I(z;ỹ|r)... wait, the term I(z;r|ỹ) in Eq.(4). Since the generation of z depends solely on the variables x, and x is given, we have H(z|ỹ,r) = H(z|ỹ), where H(·|·) denotes the conditional entropy [15, 26]. Combining the properties of mutual information and conditional entropy, we have,

$$I(z;r\,|\,\tilde{y}) = H(z\,|\,\tilde{y}) - H(z\,|\,\tilde{y},r) = H(z\,|\,\tilde{y}) - H(z\,|\,\tilde{y}) = 0. \tag{5}$$

By substituting Eq.(5) into Eq.(4), we have,

$$I(z;r) = I(z;\tilde{y}) - I(z;\tilde{y}\,|\,r). \tag{6}$$

Since the term I(z;ỹ|r) in Eq.(6) is still difficult to estimate, we use the form of conditional entropy to further simplify it,

$$I(z;r) = I(z;\tilde{y}) - H(\tilde{y}\,|\,r) + H(\tilde{y}\,|\,z,r). \tag{7}$$

By substituting Eq.(7) into Eq.(3), we can rewrite the objective as follows,

$$\begin{aligned}\mathcal{L}_{\mathrm{DIB}} &= \beta\, I(z;x) - I(z;\tilde{y}) + \alpha\left[I(z;\tilde{y}) - H(\tilde{y}\,|\,r) + H(\tilde{y}\,|\,z,r)\right] - \gamma\, I(r;\tilde{y}) \\ &= \beta\, I(z;x) - (1-\alpha)\, I(z;\tilde{y}) - \alpha\, H(\tilde{y}\,|\,r) + \alpha\, H(\tilde{y}\,|\,z,r) - \gamma\, I(r;\tilde{y}). \end{aligned} \tag{8}$$
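The identity in Eq.(7), which Eq.(8) builds on, holds exactly when z ⊥ r | ỹ, the conditional independence used in Eq.(5). The following minimal sketch (an illustration under that assumption, not part of the paper) builds a toy discrete joint distribution p(z, r, ỹ) with this property and checks Eq.(7) numerically:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a distribution given as an array."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)

# Toy discrete model satisfying z ⊥ r | ỹ (the condition behind Eq.(5)):
# p(z, r, ỹ) = p(ỹ) p(z|ỹ) p(r|ỹ), with |Z| = |R| = 3 and |Ỹ| = 2.
p_y = rng.dirichlet(np.ones(2))                    # p(ỹ)
p_z_given_y = rng.dirichlet(np.ones(3), size=2)    # p(z|ỹ), shape (2, 3)
p_r_given_y = rng.dirichlet(np.ones(3), size=2)    # p(r|ỹ), shape (2, 3)
p = np.einsum('y,yz,yr->zry', p_y, p_z_given_y, p_r_given_y)  # joint, axes (z, r, ỹ)

p_zr = p.sum(axis=2)                               # p(z, r)
p_zy = p.sum(axis=1)                               # p(z, ỹ)
p_ry = p.sum(axis=0)                               # p(r, ỹ)
p_z, p_r = p_zr.sum(axis=1), p_zr.sum(axis=0)

I_zr   = H(p_z) + H(p_r) - H(p_zr)                 # I(z;r)
I_zy   = H(p_z) + H(p_y) - H(p_zy)                 # I(z;ỹ)
H_y_r  = H(p_ry) - H(p_r)                          # H(ỹ|r)
H_y_zr = H(p) - H(p_zr)                            # H(ỹ|z,r)

# Both sides of Eq.(7) agree up to floating-point error.
print(I_zr, I_zy - H_y_r + H_y_zr)
```

If the factorization is replaced by an arbitrary joint distribution, the two printed values diverge, which is exactly why Eq.(5) is needed before Eq.(7) can be used.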
Since the mutual information terms in Eq.(8) are still intractable, we describe a variational approximation of each term using its relationship with the Kullback-Leibler (KL) divergence, which yields an upper bound on the compression term I(z;x) in Eq.(11) and a lower bound on the accuracy term I(z;ỹ) in Eq.(12). This inequality also applies to the mutual information I(r;ỹ) in Eq.(8). Combining Eq.(11) and Eq.(12), we can rewrite Eq.(8) as follows,

$$\begin{aligned}\mathcal{L}_{\mathrm{DIB}} &= \beta\, I(z;x) - (1-\alpha)\, I(z;\tilde{y}) - \alpha\, H(\tilde{y}\,|\,r) + \alpha\, H(\tilde{y}\,|\,z,r) - \gamma\, I(r;\tilde{y}) \\ &\le \beta\, \lVert\mu(x)\rVert_2^2 + (1-\alpha)\, H(\tilde{y}\,|\,z) + (\gamma-\alpha)\, H(\tilde{y}\,|\,r) + \alpha\, H(\tilde{y}\,|\,z,r), \end{aligned} \tag{13}$$

where the entropy H(ỹ) of the observed feedback is a constant and has been dropped. Finally, we get a tractable solution for L_DIB,

$$\hat{\mathcal{L}}_{\mathrm{DIB}} = \underbrace{(1-\alpha)\, H(\tilde{y}\,|\,z)}_{(a)} + \underbrace{(\gamma-\alpha)\, H(\tilde{y}\,|\,r)}_{(b)} + \underbrace{\alpha\, H(\tilde{y}\,|\,z,r)}_{(c)} + \underbrace{\beta\, \lVert\mu(x)\rVert_2^2}_{(d)}, \tag{14}$$

where 0 < α < γ < 1. Let ŷ_r, ŷ_z, and ŷ_{z,c} be the predicted labels generated by the biased component r, the unbiased component z, and the biased embedding vector [z,r], as shown in Figure 2(a); the conditional entropy terms (a), (b), and (c) can then be estimated by the cross-entropy between the observed feedback ỹ and ŷ_z, ŷ_r, and ŷ_{z,c}, respectively.
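To make Eq.(14) concrete, the sketch below shows one possible implementation that estimates each conditional entropy by a cross-entropy loss. It is a minimal illustration under stated assumptions, not the paper's reference implementation: the function name dib_loss, the use of PyTorch, and the binary-feedback setting are our own choices.

```python
import torch
import torch.nn.functional as F

def dib_loss(logits_z, logits_r, logits_zc, y, mu_x, alpha, beta, gamma):
    """Hypothetical sketch of the tractable objective in Eq.(14).

    logits_z, logits_r, logits_zc: predictions ŷ_z, ŷ_r, ŷ_{z,c} (pre-sigmoid)
    y: observed binary feedback ỹ; mu_x: encoder mean µ(x), shape (batch, dim).
    """
    # Each conditional entropy is estimated by a cross-entropy between the
    # observed feedback ỹ and the corresponding predicted label.
    h_y_z  = F.binary_cross_entropy_with_logits(logits_z,  y)   # term (a): H(ỹ|z)
    h_y_r  = F.binary_cross_entropy_with_logits(logits_r,  y)   # term (b): H(ỹ|r)
    h_y_zc = F.binary_cross_entropy_with_logits(logits_zc, y)   # term (c): H(ỹ|z,r)
    reg = mu_x.pow(2).sum(dim=1).mean()                         # term (d): ‖µ(x)‖²₂
    # Weighted sum from Eq.(14); requires 0 < alpha < gamma < 1.
    return (1 - alpha) * h_y_z + (gamma - alpha) * h_y_r + alpha * h_y_zc + beta * reg
```

With 0 < α < γ < 1, all four coefficients are positive, so minimizing this loss simultaneously makes z, r, and [z,r] predictive of the observed feedback while compressing z, matching the roles of terms (a) through (d) above.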