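The variational autoencoder (VAE) is a powerful latent variable model for unsupervised representation learning. The learned latent variable can be used in downstream applications (such as classification, data generation, and out-of-distribution detection).
[Figure: VAE. The encoder (parameters $\phi$) maps data $x$ to the latent variable $z$, the decoder (parameters $\theta$) reconstructs $x$ from $z$, and the prior $p(z)$ is a standard Gaussian.]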
The VAE cannot perform well with insufficient data points since it depends on neural networks.
• To solve this, we focus on obtaining a task-invariant latent variable from multiple tasks.
[Figure: an encoder shared across multiple tasks, some with a lot of data points and some with insufficient data points, produces a task-invariant latent variable $z$ that is useful for the tasks with insufficient data points.]
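For multiple tasks, the conditional VAE (CVAE) is widely used; it tries to obtain a task-invariant latent variable.
[Figure: CVAE. The encoder (parameters $\phi$) maps data $x$ and the task index $s$ to the task-invariant latent variable $z$, the decoder (parameters $\theta$) reconstructs $x$ from $z$ and $s$, and the prior $p(z)$ is a standard Gaussian.]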
Although the CVAE can reduce the dependency of $z$ on $s$ to some extent, this dependency remains in many cases.
• The contributions of this study are as follows:
1. We investigate the cause of the task-dependency in the CVAE and reveal that the simple prior is one of the causes.
2. We introduce the optimal prior to reduce the task-dependency.
3. We theoretically and experimentally show that our learned representation works well on multiple tasks.
[Preliminaries] Reviewing CVAE
• The CVAE models a conditional probability of $x$ given $s$ as:
$$p_\theta(x \mid s) = \int p_\theta(x \mid z, s)\, p(z)\, dz = \mathbb{E}_{q_\phi(z \mid x, s)}\left[\frac{p_\theta(x \mid z, s)\, p(z)}{q_\phi(z \mid x, s)}\right]$$
where $p_\theta(x \mid z, s)$ is the decoder, $p(z)$ is the prior, and $q_\phi(z \mid x, s)$ is the encoder.
• The CVAE is trained by maximizing the ELBO, which is a lower bound of the log-likelihood:
$$\mathcal{F}_{\mathrm{CVAE}}(\theta, \phi) = \mathbb{E}_{p_D(x,s)\, q_\phi(z \mid x, s)}\left[\ln p_\theta(x \mid z, s)\right] - \mathbb{E}_{p_D(x,s)}\left[D_{\mathrm{KL}}(q_\phi(z \mid x, s)\,\|\,p(z))\right]$$
where $p_D(x, s)$ is the data distribution and the second expectation is denoted $\mathcal{R}(\phi)$.
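The two ELBO terms can be sketched for a single pair $(x, s)$. This is a minimal illustration, assuming a diagonal-Gaussian encoder and a unit-variance Gaussian decoder (neither choice is stated above); all function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    Assumption: the encoder q_phi(z | x, s) is a diagonal Gaussian.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def gaussian_log_likelihood(x, x_mean, sigma=1.0):
    """ln N(x | x_mean, sigma^2 I), a stand-in for ln p_theta(x | z, s)."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, x_mean))
    return -0.5 * sq / sigma ** 2 - 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)

def elbo_single_sample(x, x_mean, mu, log_var):
    """One-sample ELBO estimate: reconstruction term minus KL regularizer."""
    return gaussian_log_likelihood(x, x_mean) - kl_to_standard_normal(mu, log_var)
```

Here `x_mean` stands for the decoder output computed from one sample $z \sim q_\phi(z \mid x, s)$ and the task index $s$; averaging the KL term over the data gives $\mathcal{R}(\phi)$.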
To investigate the cause of the dependency of $z$ on $s$, we introduce the mutual information $I(S; Z)$, which measures the mutual dependence between two random variables.
• $I(S; Z)$ becomes large if $z$ depends on $s$.
• $I(S; Z)$ becomes small if $z$ does NOT depend on $s$.
[Figure: two diagrams of the entropies $H(S)$ and $H(Z)$, one for the dependent case and one for the independent case.]
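For intuition, the mutual information can be computed exactly when both variables are discrete. The sketch below assumes a small joint probability table `joint[s][z]`; the tables and the function name are illustrative only.

```python
import math

def mutual_information(joint):
    """I(S; Z) in nats for a joint table joint[s][z] = p(s, z)."""
    p_s = [sum(row) for row in joint]          # marginal over z
    p_z = [sum(col) for col in zip(*joint)]    # marginal over s
    mi = 0.0
    for s, row in enumerate(joint):
        for z, p in enumerate(row):
            if p > 0.0:  # 0 * log 0 is treated as 0
                mi += p * math.log(p / (p_s[s] * p_z[z]))
    return mi

# z carries no information about s: the joint factorizes.
independent = [[0.25, 0.25], [0.25, 0.25]]
# z determines s completely.
dependent = [[0.5, 0.0], [0.0, 0.5]]
```

The independent table gives zero mutual information, and the fully dependent table gives $\ln 2$, the entropy of a fair binary task index.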
The CVAE tries to minimize the mutual information $I(S; Z)$ by minimizing its upper bound $\mathcal{R}(\phi)$:
$$\mathcal{R}(\phi) \equiv \mathbb{E}_{p_D(x,s)}\left[D_{\mathrm{KL}}(q_\phi(z \mid x, s)\,\|\,p(z))\right] = I(S; Z) + D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z)) + \sum_{k=1}^{K} \pi_k\, I(X^{(k)}; Z^{(k)})$$
where $I(X^{(k)}; Z^{(k)})$ is the mutual information between $x$ and $z$ when $s = k$, $\pi_k = p(s = k)$, and $q_\phi(z) = \int q_\phi(z \mid x, s)\, p_D(x, s)\, dx$ (with the task index $s$ also marginalized out).
• However, $\mathcal{R}(\phi)$ is NOT a tight upper bound of $I(S; Z)$ since $D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z))$ usually gives a large value.
Proposed Method
• $\mathcal{R}(\phi)$ is NOT a tight upper bound of $I(S; Z)$ since $D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z))$ usually gives a large value.
• When $p(z) = q_\phi(z)$, $\mathcal{R}(\phi)$ becomes the tightest upper bound of $I(S; Z)$.
• That is, the simple prior $p(z)$ is one of the causes of the task-dependency, and $q_\phi(z)$ is the optimal prior to reduce it.
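The decomposition of $\mathcal{R}(\phi)$ into $I(S; Z)$, the KL term to the prior, and the per-task mutual informations, together with the effect of choosing $p(z) = q_\phi(z)$, can be checked numerically. The sketch below assumes a toy setting in which $x$, $s$, and $z$ are all discrete and the encoder is a random probability table; none of this comes from the original experiments, and every name is hypothetical.

```python
import math
import random

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def marginals(p_xs, enc, n_z):
    """Return pi_k = p(s=k), q(z|s), and the aggregated posterior q(z)."""
    n_s, n_x = len(p_xs), len(p_xs[0])
    pi = [sum(p_xs[s]) for s in range(n_s)]
    q_z_s = [[sum(enc[s][x][z] * p_xs[s][x] / pi[s] for x in range(n_x))
              for z in range(n_z)] for s in range(n_s)]
    q_z = [sum(pi[s] * q_z_s[s][z] for s in range(n_s)) for z in range(n_z)]
    return pi, q_z_s, q_z

def decomposition(p_xs, enc, prior):
    """Return (R, I(S;Z), KL(q(z)||p(z)), sum_k pi_k I(X^(k);Z^(k)))."""
    n_s, n_x = len(p_xs), len(p_xs[0])
    pi, q_z_s, q_z = marginals(p_xs, enc, len(prior))
    pairs = [(s, x) for s in range(n_s) for x in range(n_x)]
    r = sum(p_xs[s][x] * kl(enc[s][x], prior) for s, x in pairs)
    i_sz = sum(pi[s] * kl(q_z_s[s], q_z) for s in range(n_s))
    i_xz = sum(p_xs[s][x] * kl(enc[s][x], q_z_s[s]) for s, x in pairs)
    return r, i_sz, kl(q_z, prior), i_xz

rng = random.Random(0)
n_s, n_x, n_z = 2, 3, 4
flat = normalize([rng.random() + 0.1 for _ in range(n_s * n_x)])
p_xs = [flat[s * n_x:(s + 1) * n_x] for s in range(n_s)]   # p_D(x, s)
enc = [[normalize([rng.random() + 0.1 for _ in range(n_z)])
        for _ in range(n_x)] for _ in range(n_s)]          # q(z | x, s)

# Simple prior (uniform here, standing in for the standard Gaussian).
r, i_sz, kl_prior, i_xz = decomposition(p_xs, enc, [1.0 / n_z] * n_z)
# Optimal prior: the aggregated posterior itself.
_, _, q_z = marginals(p_xs, enc, n_z)
r_opt, i_sz_opt, kl_opt, _ = decomposition(p_xs, enc, q_z)
```

In this toy setting `r` equals `i_sz + kl_prior + i_xz` up to rounding, and switching the prior to `q_z` removes the KL term, so `r_opt` is a tighter upper bound on the same `i_sz`.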
The ELBO with this optimal prior, $\mathcal{F}_{\mathrm{Proposed}}(\theta, \phi)$, is always larger than or equal to the original ELBO $\mathcal{F}_{\mathrm{CVAE}}(\theta, \phi)$:
$$\mathcal{F}_{\mathrm{Proposed}}(\theta, \phi) = \mathcal{F}_{\mathrm{CVAE}}(\theta, \phi) + D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z)) \geq \mathcal{F}_{\mathrm{CVAE}}(\theta, \phi)$$
• That is, $\mathcal{F}_{\mathrm{Proposed}}(\theta, \phi)$ is also a better lower bound of the log-likelihood than $\mathcal{F}_{\mathrm{CVAE}}(\theta, \phi)$.
• This contributes to obtaining a better representation and improved performance on the target tasks.
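The relation between the two ELBOs follows in one step from the definition of the aggregated posterior $q_\phi(z)$. A short derivation, writing $\mathcal{R}(\phi)$ for the regularizer $\mathbb{E}_{p_D(x,s)}[D_{\mathrm{KL}}(q_\phi(z \mid x, s)\,\|\,p(z))]$ of the original ELBO and assuming that the proposed ELBO simply replaces $p(z)$ by $q_\phi(z)$ in that regularizer:

```latex
\begin{align*}
\mathbb{E}_{p_D(x,s)}\!\left[D_{\mathrm{KL}}(q_\phi(z \mid x,s)\,\|\,q_\phi(z))\right]
&= \mathbb{E}_{p_D(x,s)\,q_\phi(z \mid x,s)}\!\left[
   \ln\frac{q_\phi(z \mid x,s)}{p(z)} - \ln\frac{q_\phi(z)}{p(z)}\right] \\
&= \mathcal{R}(\phi) - \mathbb{E}_{q_\phi(z)}\!\left[\ln\frac{q_\phi(z)}{p(z)}\right] \\
&= \mathcal{R}(\phi) - D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z)).
\end{align*}
```

The second line uses the fact that averaging $q_\phi(z \mid x, s)$ over the data distribution gives $q_\phi(z)$. Since the reconstruction term is unchanged and the KL divergence is non-negative, subtracting this smaller regularizer yields an ELBO that is larger by exactly $D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z))$.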
• We optimize $\mathcal{F}_{\mathrm{Proposed}}(\theta, \phi) = \mathcal{F}_{\mathrm{CVAE}}(\theta, \phi) + D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z))$ by approximating the KL divergence $D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z))$:
$$D_{\mathrm{KL}}(q_\phi(z)\,\|\,p(z)) = \int q_\phi(z) \ln \frac{q_\phi(z)}{p(z)}\, dz$$
• We approximate $q_\phi(z) / p(z)$ by the density ratio trick, which can estimate the density ratio between two distributions using samples from both distributions (see Section 3.3).
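The density ratio trick can be illustrated on a toy problem: a probabilistic classifier trained to separate samples of the two distributions has, at its optimum and with balanced classes, a logit equal to the log density ratio. The sketch below is not the authors' implementation; it assumes one-dimensional Gaussians in place of $q_\phi(z)$ and $p(z)$ and a linear logistic classifier (which is exact for two unit-variance Gaussians). All names are hypothetical.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(pos, neg, lr=0.5, steps=300):
    """Full-batch gradient descent for the logit w * z + b.

    pos holds samples labeled 1 (drawn from q); neg holds samples
    labeled 0 (drawn from p).
    """
    w, b = 0.0, 0.0
    n = len(pos) + len(neg)
    for _ in range(steps):
        gw = gb = 0.0
        for z in pos:
            err = sigmoid(w * z + b) - 1.0
            gw += err * z
            gb += err
        for z in neg:
            err = sigmoid(w * z + b)
            gw += err * z
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

rng = random.Random(0)
q_samples = [rng.gauss(1.0, 1.0) for _ in range(2000)]  # stand-in for q_phi(z)
p_samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for p(z)

w, b = fit_logistic(q_samples, p_samples)
# The fitted logit approximates ln q(z)/p(z); averaging it over
# samples from q gives a Monte Carlo estimate of KL(q || p).
kl_estimate = sum(w * z + b for z in q_samples) / len(q_samples)
kl_exact = 0.5  # KL(N(1, 1) || N(0, 1)) = (1 - 0)^2 / 2
```

For these two Gaussians the true log ratio is $z - 0.5$, so the fitted `w` and `b` should land near 1 and -0.5 and `kl_estimate` near `kl_exact`, up to sampling noise. In a real model the linear classifier would presumably be replaced by a more flexible one.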
Our theoretical contributions are summarized as follows:
Theorem 1 shows:
• The simple prior is one of the causes of the task-dependency.
• $q_\phi(z)$ is the optimal prior to reduce the task-dependency.
Theorem 2 shows:
• $\mathcal{F}_{\mathrm{Proposed}}(\theta, \phi)$ gives a better lower bound of the log-likelihood, which enables us to obtain a better representation than the CVAE.
• We next evaluate our representation on various datasets.
On digit datasets, we conducted two-task experiments, which estimate the performance on the target tasks:
• The source task has a lot of training data points.
• The target task has only 100 training data points.
• Pairs are (USPS→MNIST), (MNIST→USPS), (SynthDigits→SVHN), and (SVHN→SynthDigits).
• On face datasets, we conducted a three-task experiment, which simultaneously evaluates the performance on each task using a single estimator.
Our contributions to the CVAE are summarized as follows:
Theorem 1 shows:
• The simple prior is one of the causes of the task-dependency.
• We propose the optimal prior to reduce the task-dependency.
Theorem 2 shows:
• Our approach gives a better lower bound of the log-likelihood, which enables us to obtain a better representation than the CVAE.
Experiments show:
• Our approach achieves better performance on various datasets.