

Jia-Jie Zhu
March 17, 2024

Emtiyaz Khan (RIKEN, Tokyo, Japan) The Bayesian Learning Rule

WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24

Berlin, Germany. March 11th - 15th, 2024


Transcript

  1. The Bayesian Learning Rule

    Mohammad Emtiyaz Khan, RIKEN Center for AI Project, Tokyo (http://emtiyaz.github.io). Summary of recent research at https://emtiyaz.github.io/papers/symposium_2023.pdf. Slides available at https://emtiyaz.github.io/. Presentation at the Optimal Transport Workshop (Berlin), Mar 14 2024.
  2. Fail because too slow or too quick to adapt

    https://www.youtube.com/watch?v=TxobtWAFh8o (the video is from 2017)
  3. Adaptation in Machine Learning

    • Even a small change may need retraining
    • A huge amount of resources is required, which only a few can afford (costly & unsustainable) [1, 2, 3]
    • Difficult to apply in “dynamic” settings (robotics, medicine, epidemiology, climate science, etc.)
    • Our goal is to solve such challenges
      – Help in building safe and trustworthy AI
      – Reduce “magic” in deep learning (DL)

    [1] Diethe et al. Continual learning in practice, arXiv, 2019.
    [2] Paleyes et al. Challenges in deploying machine learning: a survey of case studies, arXiv, 2021.
    [3] https://www.youtube.com/watch?v=hx7BXih7zx8&t=897s
  4. Bayesian Learning Rule [1]

    • Bridge DL & Bayesian learning [2-5]
      – SOTA on GPT-2 and ImageNet [5]
    • Improve other aspects of DL [5-7]
      – Calibration, uncertainty, memory, etc.
      – Understand and fix model behavior
    • Towards human-like quick adaptation

    [1] Khan and Rue, The Bayesian Learning Rule, JMLR (2023).
    [2] Khan et al., Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
    [3] Osawa et al., Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
    [4] Lin et al., Handling the positive-definite constraints in the BLR, ICML (2020).
    [5] Shen et al., Variational Learning is Effective for Large Deep Networks, under review.
    [6] Daheim et al., Model merging by uncertainty-based gradient matching, ICLR (2024).
    [7] Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS (2023).
  5. GPT-2 with Bayesian Learning Rule

    Better performance & uncertainty at the same cost [3]. Trained on OpenWebText data (49.2B tokens). On 773M, we get a gain of 0.5 in perplexity; on 355M, a gain of 0.4. Method shown: BLR (IVON) [3].

    [1] Khan et al., Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
    [2] Osawa et al., Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
    [3] Shen et al., Variational Learning is Effective for Large Deep Networks, under review (2024).
  6. BLR for large deep networks

    RMSprop/Adam:
    $\hat{g} \leftarrow \hat{\nabla}\ell(\theta)$
    $\hat{h} \leftarrow \hat{g}^2$
    $h \leftarrow (1-\rho)h + \rho\hat{h}$
    $\theta \leftarrow \theta - \alpha(\hat{g} + \delta\theta)/(\sqrt{h} + \delta)$

    BLR variant, Improved Variational Online Newton (IVON):
    $\hat{g} \leftarrow \hat{\nabla}\ell(\theta)$ where $\theta \sim \mathcal{N}(m, \sigma^2)$
    $\hat{h} \leftarrow \hat{g}\cdot(\theta - m)/\sigma^2$
    $h \leftarrow (1-\rho)h + \rho\hat{h} + \rho^2(h - \hat{h})^2/(2(h + \delta))$
    $m \leftarrow m - \alpha(\hat{g} + \delta m)/(h + \delta)$
    $\sigma^2 \leftarrow 1/(N(h + \delta))$

    Only tune the initial value of $h$ (a scalar).

    [1] Khan et al., Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
    [2] Osawa et al., Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
    [3] Lin et al., Handling the positive-definite constraints in the BLR, ICML (2020).
    [4] Shen et al., Variational Learning is Effective for Large Deep Networks, under review (2024).
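
The RMSprop-style and IVON-style updates on this slide can be compared in a minimal sketch for a single scalar parameter. Everything specific here is an assumption made for illustration: the toy quadratic loss, the hyperparameter values (`alpha`, `rho`, `delta`, `n`), the initial `h = 1.0`, and the omission of momentum. It is not the authors' implementation.

```python
import math
import random


def grad(theta):
    """Gradient of the toy loss l(theta) = 0.5 * (theta - 3)**2 (illustrative assumption)."""
    return theta - 3.0


def rmsprop_step(theta, h, alpha=0.05, rho=0.1, delta=1e-3):
    """One RMSprop-style step on a scalar parameter."""
    g_hat = grad(theta)
    h_hat = g_hat ** 2                     # squared-gradient scale estimate
    h = (1.0 - rho) * h + rho * h_hat      # moving average of the scale
    theta = theta - alpha * (g_hat + delta * theta) / (math.sqrt(h) + delta)
    return theta, h


def ivon_step(m, h, rng, alpha=0.05, rho=0.1, delta=1e-3, n=100):
    """One IVON-style step: sample theta ~ N(m, sigma^2), then update h and m."""
    sigma2 = 1.0 / (n * (h + delta))       # variance implied by the current h
    theta = rng.gauss(m, math.sqrt(sigma2))
    g_hat = grad(theta)
    h_hat = g_hat * (theta - m) / sigma2   # curvature estimate from the same sample
    # The last term keeps h + delta positive even when h_hat is negative.
    h = (1.0 - rho) * h + rho * h_hat + 0.5 * rho ** 2 * (h - h_hat) ** 2 / (h + delta)
    m = m - alpha * (g_hat + delta * m) / (h + delta)
    return m, h


def run(steps=3000, seed=0):
    """Run both optimizers from the same start and return their final states."""
    rng = random.Random(seed)
    theta, h_rms = 0.0, 1.0
    m, h_ivon = 0.0, 1.0                   # the initial h is the scalar to tune
    for _ in range(steps):
        theta, h_rms = rmsprop_step(theta, h_rms)
        m, h_ivon = ivon_step(m, h_ivon, rng)
    return theta, m, h_ivon


if __name__ == "__main__":
    theta, m, h_ivon = run()
    print(f"RMSprop theta = {theta:.3f}, IVON mean = {m:.3f}, IVON h = {h_ivon:.3f}")
```

The structural differences are visible side by side: IVON evaluates the gradient at a sampled `theta`, estimates curvature from that same sample instead of squaring the gradient, and divides by `h` rather than its square root.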
  7. Exponential Family

    $q(\theta) \propto \exp\left[\lambda^\top T(\theta)\right]$
    Natural parameters: $\lambda$
    Sufficient statistics: $T(\theta)$
    Expectation parameters: $\mu := \mathbb{E}_q[T(\theta)]$

    Gaussian distribution: $q(\theta) := \mathcal{N}(\theta \mid m, S^{-1})$
    $\mathcal{N}(\theta \mid m, S^{-1}) \propto \exp\left[-\tfrac{1}{2}(\theta - m)^\top S(\theta - m)\right] \propto \exp\left[(Sm)^\top\theta + \mathrm{Tr}\left(-\tfrac{S}{2}\theta\theta^\top\right)\right]$
    Natural parameters: $\lambda := \{Sm, -S/2\}$
    Expectation parameters: $\mu := \{\mathbb{E}_q(\theta), \mathbb{E}_q(\theta\theta^\top)\}$

    [1] Wainwright and Jordan, Graphical Models, Exponential Families, and Variational Inference, 2008.
    [2] Malagò et al., Towards the Geometry of Estimation of Distribution Algorithms based on the Exponential Family, FOGA, 2011.
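
The two parameterizations of the Gaussian can be made concrete with a small sketch for the one-dimensional case, where `m` is the mean and `s` the precision. The function names are hypothetical; the second expectation parameter uses the standard identity E[θ²] = 1/s + m².

```python
def gaussian_to_natural(m, s):
    """Natural parameters {S m, -S/2} of N(m, 1/s) in one dimension."""
    return (s * m, -0.5 * s)


def gaussian_to_expectation(m, s):
    """Expectation parameters {E[theta], E[theta^2]} of N(m, 1/s)."""
    return (m, 1.0 / s + m * m)


def natural_to_gaussian(lam):
    """Recover (mean, precision) from natural parameters."""
    lam1, lam2 = lam
    s = -2.0 * lam2
    return (lam1 / s, s)


def expectation_to_gaussian(mu):
    """Recover (mean, precision) from expectation parameters."""
    mu1, mu2 = mu
    return (mu1, 1.0 / (mu2 - mu1 * mu1))


if __name__ == "__main__":
    lam = gaussian_to_natural(2.0, 4.0)
    mu = gaussian_to_expectation(2.0, 4.0)
    print(lam, mu)
    print(natural_to_gaussian(lam), expectation_to_gaussian(mu))
```

Both maps are invertible, so either set of parameters identifies the same distribution; the later slides switch between them freely.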
  8. Bayes and Conjugate Computations [1]

    Bayes rule: $\text{posterior} \propto \text{lik} \times \text{prior}$

    Multiplication of distributions = addition of (natural) params:
    $e^{\lambda_{\text{post}}^\top T(\theta)} \propto e^{\lambda_{\text{lik}}^\top T(\theta)} \times e^{\lambda_{\text{prior}}^\top T(\theta)}$
    $\text{log-posterior} = \text{log-lik} + \text{log-prior}$ (up to a constant)
    $\lambda_{\text{post}} = \lambda_{\text{lik}} + \lambda_{\text{prior}}$

    This idea can be generalized through natural gradients:
    $\lambda_{\text{post}} = \nabla_\mu \mathbb{E}_q[\text{log-lik} + \text{log-prior}]$
    The left-hand side is the posterior “approximation”; the right-hand side is a natural gradient.

    [1] Khan and Lin, Conjugate computation variational inference, AISTATS, 2017.
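
The rule "multiplication of distributions = addition of natural parameters" can be sketched for a one-dimensional Gaussian prior and Gaussian observations with known noise precision. The model and the function names are assumptions chosen for illustration; the natural parameters are {precision × mean, −precision / 2}.

```python
def prior_natural(m0, s0):
    """Natural parameters of the prior N(m0, 1/s0)."""
    return (s0 * m0, -0.5 * s0)


def likelihood_natural(y, s_y):
    """Natural parameters of N(y | theta, 1/s_y) viewed as a function of theta."""
    return (s_y * y, -0.5 * s_y)


def add(a, b):
    """Componentwise sum of two natural-parameter tuples."""
    return (a[0] + b[0], a[1] + b[1])


def posterior(m0, s0, ys, s_y):
    """Posterior (mean, precision) obtained purely by adding natural parameters."""
    lam = prior_natural(m0, s0)
    for y in ys:
        lam = add(lam, likelihood_natural(y, s_y))
    precision = -2.0 * lam[1]
    return (lam[0] / precision, precision)


if __name__ == "__main__":
    # Prior N(0, 1), one observation y = 2 with noise precision 3.
    print(posterior(0.0, 1.0, [2.0], 3.0))
```

For the example in the `__main__` block, the textbook formulas give precision 1 + 3 = 4 and mean (1·0 + 3·2)/4 = 1.5, which is what the additions produce.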
  9. Bayes Rule as (Natural) Gradient Descent

    Expected log-lik and log-prior are linear in $\mu$ [1]:
    $\mathbb{E}_q[\text{log-lik}] = \lambda_{\text{lik}}^\top \mathbb{E}_q[T(\theta)] = \lambda_{\text{lik}}^\top \mu$

    The gradient with respect to $\mu$ is simply the natural parameter:
    $\nabla_\mu \mathbb{E}_q[\text{log-lik}] = \lambda_{\text{lik}}$

    So Bayes’ rule can be written as (for an arbitrary $q$):
    $\lambda_{\text{post}} \leftarrow \nabla_\mu \mathbb{E}_q[\text{log-lik} + \text{log-prior}]$
    $\lambda_{\text{post}} \leftarrow \lambda_{\text{lik}} + \lambda_{\text{prior}}$

    As an analogy, think of least squares = 1 step of Newton.

    [1] Khan, Variational-Bayes Made Easy, AABI 2023.
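
The claim that the gradient with respect to μ returns the natural parameter can be checked numerically. The sketch below assumes a one-dimensional Gaussian q and a Gaussian log-likelihood with observation `y = 2` and noise precision `s_y = 3` (both illustrative choices), for which the natural parameters are (s_y·y, −s_y/2) = (6, −1.5). The expectation is computed by trapezoidal quadrature and differentiated by central differences in μ-space.

```python
import math


def expected_value(f, m, v, half_width=10.0, n=4000):
    """Trapezoidal approximation of E_{N(m, v)}[f(theta)]."""
    sd = math.sqrt(v)
    lo = m - half_width * sd
    step = 2.0 * half_width * sd / n
    total = 0.0
    for i in range(n + 1):
        theta = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0
        density = math.exp(-0.5 * (theta - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        total += weight * f(theta) * density
    return total * step


def expected_loglik(mu1, mu2, y=2.0, s_y=3.0):
    """E_q[log-lik] as a function of the expectation parameters (mu1, mu2)."""
    m, v = mu1, mu2 - mu1 * mu1            # map expectation parameters to (mean, variance)
    return expected_value(lambda t: -0.5 * s_y * (t - y) ** 2, m, v)


def grad_mu(mu1, mu2, eps=1e-4):
    """Central-difference gradient of E_q[log-lik] with respect to (mu1, mu2)."""
    d1 = (expected_loglik(mu1 + eps, mu2) - expected_loglik(mu1 - eps, mu2)) / (2.0 * eps)
    d2 = (expected_loglik(mu1, mu2 + eps) - expected_loglik(mu1, mu2 - eps)) / (2.0 * eps)
    return d1, d2


if __name__ == "__main__":
    # q = N(0.5, 2.0), so mu = (0.5, 2.0 + 0.25).
    print(grad_mu(0.5, 2.25))
```

Because the expected log-likelihood is linear in μ, the result does not depend on which q is used to evaluate it, which is why the slide can say "for an arbitrary q".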
  10. Approximate Bayes

    Bayes rule: $\text{posterior} \propto \text{lik} \times \text{prior}$

    Bayes as optimization [1], aka variational inference: minimize over $q$
    $-\mathbb{E}_q[\text{log-lik}] + \mathrm{KL}(q \,\|\, \text{prior})$

    Generalized approximate Bayesian learning:
    $\min_{q \in \mathcal{Q}} \; \mathbb{E}_{q(\theta)}[\ell(\theta)] - \mathcal{H}(q)$
    where $\mathcal{Q}$ is the posterior approximation (expo-family), $\mathcal{H}(q)$ is the entropy, and $\ell(\theta)$ generalizes $-(\text{log-lik} + \text{log-prior})$.

    [1] Zellner, Optimal information processing and Bayes’s theorem, The American Statistician, 1988.
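
A short derivation may help show why the entropy-regularized objective recovers Bayes' rule. It assumes that Q contains all distributions and that the normalizer Z defined below is finite; both assumptions are added here and are not stated on the slide.

```latex
\begin{aligned}
\mathbb{E}_{q}[\ell(\theta)] - \mathcal{H}(q)
  &= \int q(\theta)\,\ell(\theta)\,d\theta + \int q(\theta)\log q(\theta)\,d\theta \\
  &= \int q(\theta)\log\frac{q(\theta)}{e^{-\ell(\theta)}}\,d\theta \\
  &= \mathrm{KL}\!\left(q \,\middle\|\, \tfrac{1}{Z}\,e^{-\ell}\right) - \log Z,
  \qquad Z := \int e^{-\ell(\theta)}\,d\theta .
\end{aligned}
```

Since the KL term is non-negative and vanishes only when q equals its second argument, the minimizer is q*(θ) ∝ exp(−ℓ(θ)). Setting ℓ = −(log-lik + log-prior) gives q* ∝ lik × prior, the Bayes posterior.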
  11. The Bayesian Learning Rule

    $\min_\theta \; \ell(\theta)$ vs $\min_{q \in \mathcal{Q}} \; \mathbb{E}_{q(\theta)}[\ell(\theta)] - \mathcal{H}(q)$
    where $\mathcal{Q}$ is the posterior approximation (expo-family) and $\mathcal{H}(q)$ is the entropy.

    Bayesian Learning Rule [1,2] (natural-gradient descent):
    $\lambda \leftarrow \lambda - \rho\nabla_\mu\left\{\mathbb{E}_q[\ell(\theta)] - \mathcal{H}(q)\right\}$
    Here $\lambda$ and $\mu$ are the natural and expectation parameters of $q$. The first term is the old belief; the second is new information = natural gradients.

    Exploiting the posterior’s information geometry to derive existing algorithms as special instances by approximating $q$ and the natural gradients.

    [1] Khan and Rue, The Bayesian Learning Rule, JMLR, 2023.
    [2] Khan and Lin, Conjugate-computation variational inference…, AISTATS, 2017.
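
The rule can be sketched on a conjugate model, where the natural gradient is available in closed form. The sketch uses the identity ∇_μ H(q) = −λ, which turns the update into λ ← (1 − ρ)λ − ρ∇_μ E_q[ℓ(θ)], and assumes ℓ = −(log-lik + log-prior) with both terms linear in μ. The numeric natural parameters and the function names are illustrative assumptions.

```python
def blr_step(lam, grad_mu, rho):
    """One BLR update: lam <- (1 - rho) * lam - rho * grad_mu, componentwise."""
    return tuple((1.0 - rho) * l - rho * g for l, g in zip(lam, grad_mu))


def grad_mu_conjugate(lam_lik, lam_prior):
    """For l = -(log-lik + log-prior) linear in mu, the gradient is -(lam_lik + lam_prior)."""
    return tuple(-(a + b) for a, b in zip(lam_lik, lam_prior))


def run_blr(lam0, lam_lik, lam_prior, rho, steps):
    """Iterate the BLR from lam0 with a fixed learning rate rho."""
    lam = lam0
    grad = grad_mu_conjugate(lam_lik, lam_prior)
    for _ in range(steps):
        lam = blr_step(lam, grad, rho)
    return lam


if __name__ == "__main__":
    lam_lik, lam_prior = (6.0, -1.5), (0.0, -0.5)
    print(run_blr(lam_prior, lam_lik, lam_prior, rho=1.0, steps=1))
    print(run_blr(lam_prior, lam_lik, lam_prior, rho=0.5, steps=60))
```

With ρ = 1 a single step lands on λ_lik + λ_prior, the exact conjugate posterior; with ρ < 1 the iterates approach the same point geometrically.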
  12. Warning!

    • This natural gradient is different from the one we (often) encounter in machine learning for maximum likelihood
      – In MLE, the loss is the negative log of the probability distribution
      – Here, the loss and the distribution are two different entities, possibly even unrelated

    MLE: $\min_\theta -\log q(\theta) \;\Rightarrow\; F(\theta)^{-1}\nabla_\theta \log q(\theta)$
    Here: $\min_q \mathbb{E}_q[\ell(\theta)] - \mathcal{H}(q) \;\Rightarrow\; F(\lambda)^{-1}\nabla_\lambda \mathbb{E}_q[\ell(\theta)]$
  13. Bayesian learning rule:

    Abbreviations: cov. → covariance, STE → Straight-Through-Estimator, VI → Variational Inference, VMP → Variational Message Passing.

    Learning Algorithm | Posterior Approx. | Natural-Gradient Approx. | Sec.

    Optimization Algorithms
    Gradient Descent | Gaussian (fixed cov.) | Delta method | 1.3
    Newton’s method | Gaussian | same as above | 1.3
    Multimodal optimization (New) | Mixture of Gaussians | same as above | 3.2

    Deep-Learning Algorithms
    Stochastic Gradient Descent | Gaussian (fixed cov.) | Delta method, stochastic approx. | 4.1
    RMSprop/Adam | Gaussian (diagonal cov.) | Delta method, stochastic approx., Hessian approx., square-root scaling, slow-moving scale vectors | 4.2
    Dropout | Mixture of Gaussians | Delta method, stochastic approx., responsibility approx. | 4.3
    STE | Bernoulli | Delta method, stochastic approx. | 4.5
    Online Gauss-Newton (OGN) (New) | Gaussian (diagonal cov.) | Gauss-Newton Hessian approx. in Adam & no square-root scaling | 4.4
    Variational OGN (New) | same as above | Remove delta method from OGN | 4.4
    BayesBiNN (New) | Bernoulli | Remove delta method from STE | 4.5

    Approximate Bayesian Inference Algorithms
    Conjugate Bayes | Exp-family | Set learning rate $\rho_t = 1$ | 5.1
    Laplace’s method | Gaussian | Delta method | 4.4
    Expectation-Maximization | Exp-family + Gaussian | Delta method for the parameters | 5.2
    Stochastic VI (SVI) | Exp-family (mean-field) | Stochastic approx., local $\rho_t = 1$ | 5.3
    VMP | same as above | $\rho_t = 1$ for all nodes | 5.3
    Non-Conjugate VMP | same as above | same as above | 5.3
    Non-Conjugate VI (New) | Mixture of Exp-family | None | 5.4

    2.1 Bayesian learning rule as natural-gradient descent
  14. Gradient Descent from BLR

    Derived by choosing a Gaussian with fixed covariance.

    Gaussian distribution: $q(\theta) := \mathcal{N}(m, 1)$
    Natural parameters: $\lambda := m$
    Expectation parameters: $\mu := \mathbb{E}_q[\theta] = m$
    Entropy: $\mathcal{H}(q) := \log(2\pi)/2$

    $\lambda \leftarrow \lambda - \rho\nabla_\mu\left(\mathbb{E}_q[\ell(\theta)] - \mathcal{H}(q)\right)$
    $m \leftarrow m - \rho\nabla_m \mathbb{E}_q[\ell(\theta)]$

    “Global” to “local” (the delta method): $\mathbb{E}_q[\ell(\theta)] \approx \ell(m)$

    BLR: $m \leftarrow m - \rho\nabla_m \ell(m)$
    GD: $\theta \leftarrow \theta - \rho\nabla_\theta \ell(\theta)$

    See Section 1.3.1 in Khan and Rue, 2021.
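
The "global" versus "local" distinction can be sketched with a toy loss for which the Gaussian expectation has a closed form. The loss ℓ(θ) = θ⁴/4 − θ is an assumption chosen for illustration. For q = N(m, 1), E[θ³] = m³ + 3m, so the exact gradient of E_q[ℓ(θ)] with respect to m is m³ + 3m − 1, while the delta method uses ℓ'(m) = m³ − 1.

```python
def grad_loss(theta):
    """Gradient of the toy loss l(theta) = theta**4 / 4 - theta (delta-method gradient)."""
    return theta ** 3 - 1.0


def grad_expected_loss(m):
    """Exact gradient of E_{N(m, 1)}[l(theta)] w.r.t. m: E[theta^3] - 1 = m^3 + 3m - 1."""
    return m ** 3 + 3.0 * m - 1.0


def descend(grad_fn, m0=0.0, rho=0.1, steps=200):
    """Iterate m <- m - rho * grad_fn(m)."""
    m = m0
    for _ in range(steps):
        m = m - rho * grad_fn(m)
    return m


if __name__ == "__main__":
    print("delta method (plain GD):", descend(grad_loss))
    print("exact expectation (BLR):", descend(grad_expected_loss))
```

The delta-method iteration is ordinary gradient descent and settles at the minimizer of ℓ, whereas the exact-expectation iteration settles at a different point because it averages the loss over the unit-variance Gaussian around m.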
  15. Newton’s Method from BLR 19 Derived by choosing a multivariate

    Gaussian distribution: $q(\theta) := \mathcal{N}(\theta \mid m, S^{-1})$

    Natural parameters: $\lambda := \{Sm, -S/2\}$

    Expectation parameters: $\mu := \{\mathbb{E}_q(\theta), \mathbb{E}_q(\theta\theta^\top)\}$

    BLR: $\lambda \leftarrow \lambda - \rho \nabla_\mu \left( \mathbb{E}_q[\ell(\theta)] - \mathcal{H}(q) \right)$

    Since $\nabla_\mu \mathcal{H}(q) = -\lambda$, this is $\lambda \leftarrow \lambda - \rho \left( \nabla_\mu \mathbb{E}_q[\ell(\theta)] + \lambda \right) = (1-\rho)\lambda - \rho \nabla_\mu \mathbb{E}_q[\ell(\theta)]$

    For the Gaussian:

    $Sm \leftarrow (1-\rho) Sm - \rho \nabla_{\mathbb{E}_q(\theta)} \mathbb{E}_q[\ell(\theta)]$

    $-\tfrac{1}{2} S \leftarrow (1-\rho)\left(-\tfrac{1}{2} S\right) - \rho \nabla_{\mathbb{E}_q(\theta\theta^\top)} \mathbb{E}_q[\ell(\theta)]$, i.e. $S \leftarrow (1-\rho) S + 2\rho \nabla_{\mathbb{E}_q(\theta\theta^\top)} \mathbb{E}_q[\ell(\theta)]$

    Newton's method: $\theta \leftarrow \theta - H_\theta^{-1} [\nabla_\theta \ell(\theta)]$

    1. Khan, et al. "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).

    See Section 1.3.2 in Khan and Rue, 2021
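    A minimal numerical sketch of the update $\lambda \leftarrow (1-\rho)\lambda - \rho\nabla_\mu \mathbb{E}_q[\ell(\theta)]$ for a one-dimensional Gaussian $q = \mathcal{N}(m, 1/S)$ with natural parameters $(Sm, -S/2)$. The quadratic loss, the constants `a` and `b`, and the function names are assumptions chosen for illustration and are not from the slides. For this loss the expectations are exact, and the gradients with respect to the expectation parameters follow the standard identities $\nabla_{\mathbb{E}_q(\theta)} \mathbb{E}_q[\ell] = \mathbb{E}_q[\nabla \ell] - \mathbb{E}_q[H] m$ and $\nabla_{\mathbb{E}_q(\theta\theta^\top)} \mathbb{E}_q[\ell] = \tfrac{1}{2}\mathbb{E}_q[H]$.

```python
def blr_gaussian_step(m, S, a, b, rho):
    """One BLR step for q = N(m, 1/S) on the assumed loss l(t) = 0.5*a*(t-b)**2."""
    # Gradient and Hessian of the quadratic loss, averaged over q (exact here).
    exp_grad = a * (m - b)   # E_q[l'(theta)]
    exp_hess = a             # E_q[l''(theta)]
    # Gradients with respect to the expectation parameters mu.
    grad_mu1 = exp_grad - exp_hess * m
    grad_mu2 = 0.5 * exp_hess
    # Natural parameters lambda = (S*m, -S/2), updated as (1-rho)*lambda - rho*grad_mu.
    lam1 = (1.0 - rho) * (S * m) - rho * grad_mu1
    lam2 = (1.0 - rho) * (-0.5 * S) - rho * grad_mu2
    # Map back to mean and precision.
    S_new = -2.0 * lam2
    m_new = lam1 / S_new
    return m_new, S_new


def run_blr(a=4.0, b=1.5, rho=0.3, steps=100):
    """Iterate the BLR step from an arbitrary starting Gaussian N(0, 1)."""
    m, S = 0.0, 1.0
    for _ in range(steps):
        m, S = blr_gaussian_step(m, S, a, b, rho)
    return m, S


m, S = run_blr()
print(m, S)
```

    In this toy setting the iteration settles at mean `b` and precision `a`, which is the Gaussian proportional to $e^{-\ell(\theta)}$ for the assumed loss; with `rho=1` it gets there in a single step.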
  16. Newton’s Method from BLR

    BLR updates for the Gaussian natural parameters:

    $Sm \leftarrow (1-\rho) Sm - \rho \nabla_{\mathbb{E}_q(\theta)} \mathbb{E}_q[\ell(\theta)]$

    $S \leftarrow (1-\rho) S + 2\rho \nabla_{\mathbb{E}_q(\theta\theta^\top)} \mathbb{E}_q[\ell(\theta)]$

    Express in terms of gradient and Hessian of loss:

    $\nabla_{\mathbb{E}_q(\theta)} \mathbb{E}_q[\ell(\theta)] = \mathbb{E}_q[\nabla_\theta \ell(\theta)] - \mathbb{E}_q[H_\theta] m$

    $\nabla_{\mathbb{E}_q(\theta\theta^\top)} \mathbb{E}_q[\ell(\theta)] = \tfrac{1}{2} \mathbb{E}_q[H_\theta]$

    Delta method: $\mathbb{E}_q[\ell(\theta)] \approx \ell(m)$, which gives

    $m \leftarrow m - \rho S^{-1} \nabla_m \ell(m)$

    $S \leftarrow (1-\rho) S + \rho H_m$

    Set $\rho = 1$ to get $m \leftarrow m - H_m^{-1} [\nabla_m \ell(m)]$, which is Newton's method: $\theta \leftarrow \theta - H_\theta^{-1} [\nabla_\theta \ell(\theta)]$

    1. Khan, et al. "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).

    See Section 1.3.2 in Khan and Rue, 2021
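    A small sketch of the two delta-method updates, $S \leftarrow (1-\rho) S + \rho H_m$ and $m \leftarrow m - \rho S^{-1} \nabla_m \ell(m)$, in one dimension. The test loss, the function name `blr_newton`, and the choice to use the freshly updated $S$ in the mean step are assumptions made for illustration; the slide does not say which $S$ is used.

```python
import math


def blr_newton(grad, hess, m0, S0=1.0, rho=1.0, steps=50):
    """Iterate S <- (1-rho)*S + rho*H(m), then m <- m - rho*grad(m)/S, in one dimension."""
    m, S = m0, S0
    for _ in range(steps):
        g, H = grad(m), hess(m)   # evaluated at the current mean (delta method)
        S = (1.0 - rho) * S + rho * H
        m = m - rho * g / S       # assumption: the updated precision is used here
    return m, S


# Assumed test loss l(t) = exp(t) - 2*t, which is convex with minimiser log(2).
grad = lambda t: math.exp(t) - 2.0
hess = lambda t: math.exp(t)

m_newton, S_newton = blr_newton(grad, hess, m0=0.0, rho=1.0)
m_damped, S_damped = blr_newton(grad, hess, m0=0.0, rho=0.5, steps=200)
print(m_newton, m_damped)
```

    With `rho=1.0` the precision is replaced by the Hessian at every step, so the loop is plain Newton iteration; a smaller `rho` averages Hessians over iterations and takes damped steps.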
  17. RMSprop/Adam from BLR

    BLR for Gaussian approximation:

    $S \leftarrow (1-\rho) S + \rho (H_\theta)$

    $m \leftarrow m - \alpha S^{-1} \nabla_\theta \ell(\theta)$

    To get RMSprop, make the following choices:
    • Restrict covariance to be diagonal
    • Replace Hessian by square of gradients
    • Add square root for scaling vector

    RMSprop:

    $s \leftarrow (1-\rho) s + \rho [\hat{\nabla} \ell(\theta)]^2$

    $\theta \leftarrow \theta - \alpha (\sqrt{s} + \delta)^{-1} \hat{\nabla} \ell(\theta)$

    For Adam, use a heavy-ball term with KL divergence as momentum (Appendix E in [1]).

    1. Khan, et al. "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).

    See Section 4.2 in Khan and Rue, 2021
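    The RMSprop updates that result from these three choices, $s \leftarrow (1-\rho) s + \rho [\hat{\nabla}\ell(\theta)]^2$ and $\theta \leftarrow \theta - \alpha (\sqrt{s} + \delta)^{-1} \hat{\nabla}\ell(\theta)$, can be sketched in a few lines. The toy quadratic loss, the hyperparameter values, and the function name `rmsprop` are assumptions for illustration only.

```python
import math


def rmsprop(grad, theta, alpha=0.01, rho=0.1, delta=1e-8, steps=2000):
    """Diagonal RMSprop on a list of parameters: one scale s[i] per coordinate."""
    s = [0.0] * len(theta)
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            # Squared gradient stands in for the (diagonal) Hessian.
            s[i] = (1.0 - rho) * s[i] + rho * g[i] ** 2
            # Square root of the scale vector, plus a small damping constant.
            theta[i] -= alpha * g[i] / (math.sqrt(s[i]) + delta)
    return theta, s


# Assumed toy loss 0.5 * sum((theta - target)**2), whose gradient is theta - target.
target = [3.0, -1.0]
grad = lambda th: [t - c for t, c in zip(th, target)]

theta, s = rmsprop(grad, [0.0, 0.0])
print(theta)
```

    Because the step is normalised by $\sqrt{s}$, the iterate moves at a rate of roughly `alpha` per step regardless of gradient magnitude and then hovers near the minimiser rather than converging exactly.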
  18. BLR for large deep networks

    RMSprop/Adam:

    $\hat{g} \leftarrow \hat{\nabla} \ell(\theta)$

    $\hat{h} \leftarrow \hat{g}^2$

    $h \leftarrow (1-\rho) h + \rho \hat{h}$

    $\theta \leftarrow \theta - \alpha (\hat{g} + \delta\theta) / (\sqrt{h} + \delta)$

    BLR variant, Improved Variational Online Newton (IVON):

    $\hat{g} \leftarrow \hat{\nabla} \ell(\theta)$ where $\theta \sim \mathcal{N}(m, \sigma^2)$

    $\hat{h} \leftarrow \hat{g} \cdot (\theta - m) / \sigma^2$

    $h \leftarrow (1-\rho) h + \rho \hat{h} + \rho^2 (h - \hat{h})^2 / (2(h + \delta))$

    $m \leftarrow m - \alpha (\hat{g} + \delta m) / (h + \delta)$

    $\sigma^2 \leftarrow 1 / (N(h + \delta))$

    Only tune initial value of h (a scalar).

    1. Khan, et al. "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
    2. Osawa et al. "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
    3. Lin et al. "Handling the positive-definite constraints in the BLR." ICML (2020).
    4. Shen et al. "Variational Learning is Effective for Large Deep Networks." Under review (2024).
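    A scalar sketch of the IVON updates listed on this slide (no momentum): sample $\theta \sim \mathcal{N}(m, \sigma^2)$, form $\hat{g}$ and $\hat{h} = \hat{g}(\theta - m)/\sigma^2$, then update $h$, $m$ and $\sigma^2 = 1/(N(h+\delta))$. The toy loss, the hyperparameter values, and the function name `ivon_step` are assumptions for illustration, and this is not the authors' released implementation.

```python
import math
import random


def ivon_step(m, h, grad, N, alpha, rho, delta, rng):
    """One scalar IVON step without momentum, following the updates on the slide."""
    # sigma^2 is recomputed from the current h, equivalent to updating it last.
    sigma2 = 1.0 / (N * (h + delta))
    theta = m + math.sqrt(sigma2) * rng.gauss(0.0, 1.0)   # theta ~ N(m, sigma^2)
    g_hat = grad(theta)
    h_hat = g_hat * (theta - m) / sigma2                  # Hessian estimate from the gradient
    # The last term keeps h + delta positive even when h_hat is negative.
    h = (1.0 - rho) * h + rho * h_hat + rho ** 2 * (h - h_hat) ** 2 / (2.0 * (h + delta))
    m = m - alpha * (g_hat + delta * m) / (h + delta)
    return m, h


rng = random.Random(0)
grad = lambda t: t - 2.0   # assumed loss 0.5*(t-2)**2, whose Hessian is 1
m, h = 0.0, 1.0            # h is initialised by hand
N, alpha, rho, delta = 1, 0.01, 0.01, 1e-3
for _ in range(5000):
    m, h = ivon_step(m, h, grad, N, alpha, rho, delta, rng)
sigma2 = 1.0 / (N * (h + delta))
print(m, h, sigma2)
```

    On this toy problem the mean drifts towards the minimiser and `h` fluctuates around the true Hessian, so `sigma2` ends up near $1/(N(1+\delta))$; the estimates stay noisy because each step uses a single sample.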
  19. IVON [3] got 1st prize in NeurIPS 2021 Approximate Inference Challenge

    Watch Thomas Moellenhoff’s talk at https://www.youtube.com/watch?v=LQInlN5EU7E.

    1. Khan, et al. "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
    2. Osawa et al. "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
    3. Lin et al. "Handling the positive-definite constraints in the BLR." ICML (2020).
  20. GPT-2 with Bayes

    Better performance and uncertainty at the same cost.

    BLR (IVON) [3]

    Trained on OpenWebText data (49.2B tokens). On the 773M model, we get a gain of 0.5 in perplexity. On the 355M model, we get a gain of 0.4 in perplexity.

    1. Khan, et al. "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
    2. Osawa et al. "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
    3. Shen et al. "Variational Learning is Effective for Large Deep Networks." (Under review)
  21. GPT-2 with Bayes

    Posterior averaging improves the result. Can also train in low precision (a stable optimizer).

    Figure 12 (from "Variational Learning is Effective for Large Deep Networks"): IVON results trained on multi-GPU setups with a different random seed on each machine (left) and validation perplexity for GPT-2 trained with IVON evaluated at mean and posterior predictive on OpenWebText (right).

    1. Khan, et al. "Fast and scalable Bayesian deep learning by weight-perturbation in Adam." ICML (2018).
    2. Osawa et al. "Practical Deep Learning with Bayesian Principles." NeurIPS (2019).
    3. Shen et al. "Variational Learning is Effective for Large Deep Networks." (Under review)
  22. ImageNet on ResNet-50 (25.6M)

    2% better accuracy over AdamW and 1% over SGD. Better calibration (ECE of 0.022 vs 0.066).

    [Figure panels from "Variational Learning is Effective for Large Deep Networks": (b) ResNet-50 on ImageNet, (c) Calibration on ImageNet]
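  23. ImageNet on ResNet-50 (25.6M)

    No severe overfitting like AdamW, while consistently improving accuracy over SGD, and better uncertainty.

    Table from "Variational Learning is Effective for Large Deep Networks" (↑ higher is better, ↓ lower is better):

    Dataset & Model | Epochs | Method | Top-1 Acc. ↑ | Top-5 Acc. ↑ | NLL ↓ | ECE ↓ | Brier ↓
    ImageNet-1k, ResNet-50 (25.6M params) | 100 | AdamW | 74.56±0.24 | 92.05±0.17 | 1.018±0.012 | 0.043±0.001 | 0.352±0.003
    ImageNet-1k, ResNet-50 (25.6M params) | 100 | SGD | 76.18±0.09 | 92.94±0.05 | 0.928±0.003 | 0.019±0.001 | 0.330±0.001
    ImageNet-1k, ResNet-50 (25.6M params) | 100 | IVON@mean | 76.14±0.11 | 92.83±0.04 | 0.934±0.002 | 0.025±0.001 | 0.330±0.001
    ImageNet-1k, ResNet-50 (25.6M params) | 100 | IVON | 76.24±0.09 | 92.90±0.04 | 0.925±0.002 | 0.015±0.001 | 0.330±0.001
    ImageNet-1k, ResNet-50 (25.6M params) | 200 | AdamW | 75.16±0.14 | 92.37±0.03 | 1.018±0.003 | 0.066±0.002 | 0.349±0.002
    ImageNet-1k, ResNet-50 (25.6M params) | 200 | SGD | 76.63±0.45 | 93.21±0.25 | 0.917±0.026 | 0.038±0.009 | 0.326±0.006
    ImageNet-1k, ResNet-50 (25.6M params) | 200 | IVON@mean | 77.30±0.08 | 93.58±0.05 | 0.884±0.002 | 0.035±0.002 | 0.316±0.001
    ImageNet-1k, ResNet-50 (25.6M params) | 200 | IVON | 77.46±0.07 | 93.68±0.04 | 0.869±0.002 | 0.022±0.002 | 0.315±0.001
    TinyImageNet, ResNet-18 (11M params, wide) | 200 | AdamW | 47.33±0.90 | 71.54±0.95 | 6.823±0.235 | 0.421±0.008 | 0.913±0.018
    TinyImageNet, ResNet-18 (11M params, wide) | 200 | SGD | 61.39±0.18 | 82.30±0.22 | 1.811±0.010 | 0.138±0.002 | 0.536±0.002
    TinyImageNet, ResNet-18 (11M params, wide) | 200 | IVON@mean | 62.41±0.15 | 83.77±0.18 | 1.776±0.018 | 0.150±0.005 | 0.532±0.002
    TinyImageNet, ResNet-18 (11M params, wide) | 200 | IVON | 62.68±0.16 | 84.12±0.24 | 1.528±0.010 | 0.019±0.004 | 0.491±0.001
    TinyImageNet, PreResNet-110 (4M params, deep) | 200 | AdamW | 50.65±0.0* | 74.94±0.0* | 4.487±0.0* | 0.357±0.0* | 0.812±0.0*
    TinyImageNet, PreResNet-110 (4M params, deep) | 200 | AdaHessian | 55.03±0.53 | 78.49±0.34 | 2.971±0.064 | 0.272±0.005 | 0.690±0.008
    TinyImageNet, PreResNet-110 (4M params, deep) | 200 | SGD | 59.39±0.50 | 81.34±0.30 | 2.040±0.040 | 0.176±0.006 | 0.577±0.007
    TinyImageNet, PreResNet-110 (4M params, deep) | 200 | IVON@mean | 60.85±0.39 | 83.89±0.14 | 1.584±0.009 | 0.053±0.002 | 0.514±0.003
    TinyImageNet, PreResNet-110 (4M params, deep) | 200 | IVON | 61.25±0.48 | 84.13±0.17 | 1.550±0.009 | 0.049±0.002 | 0.511±0.003
    CIFAR-100, ResNet-18 (11M params, wide) | 200 | AdamW | 64.12±0.43 | 86.85±0.51 | 3.357±0.071 | 0.278±0.005 | 0.615±0.008
    CIFAR-100, ResNet-18 (11M params, wide) | 200 | SGD | 74.46±0.17 | 92.66±0.06 | 1.083±0.007 | 0.113±0.001 | 0.376±0.001
    CIFAR-100, ResNet-18 (11M params, wide) | 200 | IVON@mean | 74.51±0.24 | 92.74±0.19 | 1.284±0.013 | 0.152±0.003 | 0.399±0.002
    CIFAR-100, ResNet-18 (11M params, wide) | 200 | IVON | 75.14±0.34 | 93.30±0.19 | 0.912±0.009 | 0.021±0.003 | 0.344±0.003

    Top-1 gains of IVON at 200 epochs, in table order (ImageNet-1k, TinyImageNet ResNet-18, TinyImageNet PreResNet-110, CIFAR-100): +2%, +15%, +10%, +11% over AdamW; +1%, +1%, +2%, +.7% over SGD.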
  24. Sensitivity to data is easy to compute “during” training.

    MNIST on MLP. Also works at large scale (ImageNet).

    1. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS, 2023
  25. Sensitivity to Training Data

    Past information with most influence on the present.

    [Figure: training on the full dataset vs. retraining without the i’th example, from start to current iteration; truth vs. estimated]

    Estimating it without retraining: using the BLR, we can recover all sorts of influence criteria used in the literature.
  26. Memory Perturbation

    How sensitive is a model to its training data?

    [Figure: old model, new data, new model]

    Deviation (Δ) = prediction error × prediction variance

    1. Cook. Detection of Influential Observations in Linear Regression. Technometrics. ASA 1977
    2. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS, 2023
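    For plain least squares, the setting of Cook (1977), the "prediction error × prediction variance" product can be checked against brute-force retraining. The sketch below assumes a one-dimensional regression through the origin with unit noise, so the prediction variance at $x_i$ is the leverage $x_i^2 / \sum_j x_j^2$; the data and function names are invented for illustration. In this special case the exact leave-one-out change in the prediction is the product divided by (1 − variance), so the product alone is a first-order estimate.

```python
def fit(xs, ys):
    """Least-squares slope for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def loo_deviation(xs, ys, i):
    """Change in the prediction at x_i when example i is removed."""
    w = fit(xs, ys)
    error = ys[i] - w * xs[i]                        # prediction error of the full model
    variance = xs[i] ** 2 / sum(x * x for x in xs)   # leverage = prediction variance (unit noise)
    estimate = error * variance                      # error * variance, no retraining
    closed_form = estimate / (1.0 - variance)        # exact for least squares
    w_loo = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    retrained = (w - w_loo) * xs[i]                  # brute-force retraining
    return estimate, closed_form, retrained


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0.4, 1.1, 1.4, 2.3, 2.4, 3.9]   # invented data; the last point sits off the trend
for i in range(len(xs)):
    print(i, loo_deviation(xs, ys, i))
```

    Examples with both a large error and a large variance get the largest deviations, which is the sense in which the product ranks training examples by influence.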
  27. Memory Maps using the BLR

    Understand generic ML models and algorithms.

    [Figure: memory map with axes Prediction Variance and Prediction Error; regions labelled Regular examples, Unpredictable, Uncertain; … least (right) memorable]

    1. Tailor, Chang, Swaroop, Nalisnick, Solin, Khan, Memory maps to understand models (under review)
  28. Predict Generalization during Training

    Leave-One-Out estimates on training data and during training, compared with generalization on test data (NLL).

    [Figure: (a) Effect of removing an example; training on the full dataset, iterations up to the current one]

    CIFAR10 on ResNet-20 using IVON. SGD or Adam do not work as well.
  29. Answering “What-If” Questions

    What if we removed a class from MNIST?

    [Figure: test performance (NLL) by brute-force retraining vs. estimates on training data (no retraining)]
  30. Answering “What-If” Questions

    What if we merge fine-tuned large language models?

    [Figure: RoBERTa on IMDB; gradient mismatch vs. difference in test error; Task Arithmetic vs. Ours]

    1. Daheim et al. Model merging by uncertainty-based gradient matching, ICLR (2024).
  31. SAM as an Optimal Relaxation of Bayes

    Bayes: $\mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2)} [\ell(\theta + \epsilon)]$

    SAM: $\sup_{|\epsilon| < \rho} \ell(\theta + \epsilon)$

    Our work: Fenchel biconjugate

    1. Foret et al. Sharpness-Aware Minimization for Efficiently Improving Generalization, ICLR, 2021
    2. Moellenhoff and Khan, SAM as an Optimal Relaxation of Bayes, Under review, 2022
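    To make the two objectives concrete, the sketch below evaluates the Bayes objective $\mathbb{E}_{\epsilon \sim \mathcal{N}(0,\sigma^2)}[\ell(\theta + \epsilon)]$ and the SAM objective $\sup_{|\epsilon| < \rho} \ell(\theta + \epsilon)$ for a one-dimensional loss. The quadrature rule, grid sizes, test loss, and function names are assumptions for illustration; the sketch only computes the two quantities and does not reproduce the Fenchel-biconjugate argument.

```python
import math


def bayes_objective(loss, theta, sigma, n=4001, width=8.0):
    """Gaussian-smoothed loss via the trapezoidal rule on [-width*sigma, width*sigma]."""
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        eps = lo + k * step
        density = math.exp(-0.5 * (eps / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if k in (0, n - 1) else 1.0   # trapezoid end-point weights
        total += weight * density * loss(theta + eps) * step
    return total


def sam_objective(loss, theta, rho, n=2001):
    """Worst-case loss over the ball of radius rho, approximated on a grid with both end points."""
    return max(loss(theta - rho + 2.0 * rho * k / (n - 1)) for k in range(n))


loss = lambda t: t ** 2   # assumed test loss
bayes_value = bayes_objective(loss, theta=1.0, sigma=0.5)
sam_value = sam_objective(loss, theta=1.0, rho=0.5)
print(bayes_value, sam_value)
```

    For the quadratic test loss the smoothed value is $\theta^2 + \sigma^2$ and the worst-case value is $(|\theta| + \rho)^2$, which gives a quick sanity check on both functions.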
  32. Bayesian Learning Rule [1]

    • Bridge DL & Bayesian learning [2-5]
      – SOTA on GPT-2 and ImageNet [5]
    • Improve DL [5-7]
      – Calibration, uncertainty, memory etc.
      – Understand and fix model behavior
    • Towards human-like quick adaptation

    1. Khan and Rue, The Bayesian Learning Rule, JMLR (2023).
    2. Khan, et al. Fast and scalable Bayesian deep learning by weight-perturbation in Adam, ICML (2018).
    3. Osawa et al. Practical Deep Learning with Bayesian Principles, NeurIPS (2019).
    4. Lin et al. Handling the positive-definite constraints in the BLR, ICML (2020).
    5. Shen et al. Variational Learning is Effective for Large Deep Networks, Under review.
    6. Daheim et al. Model merging by uncertainty-based gradient matching, ICLR (2024).
    7. Nickl, Xu, Tailor, Moellenhoff, Khan, The memory-perturbation equation, NeurIPS (2023)
  33. The webpage is available at https://bayesduality.github.io/, and the Twitter account is @BayesDuality.

    Received total funding of around USD 3 million through JST’s CREST-ANR (2021-2027) and Kakenhi Grants (2019-2021).
  34. Team Approx-Bayes

    https://team-approx-bayes.github.io/

    Many thanks to our group members and collaborators (many not on this slide). We are always looking for new collaborations.