Vein of AlphaGo

AI milestone 1

PROFILE skydome20, author of the R 系列筆記 (R Series Notes)
SHAO-YEN HUNG (洪紹嚴)
2011 – 2015: NCKU CSIE
2015 – 2018: NCKU IMIS
2018 – Now: CTL Data Science; R and Python Learner
https://www.linkedin.com/in/skydome20/ 2

Prerequisite 2016-03-09: live broadcast of the first game between AlphaGo and Lee Sedol (李世乭) 3

Prerequisite 《Mastering the game of Go with deep neural networks
and tree search》 4

https://www.inside.com.tw/2017/11/10/aja-alphago-zero 5

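黃士傑 (Aja Huang) https://www.facebook.com/aja.huang
• Member of the first class of the Graduate Institute of Computer Science and Information Engineering at National Taiwan Normal University (NTNU), where he studied for both his master's and his PhD
• 2003: master's thesis, 《電腦圍棋打劫的策略》 (Ko-fight strategies in computer Go)
• 2011: doctoral thesis, 《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》 (New Heuristics for Monte Carlo Tree Search Applied to the Game of Go)
• Developed the Go program Erica, which beat Zen, the strongest program at the time
• Recruited by DeepMind's David Silver as employee No. 40
• 2014: DeepMind is acquired by Google
• 2014 ~ 2015: the Go AI project is restarted, now bringing in deep learning
• 《Move Evaluation in Go Using Deep Convolutional Neural Networks》
• 2016: AlphaGo appears and astonishes the world
• 《Mastering the game of Go with deep neural networks and tree search》 6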

• 2006: Rémi Coulom, 《Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search》 (CrazyStone)
• 2011: Aja Huang, 《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》 (Erica) http://pr.ntnu.edu.tw/news/index.php?mode=data&id=16487
• 2015: 《Move Evaluation in Go Using Deep Convolutional Neural Networks》 (Policy Network)
• 2016/2017: 《Mastering the game of Go with deep neural networks and tree search》 and 《Mastering the Game of Go without Human Knowledge》 (AlphaGo / AlphaGo Zero)
Photos: Aja Huang & Rémi Coulom; Aja Huang & Erica; Aja Huang & Lee Sedol (李世乭) 7

MCTS Policy Network Value Network Monte Carlo Tree Search (Rollout)
Self play 8

Policy Network 9

黃士傑 (Aja Huang) 2015 -《Move Evaluation in Go Using Deep
Convolutional Neural Networks》 https://www.facebook.com/aja.huang Policy Network 11

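Problem Description
The board has 19 x 19 = 361 points, and each point is in one of three states (white: 1) (black: -1) (empty: 0).
Board state: $\vec{s} = (1, 0, -1, 0, \ldots)$
Given a state $\vec{s}$, and ignoring for now the points where a stone cannot be played, the space of possible next moves is 361-dimensional.
Move (action): $\vec{a} = (0, 0, 1, 0, \ldots)$
(Definition of the Go problem) For any state $\vec{s}$, find the optimal move policy $\vec{a}$ that wins the most territory. 12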
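A minimal sketch of this encoding in plain Python. The helper names `empty_board`, `place`, and `one_hot_action` are hypothetical, the board is flattened row by row (an assumption the slide does not state), and no capture or legality rules are modeled.

```python
SIZE = 19
WHITE, BLACK, EMPTY = 1, -1, 0  # point states as defined on the slide


def empty_board():
    """Return the state vector s: 361 entries, all empty."""
    return [EMPTY] * (SIZE * SIZE)


def place(board, row, col, color):
    """Return a new state with a stone of `color` at (row, col); no capture rules."""
    idx = row * SIZE + col
    if board[idx] != EMPTY:
        raise ValueError("point already occupied")
    new_board = list(board)
    new_board[idx] = color
    return new_board


def one_hot_action(row, col):
    """Return the action vector a: 361 entries with a single 1 at the chosen point."""
    action = [0] * (SIZE * SIZE)
    action[row * SIZE + col] = 1
    return action


s = place(empty_board(), 3, 3, BLACK)  # state after Black plays at (3, 3)
a = one_hot_action(15, 15)             # action "play at (15, 15)"
```

Both vectors have length 361, so a policy can be viewed as a function from one 361-dimensional vector to another.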

Policy Network: Supervised Learning from state $\vec{s} = (1, 0, -1, 0, \ldots)$ to action $\vec{a} = (0, 0, \ldots, 1, 0, \ldots)$ 13

Policy Network: 160,000 games, 29.4 million (s, a) pairs 14

Policy Network 15

Policy Network (uses a CNN to learn how humans play): $\vec{a} = h(\vec{s})$
• Input $\vec{s}$: 19 x 19 x 36
• Model: Convolutional Neural Network
• Output $\vec{a}$: 19 x 19 16
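The 19 x 19 output can be read as a probability for each point. A minimal sketch of that last step, assuming the network emits one raw score per point and that occupied points are simply masked out (both are illustrative assumptions, not details from the slide; the function name is hypothetical):

```python
import math


def move_probabilities(scores, board):
    """Softmax over raw scores, restricted to empty points (board value 0)."""
    legal = [i for i, stone in enumerate(board) if stone == 0]
    if not legal:
        raise ValueError("no empty point left")
    top = max(scores[i] for i in legal)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - top) for i in legal}
    total = sum(exps.values())
    probs = [0.0] * len(board)
    for i, e in exps.items():
        probs[i] = e / total
    return probs


# Toy 3-point "board": point 1 is occupied, so it gets probability 0
# even though its raw score is the highest.
probs = move_probabilities([2.0, 5.0, 2.0], [0, 1, 0])
```

Supervised learning then pushes the probability of the move the human actually played toward 1 for each (s, a) pair.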

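Policy Network $\vec{a} = h(\vec{s})$
Problem 1: do all the collected game records come from amateur players?
Problem 2: the network plays at roughly amateur 6-dan strength.
Problem 3: it cannot beat CrazyStone, the strongest program at the time. 17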

Monte Carlo Tree Search (Rollout) 18

Rémi Coulom
• 2006: 《Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search》
• Used Monte Carlo tree search to develop CrazyStone
• 2011: supervised Aja Huang's doctoral thesis, 《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》
• Huang developed Erica
• 2014~2015: (Huang) led the AlphaGo development team at DeepMind 20

Rémi Coulom https://www.inside.com.tw/2016/03/10/why-do-i-second-alphago 21

Rémi Coulom 2006 -《Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search》 Monte Carlo Tree Search 22

Monte Carlo Tree Search 23

Monte Carlo Tree Search: Stanislaw Ulam and Nicholas Constantine Metropolis, 《The Monte Carlo Method》 24

Monte Carlo Tree Search 25

Monte Carlo Tree Search Build up a tree and do
tree search Upper Confidence Bound 26

Monte Carlo Tree Search (worked example; every node is labeled wins / visits, 勝數 / 次數)
• Black picks a move at random. 27
• The playout ends in a win (Win: 1). 28
• The node is recorded as 1/1. 29
• Black takes the next action by value; this playout ends in a loss (Lose: 0). 30–31
• The node is updated to 1/2. 32
• Another playout ends in a win (Win: 1), and a second node is recorded as 1/1 next to the 1/2 node. 33–35
• After many playouts the nodes read 5/9, 1/8, 1/22, 3/5, ……, 4/7. 36–37
• A later playout ends in a loss (Lose: 0) and a new node 0/1 is added. 38–39
• Take action by value or random. 40

Monte Carlo Tree Search: the levels of the tree alternate between Black and White, with Black first. On Black's levels we Take Action; on White's levels we Observe Enemy. Both are combined as one MCT. 41

Monte Carlo Tree Search: the result of a playout (Black win / White win) is the reward r. Source: Wiki 42

Monte Carlo Tree Search: Build up a tree and do tree search; Upper Confidence Bound (UCB) 44

Monte Carlo Tree Search Take action by value or random?
Exploration & Exploitation ? root 3/6 1/8 …… Bandit Algorithms action? Black/ White 45

Monte Carlo Tree Search (Multi-armed) Bandit Problem https://www.onlinecasinoselite.org/post/casino-snakes-the-social-media-rumor-of-the-moment 46

Monte Carlo Tree Search (Multi-armed) Bandit Problem: a row of slot machines (one-armed bandits). How can you maximize the expected gain after a series of actions/choices? 47

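Monte Carlo Tree Search (Multi-armed) Bandit Problem: we face K fixed slot machines (for example, products to recommend or ads to show). With no prior knowledge and without knowing each machine's expected reward, we pick one machine on every trial. How do we maximize our expected reward over this sequence of choices? 48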

Monte Carlo Tree Search (Multi-armed) Bandit Algorithms: ε-first, ε-greedy, εn-greedy, Upper Confidence Bound (UCB1 / UCB2), LinUCB, Thompson sampling 49
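As one concrete instance from this list, here is a minimal sketch of ε-greedy: with probability ε pick a machine at random, otherwise pick the machine with the best average reward so far. The Bernoulli payout probabilities, the value ε = 0.1, and the function names are illustrative assumptions.

```python
import random


def epsilon_greedy_choose(means, epsilon, rng):
    """With probability epsilon explore a random machine, otherwise exploit the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return means.index(max(means))


def run_epsilon_greedy(payout_probs, rounds, epsilon=0.1, seed=0):
    """Return the number of pulls per machine and the total reward collected."""
    rng = random.Random(seed)
    k = len(payout_probs)
    means, counts, total_reward = [0.0] * k, [0] * k, 0.0
    for _ in range(rounds):
        i = epsilon_greedy_choose(means, epsilon, rng)
        reward = 1.0 if rng.random() < payout_probs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # running average for machine i
        total_reward += reward
    return counts, total_reward


counts, total_reward = run_epsilon_greedy([0.2, 0.5, 0.8], rounds=5000)
```

The exploration rate stays fixed at ε forever, which is the weakness that the confidence-bound methods in the same list try to address.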

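Monte Carlo Tree Search: Upper Confidence Bound
$$UCB_i = \bar{x}_i + \sqrt{\frac{2 \ln N}{n_i}}$$
• $\bar{x}_i$ = average historical reward of machine i • N = total number of selections • $n_i$ = number of times machine i has been selected
[Figure: an interval of half-width δ around $\bar{x}_i$ that contains E(x) with probability > α] 50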

Monte Carlo Tree Search (Multi-armed) Bandit Problem: each slot machine (one-armed bandit) i gets a score $UCB_i = \bar{x}_i + \sqrt{2 \ln N / n_i}$ • $\bar{x}_i$ = average historical reward of machine i • N = total number of selections • $n_i$ = number of times machine i has been selected 51

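Monte Carlo Tree Search: $UCB_i = \bar{x}_i + \sqrt{2 \ln N / n_i}$
• For machine i, every selection updates $\bar{x}_i$ and $n_i$.
• If all machines have been selected the same number of times, the one with the highest $\bar{x}_i$ is chosen.
• As one machine gets picked many times:
• it means its average historical reward $\bar{x}_i$ is high: Exploit
• the right-hand term $\delta = \sqrt{2 \ln N / n_i}$ gradually shrinks
• until another machine's UCB exceeds this one's, at which point we switch to that machine: Explore 52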
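The exploit/explore behavior can be sketched as a small simulation. This assumes the standard UCB1 form $\bar{x}_i + \sqrt{2 \ln N / n_i}$, Bernoulli machines with made-up payout probabilities, and one initial pull of every machine; the function names are illustrative.

```python
import math
import random


def ucb1_choose(means, counts, total):
    """Pick the machine with the largest UCB_i = mean_i + sqrt(2 ln N / n_i)."""
    for i, n in enumerate(counts):
        if n == 0:  # try every machine once before using the formula
            return i
    scores = [m + math.sqrt(2 * math.log(total) / n) for m, n in zip(means, counts)]
    return scores.index(max(scores))


def run_bandit(payout_probs, rounds, seed=0):
    """Play `rounds` pulls on Bernoulli machines; return how often each was chosen."""
    rng = random.Random(seed)
    k = len(payout_probs)
    means, counts = [0.0] * k, [0] * k
    for t in range(rounds):
        i = ucb1_choose(means, counts, t)  # t = total pulls so far (N)
        reward = 1.0 if rng.random() < payout_probs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # running average
    return counts


counts = run_bandit([0.2, 0.5, 0.8], rounds=2000)
```

Machines that look bad are still revisited occasionally, because their bonus term grows with ln N while their own count stays small.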

Monte Carlo Tree Search: at the root, choose the Action by UCB, $UCB_i = \bar{x}_i + \sqrt{2 \ln N / n_i}$ 53

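Monte Carlo Tree Search: Build up a tree and do tree search + Upper Confidence Bound $\bar{x}_i + \sqrt{2 \ln N / n_i}$ = UCT. At the root, choose the Action by UCB. 54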

UCT: Build up a tree and do tree search; UCB becomes Q(s, a) + u(s, a)
• UCB update: Q(s, a) + u(s, a), where Q(s, a) draws on the value network and the reward after simulation, and u(s, a) draws on the policy network $h(\vec{s})$ 55
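A minimal sketch of selecting a move by Q(s, a) + u(s, a). It assumes the bonus form described in the AlphaGo paper, u(s, a) = c_puct · P(s, a) · sqrt(Σ N) / (1 + N(s, a)), where P is the policy network's prior; the constant `c_puct = 1.0`, the function name, and the sample statistics are illustrative assumptions.

```python
import math


def select_action(stats, c_puct=1.0):
    """Return the action maximizing Q(s, a) + u(s, a).

    `stats` maps action -> (Q, N, P): mean value, visit count, prior probability.
    """
    total_visits = sum(n for _, n, _ in stats.values())

    def score(item):
        q, n, p = item[1]
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)  # exploration bonus
        return q + u

    return max(stats.items(), key=score)[0]


stats = {
    "A": (0.50, 90, 0.30),  # well explored, slightly better mean value
    "B": (0.40, 10, 0.60),  # rarely visited, but favored by the prior
}
best = select_action(stats)
```

With these numbers the rarely visited move B is selected; setting `c_puct=0.0` removes the bonus and the choice falls back to the higher Q of move A.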

Monte Carlo Tree Search for 56 《Mastering the game of
Go with deep neural networks and tree search》 probability

Monte Carlo Tree Search 2016-03-13, Game 4: the "divine move", Lee Sedol's move 78. Netflix documentary 《AlphaGo 世紀對決》 https://www.101weiqi.com/chessbook/chess/140037/ 57

The infinite-branches problem: the number of branches is almost infinite, so pruning is essential! 58

Value Network: we need a value function v(s) to stop the game early! (Q(s, a) draws on the value network and the reward after simulation.) 59

Value Network & Self-Play 60

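Value Network: if, 100 moves into a game, v(s) can already judge whether the game will end in a win or a loss (r), there is no need to play the game to the end. The simulation can be stopped right there and the result backpropagated to update the node values in the Monte Carlo tree, that is, Q(s, a). 61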

Value Network 62 《Mastering the game of Go with deep
neural networks and tree search》

Value Network: the problem, however, is that there are too few existing human game records to build this v(s) → Self-play 63

Value Network – Self play: Policy Network $h(\vec{s})$ → N data → Policy Network $h_{-1}(\vec{s})$ → N data → Policy Network $h_{-2}(\vec{s})$ → N data → …… → Policy Network $h_{-n}(\vec{s})$ → N data 64

Value Network: initially, the value network is copied from the policy network. 《Mastering the game of Go with deep neural networks and tree search》 65

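Value Network – Self play (generate 30 million games)
Result: $h_{-n}(\vec{s})$ wins 80% of its games against $h(\vec{s})$.
Problem: using $h_{-n}(\vec{s})$ to improve MCTS, through Q(s, a) + u(s, a), actually makes the play weaker.
Observation: the moves of $h_{-n}(\vec{s})$ are too concentrated, so it cannot Explore. 66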

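Value Network – Self play (the strategy Dr. Huang finally adopted when training v(s))
1. First play L moves with $h(\vec{s})$.
2. At move L+1, play one move completely at random.
3. Only then play the game to the end with $h_{-n}(\vec{s})$. 67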
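The three steps can be sketched on a toy game (players alternately take 1 or 2 stones; whoever takes the last stone wins), used here only to keep the example small. The two policies are hypothetical stand-ins for the supervised and self-play policy networks, and recording the position right after the random move together with the final result z is an assumption about what the training sample looks like.

```python
import random


def legal_moves(stones):
    """Toy game: take 1 or 2 stones; whoever takes the last stone wins."""
    return [m for m in (1, 2) if m <= stones]


def sl_policy(stones, rng):
    """Stand-in for h(s): here just a uniformly random legal move."""
    return rng.choice(legal_moves(stones))


def rl_policy(stones, rng):
    """Stand-in for the self-play policy: leave a multiple of 3 when possible."""
    for m in legal_moves(stones):
        if (stones - m) % 3 == 0:
            return m
    return rng.choice(legal_moves(stones))


def generate_sample(stones, L, rng):
    """Return (position, player_to_move, z) for one self-play game, or None.

    z is +1 if the player to move at the recorded position wins, else -1.
    """
    player = 0
    for _ in range(L):                          # step 1: L moves with h(s)
        if stones == 0:
            return None                         # game ended too early
        stones -= sl_policy(stones, rng)
        player ^= 1
    if stones == 0:
        return None
    stones -= rng.choice(legal_moves(stones))   # step 2: one fully random move
    player ^= 1
    if stones == 0:
        return None
    recorded, recorded_player = stones, player
    while stones > 0:                           # step 3: finish with the self-play policy
        stones -= rl_policy(stones, rng)
        last_mover = player
        player ^= 1
    z = 1 if last_mover == recorded_player else -1
    return recorded, recorded_player, z


rng = random.Random(0)
samples = [generate_sample(21, L=rng.randrange(0, 5), rng=rng) for _ in range(100)]
samples = [s for s in samples if s is not None]
```

The random move at step L+1 spreads the recorded positions over states that the concentrated self-play policy would rarely reach on its own.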

Value network Q-value in MCTS 68

v(s’) Q(s, a) [λ = 0] Q(s, a) [λ =
1] 69
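The two Q(s, a) panels correspond to the two extremes of a mixing parameter λ. In the AlphaGo paper a leaf is evaluated as V = (1 − λ) · v(s') + λ · z, where z is the rollout result. A minimal sketch, with an illustrative function name and made-up numbers:

```python
def leaf_value(v_network, z_rollout, lam):
    """Mix the value-network estimate with the rollout result: (1 - lam) * v + lam * z."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return (1.0 - lam) * v_network + lam * z_rollout


only_network = leaf_value(0.3, 1.0, lam=0.0)  # λ = 0: value network only
only_rollout = leaf_value(0.3, 1.0, lam=1.0)  # λ = 1: rollout result only
mixed = leaf_value(0.3, 1.0, lam=0.5)         # equal weighting of both
```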

Panels: policy network output $h_{-n}(\vec{s})$; the percentage of simulations in which each action was selected; the real situation 70
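A minimal sketch of turning visit counts from the simulations into those percentages and picking the most visited move, which is how the AlphaGo paper describes choosing the move to play. The move labels, counts, and function names are made up for illustration.

```python
def visit_percentages(visit_counts):
    """Convert raw visit counts at the root into percentages of all simulations."""
    total = sum(visit_counts.values())
    if total == 0:
        raise ValueError("no simulations recorded")
    return {move: 100.0 * n / total for move, n in visit_counts.items()}


def most_visited(visit_counts):
    """Return the move that was selected most often during the simulations."""
    return max(visit_counts, key=visit_counts.get)


counts = {"D4": 120, "Q16": 60, "K10": 20}
percentages = visit_percentages(counts)
chosen = most_visited(counts)
```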

MCTS Policy Network Value Network Monte Carlo Tree Search (Rollout)
Self play 71

Human data (30,000,000; input 19 x 19 x 48) → training → Policy Network $h(\vec{s})$ → (Initial Copy) → Value Network v(s) …… Self-Play (generate 30 million games, win/lost) → training. MCTS: (Take action) Q(s, a) + u(s, a) ⇒ (Early stop) v(s) 72

AlphaGo ZERO 73

黃士傑 (Aja Huang) 2017 -《Mastering the Game of Go without
Human Knowledge》 https://www.facebook.com/aja.huang ZERO 74

https://www.inside.com.tw/2017/10/19/alphago-zero https://www.nature.com/articles/nature24270 75

ZERO: Policy Network + Value Network → ResNet 《Deep Residual Learning for Image Recognition》 76

One Network • ResNet • 20-40 residual blocks • Batch Normalization • Rectifier non-linearities • Input: only black/white info • 19 x 19 x 17 • Output: (p, v) • p = action probabilities • v = odds of winning for the current player • Loss: ……
Self-Play (generate 29 million games) → ZERO training on (reward, ) and (p, v) → action/eval → MCTS: (Take action) Q(s, a) + u(s, a) ⇒ (Early stop) v 77

ZERO 《Mastering the Game of Go without Human Knowledge》 78

Input features: 19 x 19 x 48 in 《Mastering the game of Go with deep neural networks and tree search》, compared with 《Mastering the Game of Go without Human Knowledge》 79

ZERO hardware comparison (bar chart in GPU equivalents, axis 0 to 800): AlphaGo Fan: 176 GPUs; AlphaGo Lee: 48 TPUs ≈ 720 GPUs; AlphaGo Zero: 4 TPUs. In Google Cloud, 1 TPU ≈ 15 ~ 30 GPUs. 《Mastering the Game of Go without Human Knowledge》 81

• Policy Network: SL, CNN
• MCTS: Build tree, UCB, Q(s, a), u(s, a)
• Value Network: CNN, Early stop
• Self-Play: Generate data, Train network, Update MCTS
• ZERO: ResNet (CNN, (p, v)); Result (Insights; GPU, TPU) 82

References
• 深入浅出看懂AlphaGo如何下棋
• 深入浅出看懂AlphaGo元
• 《電腦圍棋打劫的策略》
• 《Mastering the game of Go with deep neural networks and tree search》
• 《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》
• 《Move Evaluation in Go Using Deep Convolutional Neural Networks》
• 《Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search》
• 《The Monte Carlo Method》
• 《Algorithms for the multi-armed bandit problem》
• 《Mastering the Game of Go without Human Knowledge》
• 《Deep Residual Learning for Image Recognition》
• 賭徒的人工智慧1：吃角子老虎 (Bandit) 問題
• Monte Carlo Tree Search – beginners guide
• 构建自己的AlphaGo
• 《New Exponential Bounds and Approximations for the Computation of Error Probability in Fading Channels》 84

End 85

Backup 86

Prerequisite
(Reinforcement Learning) • Bellman equation: V(s), Q(s,a) • Epsilon-greedy • TD, Q-Learning • Policy Gradient • DQN, Actor Critic
(Deep Learning) • CNN • Filter, Pooling, Flatten, FC • ReLU • Adam • Batch Normalization • VGG-19, ResNet
(Python: implement) • numpy / pandas • TensorFlow / PyTorch • Keras • GPU / TPU • Google Cloud 87

Monte Carlo Tree Search ε-greedy ε-first 88

Monte Carlo Tree Search εn-greedy 89

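Monte Carlo Tree Search: Upper Confidence Bound
$$UCB_i = \bar{x}_i + \sqrt{\frac{2 \ln N}{n_i}}$$
• $\bar{x}_i$ = average historical reward of machine i • N = total number of selections • $n_i$ = number of times machine i has been selected
[Figure: an interval of half-width δ around $\bar{x}_i$ that contains E(x) with probability > α] 90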

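Monte Carlo Tree Search: Upper Confidence Bound
$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - E(X)\right| < \delta\right) > \alpha \quad (1)$$
$$\Rightarrow P\left(-\delta < \frac{1}{n}\sum_{i=1}^{n} X_i - E(X) < \delta\right) > \alpha \quad (2)$$
(Recall) Central Limit Theorem: $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = Z \sim N(0, 1)$
$$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{\sqrt{n}\,(\bar{X} - \mu)}{\sigma} = \frac{n(\bar{X} - \mu)}{\sqrt{n}\,\sigma} = \frac{n\bar{X} - n\mu}{\sqrt{n}\,\sigma} = \frac{\sum_{i=1}^{n} X_i - nE(X)}{\sqrt{n}\,\sigma} \sim N(0, 1)$$
Substitute back into (2). 91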

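Monte Carlo Tree Search: Upper Confidence Bound
$$P\left(-\delta < \frac{1}{n}\sum_{i=1}^{n} X_i - E(X) < \delta\right) > \alpha \quad (2)$$
With the standardized variable, which follows N(0, 1), and δ measured in standardized units:
$$P\left(-\delta < \frac{\sum_{i=1}^{n} X_i - nE(X)}{\sqrt{n}\,\sigma} < \delta\right) = \int_{-\delta}^{\delta} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = \operatorname{erf}\left(\frac{\delta}{\sqrt{2}}\right) > \alpha \quad (3)$$
From [1], $\operatorname{erfc}(x) < \exp(-x^2)$, so
$$\operatorname{erf}\left(\frac{\delta}{\sqrt{2}}\right) = 1 - \operatorname{erfc}\left(\frac{\delta}{\sqrt{2}}\right) > 1 - \exp\left(-\left(\frac{\delta}{\sqrt{2}}\right)^2\right) = 1 - \exp\left(-\frac{\delta^2}{2}\right) > \alpha \quad (4)$$
[1]: 《New Exponential Bounds and Approximations for the Computation of Error Probability in Fading Channels》 92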

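Monte Carlo Tree Search: Upper Confidence Bound
$$1 - \exp\left(-\frac{\delta^2}{2}\right) > \alpha \quad (4)$$
$$\exp\left(-\frac{\delta^2}{2}\right) < 1 - \alpha \;\Rightarrow\; -\frac{\delta^2}{2} < \ln(1 - \alpha) \;\Rightarrow\; \delta^2 > -2\ln(1 - \alpha) = 2\ln\left(\frac{1}{1 - \alpha}\right) \;\Rightarrow\; \delta > \sqrt{2\ln\left(\frac{1}{1 - \alpha}\right)}$$ 93

Monte Carlo Tree Search: Upper Confidence Bound
Bigger α is better. Thus, let $\frac{1}{1 - \alpha} = N$, so the threshold becomes
$$\delta = \sqrt{2\ln\left(\frac{1}{1 - \alpha}\right)} = \sqrt{2 \ln N}$$
Undoing the standardization by $\sqrt{n_i}$ (with σ absorbed into the bound) gives the half-width of the interval around $\bar{x}_i$:
$$UCB_i = \bar{x}_i + \sqrt{\frac{2 \ln N}{n_i}}, \qquad LCB_i = \bar{x}_i - \sqrt{\frac{2 \ln N}{n_i}}$$
[Figure: an interval of half-width δ around $\bar{x}_i$ that contains E(x) with probability > α] 94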

95 《Mastering the game of Go with deep neural networks
and tree search》

ZERO: comparison of architectures, with separate ("sep") or combined ("dual") policy and value networks, built from convolutional ("conv") or residual ("res") networks. 《Mastering the Game of Go without Human Knowledge》 96

ZERO 4 (TPUs) x 40(hrs) x 6.5 (USD/hr) = 1040
(USD) 98

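Monte Carlo Tree Search, Selection: starting from the root node R, select successive child nodes down to a leaf node L. A way of choosing child nodes (given later) lets the game tree expand in the most promising direction; this is the essence of Monte Carlo tree search. Source: Wiki 99

Monte Carlo Tree Search, Expansion: unless L ends the game with a win or loss for either side, create one or more child nodes and choose one of them, node C. Source: Wiki 100

Monte Carlo Tree Search, Simulation: starting from node C, play the game out with a random policy; this is also called a playout or rollout. Source: Wiki 101

Monte Carlo Tree Search, Backpropagation: use the result of the random playout to update the information in the nodes on the path from C back to R. Source: Wiki 102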
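The four steps can be put together in a compact sketch. The game is a toy stand-in (players alternately take 1 or 2 stones and whoever takes the last stone wins), chosen only so the example stays small. The `Node` class, the function names, and the exploration constant are illustrative assumptions, and child selection uses the plain UCB rule rather than anything specific to AlphaGo.

```python
import math
import random


class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones      # stones left on the table
        self.player = player      # player to move at this node (0 or 1)
        self.parent = parent
        self.move = move          # move that led to this node
        self.children = []
        self.untried = [m for m in (1, 2) if m <= stones]
        self.wins = 0.0           # wins for the player who moved INTO this node
        self.visits = 0


def uct_child(node, c=math.sqrt(2)):
    """Selection rule: child with the largest wins/visits + c * sqrt(ln N / n)."""
    return max(
        node.children,
        key=lambda ch: ch.wins / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def rollout(stones, player, rng):
    """Simulation: random play to the end; return the player who takes the last stone."""
    if stones == 0:
        return 1 - player  # the previous mover already took the last stone
    while True:
        stones -= rng.choice([m for m in (1, 2) if m <= stones])
        if stones == 0:
            return player
        player = 1 - player


def mcts(stones, player, iterations, seed=0):
    """Run MCTS from the given position and return the most visited root move."""
    rng = random.Random(seed)
    root = Node(stones, player)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded and not terminal.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expansion: add one child C for a move that has not been tried yet.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, 1 - node.player, node, move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout starting from C.
        winner = rollout(node.stones, node.player, rng)
        # 4. Backpropagation: update every node on the path from C up to R.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.player:
                node.wins += 1
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.move


best_move = mcts(stones=4, player=0, iterations=2000)
```

With 4 stones the search settles on taking 1 stone, which leaves the opponent with 3 stones and no winning reply.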
