+ γ²r_{t+2} + γ³r_{t+3} + …
• γ = 1 (immediate and future rewards matter equally): R_t = r_t + r_{t+1} + r_{t+2} + r_{t+3} + …
• γ = 0.9 (the distant future is discounted): R_t = r_t + 0.9r_{t+1} + 0.81r_{t+2} + 0.729r_{t+3} + …
• γ = 0 (only the present matters): R_t = r_t
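As a worked example, here is a minimal Python sketch computing the discounted return R_t under the three γ settings above (the reward sequence is a made-up example, not from the slides):

```python
# Discounted return: R_t = sum_k gamma^k * r_{t+k}
def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_{t+k} over a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]            # r_t, r_{t+1}, r_{t+2}, r_{t+3} (made-up values)
print(discounted_return(rewards, 1.0))    # gamma = 1.0 -> 4.0   (all rewards weighted equally)
print(discounted_return(rewards, 0.9))    # gamma = 0.9 -> 3.349 (1 + 0 + 1.62 + 0.729)
print(discounted_return(rewards, 0.0))    # gamma = 0.0 -> 1.0   (only the immediate reward)
```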
Temporal Difference Learning: The Successor Representation, 1993
• Successor Features for Transfer in Reinforcement Learning, 2016
• Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning, 2016
• PathNet: Evolution Channels Gradient Descent in Super Neural Networks, 2017
• Playing FPS Games with Deep Reinforcement Learning, 2016
• Reinforcement Learning with Unsupervised Auxiliary Tasks, 2016
• Learning To Navigate in Complex Environments, 2017
• Learning to Act by Predicting the Future, 2017
Representation [Dayan+ 1993]
In reinforcement learning we want features that can be reused across tasks
→ learning a representation that predicts future states, and estimating the value as the inner product of that representation and a reward weight vector, works well on navigation tasks
• Successor Features for Transfer in Reinforcement Learning [Barreto+ 2016]
Applies a representation that predicts future states to deep reinforcement learning as well
→ when the environment stays the same but the reward structure changes, only the reward weight vector needs to be re-learned, which makes learning efficient
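A rough sketch of the decomposition behind successor features (illustrative only, not the papers' code; the feature dimension and reward weight vectors below are made-up assumptions): the value factors as an inner product ψ(s,a)·w, so only w has to change when the reward changes.

```python
import numpy as np

# Toy successor-feature decomposition: Q(s, a) = psi(s, a) . w,
# where psi predicts discounted future features and w are reward weights.
n_features = 4
psi = np.random.rand(n_features)            # successor features for some (s, a); normally learned by TD
w_task_a = np.array([1.0, 0.0, 0.0, 0.0])   # reward weights for task A (made-up)
w_task_b = np.array([0.0, 0.0, 1.0, 0.0])   # new reward structure, same environment

q_task_a = psi @ w_task_a   # value under task A
q_task_b = psi @ w_task_b   # transferred value: only w changed, psi is reused
```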
2016]
We also want the agent to learn knowledge that is useful for the game
→ sharing the DQN's CNN and training it with supervised learning on game information (e.g., item counts) yields features that are useful for playing
• Learning to Act by Predicting the Future [Dosovitskiy+ 2017]
→ additionally trains on future game measurements together with a goal (e.g., collecting ammo)
Took first place by a large margin in the VizDoom AI Competition
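A minimal PyTorch-style sketch of the shared-CNN-plus-auxiliary-head idea; the layer sizes, LazyLinear heads, and "game feature" targets are illustrative assumptions, not the papers' exact architectures:

```python
import torch.nn as nn

class DQNWithAuxHead(nn.Module):
    """Shared CNN torso feeding both a Q-value head (RL loss) and an auxiliary
    head trained with supervision on game information (e.g., item counts).
    Layer sizes are illustrative, not taken from the paper."""
    def __init__(self, n_actions, n_game_features):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.q_head = nn.LazyLinear(n_actions)          # Q-values for the RL loss
        self.aux_head = nn.LazyLinear(n_game_features)  # supervised game-feature loss

    def forward(self, frames):
        h = self.torso(frames)
        return self.q_head(h), self.aux_head(h)
```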
Hippocampal Contributions to Control: The Third Way [Lengyel+ 2007]
In reinforcement learning, early learning is inefficient because the targets are computed from estimated values (bootstrapping)
→ argues that episodic control has an advantage in the early phase of learning
Showed that storing action sequences that led to reward and replaying those sequences can improve performance early in learning
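A minimal sketch of the tabular episodic-control idea (illustrative, not the papers' implementation): remember the best discounted return obtained from each (state, action) pair and act greedily with respect to that memory, instead of relying on a bootstrapped estimate.

```python
class EpisodicMemory:
    """Store the highest discounted return ever observed for each (state, action)."""
    def __init__(self, n_actions, gamma=0.99):
        self.n_actions = n_actions
        self.gamma = gamma
        self.best_return = {}   # (state, action) -> best return seen so far

    def update(self, episode):
        """episode: list of (state, action, reward), replayed backwards at episode end."""
        g = 0.0
        for state, action, reward in reversed(episode):
            g = reward + self.gamma * g
            key = (state, action)
            self.best_return[key] = max(self.best_return.get(key, float("-inf")), g)

    def act(self, state):
        # act greedily with respect to the remembered returns (0 for unseen pairs)
        return max(range(self.n_actions),
                   key=lambda a: self.best_return.get((state, a), 0.0))
```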
Curiosity and Boredom in Model-Building Neural Controllers, 1991
• Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models, 2015
• Action-Conditional Video Prediction using Deep Networks in Atari Games, 2015
• Unifying Count-based Exploration and Intrinsic Motivation, 2016
• A Study of Count-Based Exploration for Deep Reinforcement Learning, 2017
• Curiosity-driven Exploration by Self-supervised Prediction, 2017
We want to give an additional bonus reward when the agent reaches rarely visited states
→ a DNN estimates how often the input state has been visited
• A Study of Count-Based Exploration for Deep Reinforcement Learning [Tang+ 2017]
Wants to store visit counts more simply
→ compress the input state with a hash function and store the number of visits in a hash table
Proposed hash functions: 1. SimHash 2. AutoEncoder
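A minimal sketch of SimHash-based visit counting with a count-based bonus of the form β/√n (the projection size, β, and class name are illustrative assumptions, not the paper's code):

```python
import numpy as np

class SimHashCounter:
    """Hash states to short binary codes via a fixed random projection and
    count visits per code; return an exploration bonus beta / sqrt(count)."""
    def __init__(self, state_dim, n_bits=32, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((n_bits, state_dim))  # fixed random projection
        self.counts = {}
        self.beta = beta

    def bonus(self, state):
        code = tuple((self.A @ state > 0).astype(int))      # SimHash code of the state
        self.counts[code] = self.counts.get(code, 0) + 1
        return self.beta / np.sqrt(self.counts[code])

# Usage: add the bonus to the environment reward at each step.
counter = SimHashCounter(state_dim=8)
r_bonus = counter.bonus(np.random.rand(8))
```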
Environment: An evaluation platform for general agents." J. Artif. Intell. Res. (JAIR) 47 (2013): 253-279.
• Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
• Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
• Lin, Long-Ji. Reinforcement learning for robots using neural networks. No. CMU-CS-93-103. Carnegie Mellon University, Pittsburgh, PA, School of Computer Science, 1993.
• Nair, Arun, et al. "Massively parallel methods for deep reinforcement learning." arXiv preprint arXiv:1507.04296 (2015).
• Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.
• Babaeizadeh, Mohammad, et al. "Reinforcement learning through asynchronous advantage actor-critic on a GPU." (2016).
function approximation for reinforcement learning." Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ: Lawrence Erlbaum, 1993.
• Hasselt, Hado V. "Double Q-learning." Advances in Neural Information Processing Systems. 2010.
• Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep reinforcement learning with double Q-learning." AAAI. 2016.
• Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).
• Schaul, Tom, et al. "Prioritized experience replay." arXiv preprint arXiv:1511.05952 (2015).
• Li, Yuxi, and Dale Schuurmans. "MapReduce for parallel reinforcement learning." EWRL. 2011.
in reinforcement learning." arXiv preprint arXiv:1606.05312 (2016).
• Parisotto, Emilio, Jimmy Lei Ba, and Ruslan Salakhutdinov. "Actor-mimic: Deep multitask and transfer reinforcement learning." arXiv preprint arXiv:1511.06342 (2015).
• Rusu, Andrei A., et al. "Progressive neural networks." arXiv preprint arXiv:1606.04671 (2016).
• Fernando, Chrisantha, et al. "PathNet: Evolution channels gradient descent in super neural networks." arXiv preprint arXiv:1701.08734 (2017).
• Lample, Guillaume, and Devendra Singh Chaplot. "Playing FPS games with deep reinforcement learning." AAAI. 2017.
• Dosovitskiy, Alexey, and Vladlen Koltun. "Learning to act by predicting the future." arXiv preprint arXiv:1611.01779 (2016).
The successor representation." Neural Computation 5.4 (1993): 613-624.
• Jaderberg, Max, et al. "Reinforcement learning with unsupervised auxiliary tasks." arXiv preprint arXiv:1611.05397 (2016).
• Mirowski, Piotr, et al. "Learning to navigate in complex environments." arXiv preprint arXiv:1611.03673 (2016).
• Lengyel, Máté, and Peter Dayan. "Hippocampal contributions to control: the third way." Advances in Neural Information Processing Systems. 2008.
• Blundell, Charles, et al. "Model-free episodic control." arXiv preprint arXiv:1606.04460 (2016).
• Pritzel, Alexander, et al. "Neural episodic control." arXiv preprint arXiv:1703.01988 (2017).
• Schmidhuber, Jürgen. "A possibility for implementing curiosity and boredom in model-building neural controllers." From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior (SAB90). 1991.
"Incentivizing exploration in reinforcement learning with deep predictive models." arXiv preprint arXiv:1507.00814 (2015). • Oh, Junhyuk, et al. "Action-conditional video prediction using deep networks in atari games." Advances in Neural Information Processing Systems. 2015. • Bellemare, Marc, et al. "Unifying count-based exploration and intrinsic motivation." Advances in Neural Information Processing Systems. 2016. • Tang, Haoran, et al. "# Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning." arXiv preprint arXiv:1611.04717 (2016). • Pathak, Deepak, et al. "Curiosity-driven exploration by self-supervised prediction." arXiv preprint arXiv:1705.05363 (2017). 150
Reinforcement Learning, 2016
• Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, 2016
• Universal Value Function Approximators, 2015
• Subgoal Discovery for Hierarchical Reinforcement Learning Using Learned Policies, 2003
• Deep Successor Reinforcement Learning, 2016
• Beating Atari with Natural Language Guided Reinforcement Learning, 2017
• Micro-Objective Learning: Accelerating Deep Reinforcement Learning through the Discovery of Continuous Subgoals, 2017
Action Gap: New Operators for Reinforcement Learning, 2015
• Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening, 2016