• Each instance shares the same state/action spaces and reward function but differs in its state-transition dynamics.
• Policy reuse [Fernandez and Veloso, AAMAS'06; Zheng et al., NeurIPS'18; etc.] and option frameworks [Sutton et al., Art. Intel.'98; Bacon et al., AAAI'17; etc.] cannot be adopted, since they assume the policies share the same transition dynamics.
• Only the source policies are shared with the target agent.
• No prior knowledge about the source policies or the source environmental dynamics is accessible.
• Existing transfer RL or meta-RL methods that require access to the source environmental dynamics cannot be adopted [Lazaric et al., ICML'08; Chen et al., NeurIPS'18; Yu et al., ICLR'19; etc.] (see the sketch after the references below).

[Fernandez and Veloso, AAMAS'06] "Probabilistic Policy Reuse in a Reinforcement Learning Agent", AAMAS'06
[Zheng et al., NeurIPS'18] "A Deep Bayesian Policy Reuse Approach against Non-Stationary Agents", NeurIPS'18
[Sutton et al., Art. Intel.'98] "Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning", Artificial Intelligence'98
[Bacon et al., AAAI'17] "The Option-Critic Architecture", AAAI'17
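The sketch below illustrates the problem setting only; the class and function names (`BlackBoxSourcePolicy`, `TargetAgent`, `act`) are hypothetical and not from any paper's API. It shows source policies exposed to the target agent purely as black-box state-to-action mappings, with no access to the source instances' transition dynamics.

```python
import numpy as np

class BlackBoxSourcePolicy:
    """A source policy shared with the target agent.

    The target agent can only query actions; it sees neither the source
    instance's transition dynamics nor how the policy was trained.
    """

    def __init__(self, policy_fn):
        self._policy_fn = policy_fn

    def act(self, state):
        return self._policy_fn(state)


class TargetAgent:
    """Target agent that may consult source policies while learning its own."""

    def __init__(self, source_policies, own_policy_fn):
        self.source_policies = source_policies  # only policies, no dynamics models
        self.own_policy_fn = own_policy_fn

    def act(self, state, use_source=None):
        if use_source is not None:
            # Reuse an action suggested by a source policy, executed
            # under the target instance's (different) dynamics.
            return self.source_policies[use_source].act(state)
        return self.own_policy_fn(state)


if __name__ == "__main__":
    # Two source policies over a shared 4-dimensional state space and 2 actions.
    sources = [
        BlackBoxSourcePolicy(lambda s: int(s[0] > 0.0)),
        BlackBoxSourcePolicy(lambda s: int(s.sum() > 0.0)),
    ]
    agent = TargetAgent(sources, own_policy_fn=lambda s: np.random.randint(2))

    state = np.array([0.3, -0.1, 0.0, 0.2])
    print(agent.act(state, use_source=0))  # action suggested by source policy 0
    print(agent.act(state))                # action from the target's own policy
```

Because the source environments' dynamics are never exposed, any transfer method built on this interface must decide how to exploit the source policies' action suggestions alone.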