Estimate the reward for the task you want to achieve using AI, without hand-defining a reward function and without preparing demonstration trajectories!

EUREKA (published as a conference paper at ICLR 2024)
Figure 2: EUREKA takes unmodified environment source code and a language task description as context to zero-shot generate executable reward functions from a coding LLM. Then, it iterates between reward sampling, GPU-accelerated reward evaluation, and reward reflection to progressively improve its reward outputs.
Excerpt from the introduction: "... domain expertise to construct task prompts or learn only simple skills, leaving a substantial gap in achieving human-level dexterity (Yu et al., 2023; Brohan et al., 2023). On the other hand, reinforcement learning (RL) has achieved impressive results in dexterity (Andrychowicz et al., 2020; Handa et al., 2023) as well as many other domains, if the human designers can carefully construct reward functions that accurately codify and provide learning signals ..."

e.g. Eureka: optimizing the reward function with an LLM
https://eureka-research.github.io

RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences
Figure 1 diagram components: preference predictor, fallible teacher, noisy preferences, denoising discriminator, reward learning from denoised preferences; Phase 1: pre-training for agent and reward model (replay buffer, warm start with intrinsic rewards); Phase 2: online training for reward model (agent, env, intrinsic reward).
Figure 1: Overview of RIME. In the pre-training phase, the reward model r̂_ω is warm-started with intrinsic rewards r_int to facilitate a smooth transition to the online training phase. Post pre-training, the policy, Q-network, and reward model r̂_ω are all inherited as the initial configuration for online training. During online training, a denoising discriminator screens denoised preferences for robust reward learning. This discriminator employs a dynamic lower bound on the KL divergence between predicted preferences P_ω and annotated preference labels ỹ to filter trustworthy samples D_t, and an upper bound to flip highly unreliable labels D_f.
Excerpt from the introduction: "... [KL] divergence between predicted and annotated preference labels to filter samples. Further, to mitigate the accumulated error caused by incorrect filtration, we propose to warm start the reward model during the pre-training phase ..." The paper situates itself in preference-based learning for reinforcement learning (Christiano et al., 2017; Ibarz et al., 2018; Hejna III & Sadigh) and related work such as Lee et al. (2023); in the context of RL, Christiano et al. (2017) proposed a comprehensive framework for PbRL.

e.g. RIME: estimating the reward from a teacher's comparisons and choices
https://proceedings.mlr.press/v235/cheng24k.html
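
The Eureka-style outer loop described above (sample candidate rewards from an LLM, evaluate them by training RL policies, reflect, repeat) can be summarized in pseudocode. The sketch below is not the official implementation: `query_llm`, `train_policy`, and `evaluate_policy` are hypothetical stubs standing in for the coding LLM, the GPU-accelerated RL trainer, and the task fitness metric.

```python
# Minimal sketch of a Eureka-style reward-generation loop (assumed stubs, not
# the official implementation).

def query_llm(prompt: str, n_samples: int) -> list[str]:
    """Return n_samples candidate reward functions as Python source strings."""
    return ["def compute_reward(state, action):\n    return 0.0"] * n_samples  # stub

def train_policy(reward_src: str):
    """Train an RL policy under the candidate reward; return the policy."""
    return None  # stub: plug in an RL trainer here

def evaluate_policy(policy) -> tuple[float, str]:
    """Return (task fitness score, textual training statistics)."""
    return 0.0, "stub statistics"  # stub: plug in the task metric here

def eureka_loop(env_source: str, task_desc: str, iters: int = 5, k: int = 16) -> str:
    best_src, best_score, reflection = "", float("-inf"), ""
    for _ in range(iters):
        # The prompt contains the raw environment code, the task description,
        # and textual feedback on the best reward so far ("reward reflection").
        prompt = (
            f"Environment source code:\n{env_source}\n"
            f"Task: {task_desc}\n"
            f"Previous best reward and feedback:\n{best_src}\n{reflection}\n"
            "Write an executable Python function compute_reward(state, action)."
        )
        for candidate in query_llm(prompt, n_samples=k):    # reward sampling
            try:
                policy = train_policy(candidate)            # reward evaluation via RL
            except Exception:
                continue                                    # skip non-executable rewards
            score, stats = evaluate_policy(policy)
            if score > best_score:
                best_src, best_score = candidate, score
                reflection = f"Fitness {score:.3f}; training stats: {stats}"
    return best_src
```

In the actual system the candidate rewards are evaluated with GPU-accelerated RL training (per the figure caption above), so many candidates can be scored per iteration.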
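
RIME's denoising step can likewise be sketched. This is a simplified version under assumed interfaces: the annotated labels and predicted preferences are (N, 2) probability arrays, and the thresholds `tau_lower` / `tau_upper` are supplied by the caller, whereas the paper schedules the lower bound dynamically during training.

```python
# Sketch of a RIME-style denoising discriminator (assumed interface, not the
# paper's exact formulas): samples whose KL divergence between the annotated
# label and the predicted preference is below tau_lower are kept as
# trustworthy (D_t), those above tau_upper have their labels flipped (D_f),
# and the rest are discarded.
import numpy as np

def kl_label_vs_pred(label: np.ndarray, pred: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-sample KL(label || pred); label and pred are (N, 2) preference distributions."""
    label = np.clip(label, eps, 1.0)
    pred = np.clip(pred, eps, 1.0)
    return np.sum(label * (np.log(label) - np.log(pred)), axis=1)

def denoise_preferences(labels: np.ndarray, preds: np.ndarray,
                        tau_lower: float, tau_upper: float):
    """Split preference data into a trusted set and a label-flipped set."""
    kl = kl_label_vs_pred(labels, preds)
    trusted_idx = np.where(kl < tau_lower)[0]        # D_t: likely clean labels
    flipped_idx = np.where(kl > tau_upper)[0]        # D_f: likely corrupted labels
    flipped_labels = labels[flipped_idx][:, ::-1]    # flip the annotated preference
    return labels[trusted_idx], trusted_idx, flipped_labels, flipped_idx

# Example: sample 0 is trusted, sample 1 is flipped, sample 2 is discarded.
labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
preds = np.array([[0.9, 0.1], [0.05, 0.95], [0.45, 0.55]])
trusted, t_idx, flipped, f_idx = denoise_preferences(labels, preds, tau_lower=0.5, tau_upper=2.0)
```

For hard labels such as (1, 0), the KL term reduces to the cross-entropy -log P of the annotated choice, so filtering by KL amounts to filtering by how surprised the current reward model is by the teacher's label.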