Estimate the reward for the task you want to achieve using AI, without hand-defining a reward function and without preparing demonstration trajectories!

EUREKA (published as a conference paper at ICLR 2024)
Figure 2: EUREKA takes unmodified environment source code and a language task description as context to zero-shot generate executable reward functions from a coding LLM. Then, it iterates between reward sampling, GPU-accelerated reward evaluation, and reward reflection to progressively improve its reward outputs.
Excerpt from the introduction: "... domain expertise to construct task prompts or learn only simple skills, leaving a substantial gap in achieving human-level dexterity (Yu et al., 2023; Brohan et al., 2023). On the other hand, reinforcement learning (RL) has achieved impressive results in dexterity (Andrychowicz et al., 2020; Handa et al., 2023) as well as many other domains, if the human designers can carefully construct reward functions that accurately codify and provide learning signals ..."

e.g. Eureka: optimizing the reward function with an LLM
https://eureka-research.github.io

RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences
Figure 1 diagram components: preference predictor, fallible teacher, noisy preferences, denoising discriminator, reward learning from denoised preferences; Phase 1: pre-training for agent and reward model (replay buffer, warm start with intrinsic rewards); Phase 2: online training for reward model (agent, env, intrinsic reward).
Figure 1: Overview of RIME. In the pre-training phase, the reward model r̂_ω is warm-started with intrinsic rewards r_int to facilitate a smooth transition to the online training phase. Post pre-training, the policy, Q-network, and reward model r̂_ω are all inherited as the initial configuration for online training. During online training, a denoising discriminator screens denoised preferences for robust reward learning. This discriminator employs a dynamic lower bound on the KL divergence between predicted preferences P_ω and annotated preference labels ỹ to filter trustworthy samples D_t, and an upper bound to flip highly unreliable labels D_f.
Excerpt from the introduction: "... [KL] divergence between predicted and annotated preference labels to filter samples. Further, to mitigate the accumulated error caused by incorrect filtration, we propose to warm start the reward model during the pre-training phase ..." The paper situates itself in preference-based learning for reinforcement learning (Christiano et al., 2017; Ibarz et al., 2018; Hejna III & Sadigh) and related work such as Lee et al. (2023); in the context of RL, Christiano et al. (2017) proposed a comprehensive framework for PbRL.

e.g. RIME: estimating the reward from a teacher's comparisons and choices
https://proceedings.mlr.press/v235/cheng24k.html
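
The Eureka-style outer loop described above (sample candidate rewards from an LLM, evaluate them by training RL policies, reflect, repeat) can be summarized in pseudocode. The sketch below is not the official implementation: `query_llm`, `train_policy`, and `evaluate_policy` are hypothetical stubs standing in for the coding LLM, the GPU-accelerated RL trainer, and the task fitness metric.

```python
# Minimal sketch of a Eureka-style reward-generation loop (assumed stubs, not
# the official implementation).

def query_llm(prompt: str, n_samples: int) -> list[str]:
    """Return n_samples candidate reward functions as Python source strings."""
    return ["def compute_reward(state, action):\n    return 0.0"] * n_samples  # stub

def train_policy(reward_src: str):
    """Train an RL policy under the candidate reward; return the policy."""
    return None  # stub: plug in an RL trainer here

def evaluate_policy(policy) -> tuple[float, str]:
    """Return (task fitness score, textual training statistics)."""
    return 0.0, "stub statistics"  # stub: plug in the task metric here

def eureka_loop(env_source: str, task_desc: str, iters: int = 5, k: int = 16) -> str:
    best_src, best_score, reflection = "", float("-inf"), ""
    for _ in range(iters):
        # The prompt contains the raw environment code, the task description,
        # and textual feedback on the best reward so far ("reward reflection").
        prompt = (
            f"Environment source code:\n{env_source}\n"
            f"Task: {task_desc}\n"
            f"Previous best reward and feedback:\n{best_src}\n{reflection}\n"
            "Write an executable Python function compute_reward(state, action)."
        )
        for candidate in query_llm(prompt, n_samples=k):    # reward sampling
            try:
                policy = train_policy(candidate)            # reward evaluation via RL
            except Exception:
                continue                                    # skip non-executable rewards
            score, stats = evaluate_policy(policy)
            if score > best_score:
                best_src, best_score = candidate, score
                reflection = f"Fitness {score:.3f}; training stats: {stats}"
    return best_src
```

In the actual system the candidate rewards are evaluated with GPU-accelerated RL training (per the figure caption above), so many candidates can be scored per iteration.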
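
RIME's denoising step can likewise be sketched. This is a simplified version under assumed interfaces: the annotated labels and predicted preferences are (N, 2) probability arrays, and the thresholds `tau_lower` / `tau_upper` are supplied by the caller, whereas the paper schedules the lower bound dynamically during training.

```python
# Sketch of a RIME-style denoising discriminator (assumed interface, not the
# paper's exact formulas): samples whose KL divergence between the annotated
# label and the predicted preference is below tau_lower are kept as
# trustworthy (D_t), those above tau_upper have their labels flipped (D_f),
# and the rest are discarded.
import numpy as np

def kl_label_vs_pred(label: np.ndarray, pred: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-sample KL(label || pred); label and pred are (N, 2) preference distributions."""
    label = np.clip(label, eps, 1.0)
    pred = np.clip(pred, eps, 1.0)
    return np.sum(label * (np.log(label) - np.log(pred)), axis=1)

def denoise_preferences(labels: np.ndarray, preds: np.ndarray,
                        tau_lower: float, tau_upper: float):
    """Split preference data into a trusted set and a label-flipped set."""
    kl = kl_label_vs_pred(labels, preds)
    trusted_idx = np.where(kl < tau_lower)[0]        # D_t: likely clean labels
    flipped_idx = np.where(kl > tau_upper)[0]        # D_f: likely corrupted labels
    flipped_labels = labels[flipped_idx][:, ::-1]    # flip the annotated preference
    return labels[trusted_idx], trusted_idx, flipped_labels, flipped_idx

# Example: sample 0 is trusted, sample 1 is flipped, sample 2 is discarded.
labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
preds = np.array([[0.9, 0.1], [0.05, 0.95], [0.45, 0.55]])
trusted, t_idx, flipped, f_idx = denoise_preferences(labels, preds, tau_lower=0.5, tau_upper=2.0)
```

For hard labels such as (1, 0), the KL term reduces to the cross-entropy -log P of the annotated choice, so filtering by KL amounts to filtering by how surprised the current reward model is by the teacher's label.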