Figure 3: The AI4Mars dataset [20] provides access to image captures of Mars' terrain with crowd-sourced annotations for four terrain classes: "regolith", "sand", "bedrock", and "large rock(s)"; terrain beyond 30 m is left unlabeled. (a) MSL NAVCAM image of Mars' landscape. (b) Crowd-sourced segmentation masks superimposed on the Martian landscape.

[...] needed for extraterrestrial applications. As a first step towards a space foundation model, we demonstrate the opportunity for FMs to mitigate data scarcity by synthetically augmenting extraterrestrial science datasets, such as AI4Mars. Specifically, we generate a multi-modal dataset comprising 150k QA tuples designed to emulate the detailed sensory reasoning required for tasks like identifying sites of scientific interest. We fine-tune an open-source Vision-Language Model (VLM) on our synthetic dataset, herein referred to as the Space-LLaVA dataset, and demonstrate the model's utility by providing language annotations on planetary observations and tasks withheld from training. Our evaluations demonstrate that: 1) existing VLMs are deficient visual reasoners in extraterrestrial applications; 2) instruction-tuning on our Space-LLaVA dataset endows a SoTA VLM with zero-shot performance gains on unseen extraterrestrial task types; 3) a small percentage, e.g., 20%, of the pre-training data is sufficient to safeguard against catastrophic forgetting; and 4) FMs can be effectively integrated into modular autonomy stacks to enable embodied high-level planning in space robotics.

2. RELATED WORK

Vision-Language Models: The advent of the Transformer [24] and derivative architectures, e.g., the Vision Transformer [25], has powered recent advances in natural language and image processing through VLMs trained on internet-scale text and image databases, e.g., Common Crawl and WebImageText [26]. Early work in vision-language modeling at scale [27] aligns latent representations of vision and language by training a vision encoder and a text encoder with a contrastive learning objective; a VLM builds on this architecture by adding a language model for open-ended visual reasoning such as VQA [28, 29, 23, 30]. In this work, we adapt LLaVA-v1.5-13B [2] to extraterrestrial robotics through fine-tuning, as this model is SoTA among open-source models on standard VQA benchmarks [18, 31].

Foundation Models in Robotics: Prior work has incorporated foundation models within the broader robot autonomy stack in various ways, ranging from planning [9], decision making [32], and semantic reasoning [7, 6] to visual reasoning [33]. However, the opportunity for foundation models in extraterrestrial robotics remains an emerging area of research. The Robot Operating System Agent [14] employs FMs to build a human- [...]

[...] applications, we develop a VQA generation pipeline based on the AI4Mars and MICD [42] datasets, supplemented by recent publications in astrophysics. Concretely, we translate AI4Mars' segmentation masks into visual context for GPT-assisted annotation of seven terrain-based, semantic tasks on Martian imagery, and, inspired by cosmosage [44], we introduce our own QA dataset reflecting scientific insights and facts captured by publications in arXiv's astrophysics category, e.g., Earth and Planetary Astrophysics, which we refer to as the SpaceScienceQA dataset.
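Each of these QA tuples is ultimately used to instruction-tune LLaVA-v1.5-13B, so a concrete way to picture the dataset is as records in LLaVA's conversation schema. The following minimal sketch assumes that schema; the identifier, image path, question, and answer below are illustrative placeholders rather than actual entries from the Space-LLaVA dataset.

import json

# A single synthetic QA tuple in the LLaVA-v1.5 instruction-tuning schema:
# an "id", an "image" path, and alternating human/gpt "conversations" turns.
# All values are hypothetical placeholders for illustration.
sample = {
    "id": "ai4mars_000123_terrain_description",
    "image": "ai4mars/msl/images/edr/NLB_000123.JPG",
    "conversations": [
        {"from": "human",
         "value": "<image>\nDescribe the landscape in view."},
        {"from": "gpt",
         "value": "Flat bedrock slabs dominate the scene, with patches of loose "
                  "sand in the foreground and no large rocks in view."},
    ],
}

# Serialize a list of such records to the JSON file consumed by fine-tuning.
with open("space_llava_sample.json", "w") as f:
    json.dump([sample], f, indent=2)

At fine-tuning time, the <image> token marks where the projected visual features are spliced into the language sequence.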
We first discuss our simple and scalable methodology to produce fine-grained sensory reasoning tasks on the AI4Mars and MICD datasets. Then, we detail our approach to synthetically generate high-quality science QA pairs for our SpaceScienceQA dataset. Our full dataset's composition, organized by prompt style and the designed fine-tuning tasks, is presented in Figure 5.

GPT-assisted Annotation: AI4Mars & MICD Datasets

(a) Terrain Description: GPT-4o annotates a candidate AI4Mars landscape with a description of the terrain in view. (b) Grain Characterization: GPT-4o annotates a candidate AI4Mars landscape by detailing the size and arrangement of particles.

We translate the high-quality segmentation masks afforded by the AI4Mars dataset, as shown in Figure 3b, into seven distinct semantic-reasoning tasks through GPT-assisted image annotation. These seven tasks, e.g., terrain comparison, are listed in full in Section A; they are designed to support Space-LLaVA as a tool for annotating planetary imagery, whose terrain-aware annotations may be used downstream by a specialized, task-specific ML algorithm. For each task, we design a total of ten questions that accomplish the same objective with varied prose, e.g., if the task is scene description, we may pose the question as 1) "describe the landscape in view." or 2) "what do you see in this image?", etc., so as to discourage over-fitting to a particular prompt's writing style during adaptation, i.e., fine-tuning. Before we query GPT-4o to perform, e.g., terrain comparison for a particular image, we first superimpose the appropriate terrain segmentation mask(s) on the original MSL NAVCAM image to color-code the landscape, as shown in Figure 6, creating visual context to support GPT-4o's analysis. Through the visual context and additional language context provided in the prompt, we request the desired annotation in a format that is readily discernible zero-shot by a SoTA VLM like GPT-4o, i.e., the requested annotation does not require prior, expert knowledge to answer the question. Importantly, all visual and language context is provided only to GPT-4o to promote high-quality data curation; this same context is withheld when training Space-LLaVA, as these features are not available at inference. Further details on the specific prompt used for data curation, e.g., the user and system message, are provided in Section A.
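To make this curation step concrete, the following minimal sketch color-codes an AI4Mars segmentation mask over its MSL NAVCAM frame and then requests a terrain description from GPT-4o. The class-id-to-color mapping, file paths, blending weight, and prompt wording are assumptions made for illustration; the exact system and user messages used for data curation are those described in Section A, not the ones shown here.

import base64, io
import numpy as np
from PIL import Image
from openai import OpenAI

# Hypothetical per-pixel class ids for the four AI4Mars terrain classes;
# the colors are arbitrary choices used only to build visual context.
CLASS_COLORS = {0: (200, 160, 60),   # regolith
                1: (230, 220, 120),  # sand
                2: (120, 120, 200),  # bedrock
                3: (200, 60, 60)}    # large rock(s)

def color_code(navcam_path: str, mask_path: str, alpha: float = 0.4) -> Image.Image:
    """Superimpose a color-coded terrain mask on the NAVCAM frame."""
    image = np.array(Image.open(navcam_path).convert("RGB"), dtype=np.float32)
    mask = np.array(Image.open(mask_path))
    overlay = image.copy()
    for class_id, color in CLASS_COLORS.items():
        overlay[mask == class_id] = color          # paint labeled pixels
    blended = (1 - alpha) * image + alpha * overlay
    return Image.fromarray(blended.astype(np.uint8))

def request_annotation(frame: Image.Image, question: str) -> str:
    """Ask GPT-4o for a terrain annotation grounded in the color-coded frame."""
    buffer = io.BytesIO()
    frame.save(buffer, format="PNG")
    b64 = base64.b64encode(buffer.getvalue()).decode()
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Terrain classes are color-coded: regolith, sand, "
                        "bedrock, large rock(s). Answer using the visual context."},
            {"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content

# Example: one of the ten phrasings for the terrain-description task.
# annotation = request_annotation(color_code("frame.png", "mask.png"),
#                                 "Describe the landscape in view.")

Only the color-coded frame and the prompt are sent to GPT-4o; as noted above, Space-LLaVA itself is trained on the original, unannotated NAVCAM image, since the mask is unavailable at inference.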
Then, with the MICD dataset, we have the inverse problem: [...]

GPT-assisted Annotation: SpaceScienceQA Dataset

[...]
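As introduced earlier, the SpaceScienceQA dataset distills scientific insights and facts from arXiv astrophysics publications into QA pairs. The sketch below illustrates one way such a pair could be generated with GPT-4o from a passage of a paper; the prompt wording, two-line output convention, and parsing are illustrative assumptions, not the exact SpaceScienceQA curation procedure.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_science_qa(passage: str) -> dict:
    """Distill one question/answer pair from a passage of an astrophysics paper."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write self-contained question/answer pairs that test "
                        "understanding of scientific facts stated in the passage. "
                        "Reply with exactly two lines: 'Q: ...' then 'A: ...'."},
            {"role": "user", "content": passage},
        ],
    )
    # Parse the assumed two-line "Q:/A:" convention into a QA record.
    lines = response.choices[0].message.content.strip().splitlines()
    return {"question": lines[0].removeprefix("Q:").strip(),
            "answer": lines[1].removeprefix("A:").strip()}

# Example usage on a (hypothetical) abstract from arXiv's Earth and
# Planetary Astrophysics listing:
# qa = generate_science_qa(open("astro_ph_ep_abstract.txt").read())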