
The 61st Nagoya CV/PRML Study Group: ECCV 2024 Paper Introduction (Mamba-ND)

Naoki Okamoto
December 20, 2024


Slides used for the ECCV 2024 paper introduction at the 61st Nagoya CV/PRML Study Group held on December 21. The talk introduces Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data, which applies Mamba, proposed in the NLP field, to computer vision.


Transcript

1. Self-introduction
Naoki Okamoto: doctoral student at Chubu University, Fujiyoshi Laboratory.
- Research theme: automatic design of training methods via hyperparameter search
- Research areas: knowledge distillation, semi-supervised learning, self-supervised learning
- Linked on the slide: an ECCV paper on knowledge distillation for ensemble learning (ecva.net) and a tutorial deck on self-supervised learning (speakerdeck.com); both URLs are garbled in this transcript
[The slide also reproduces the SSII anniversary technology map of knowledge distillation (confit.atlas.jp), a chart organizing the field from Dark Knowledge / Knowledge Distillation [Hinton+, NIPS WS] and Model Compression [Buciluǎ+, SIGKDD] through offline distillation of logits and intermediate features (FitNets [Romero+, ICLR], Attention Transfer [Zagoruyko+, ICLR], RKD [Park+, CVPR], CRD [Tian+, ICLR], DeiT [Touvron+, ICML], ...), online/mutual distillation (DML [Zhang+, CVPR], ONE [Lan+, NeurIPS]), self-distillation (BAN [Furlanello+, ICML], Be Your Own Teacher [Zhang+, ICCV]), multi-teacher and cross-task knowledge amalgamation, automated design of distillation (KTG [Minami+, ACCV], Ensemble KTG [Okamoto+, ECCV]), and dataset distillation [Wang+, arXiv]; the flattened chart text is omitted here.]
2. Transformer [Vaswani+, NeurIPS'17]
- The dominant model architecture across many kinds of multi-dimensional data: language, images, video, and more
- The attention mechanism extracts features by capturing correspondences between tokens
[Figure: self-attention visualization over the sentence "The Law will never be perfect, but its application should be just ..."; adapted from [Vaswani+, NeurIPS'17].]
3. Transformer [Vaswani+, NeurIPS'17] (cont.)
- The dominant model architecture across many kinds of multi-dimensional data: language, images, video, and more
- The attention mechanism extracts features by capturing correspondences between tokens
- Problem: the compute and memory cost of attention grows quadratically with the sequence length
[Figure: the same self-attention visualization; adapted from [Vaswani+, NeurIPS'17].]
4. Mamba [Gu and Dao, arXiv'23]
- A new model architecture in the NLP field based on state space models (SSMs)
- Builds on the Structured State Space Model (S4) [Gu+, ICLR'22] and introduces three techniques:
  1. Selective SSMs: the model adjusts its parameters dynamically according to the input
  2. A hardware-aware algorithm: realizes efficient training and inference
  3. A simple model structure: H3 × Gated MLP → the Mamba block
[Figure: H3, Gated MLP, and Mamba block diagrams, plus the selective state space model with hardware-aware state expansion (GPU SRAM/HBM); adapted from [Gu and Dao, arXiv'23].]
5. State Space Models (SSMs)
- A mathematical framework for representing the dynamic behavior of a system over time [Kalman, 1960]
- Models the relation between the input x(t) ∈ ℝ and the output y(t) ∈ ℝ at time t through a state h(t) ∈ ℝ^N
  - State equation:  h'(t) = A h(t) + B x(t)   (h'(t): time derivative of the state h(t))
  - Output equation: y(t) = C h(t) + D x(t)
- A ∈ ℝ^{N×N}: state-transition matrix, governing the time evolution of the state
- B ∈ ℝ^{N×1}: input matrix, controlling how the input affects the state
- C ∈ ℝ^{1×N}: output matrix, generating the output from the current state
- D ∈ ℝ: coefficient for the direct input-to-output path
- Long used in control engineering. Most SSMs omit the second term of the output equation (D x(t) = 0) → in deep models it can be interpreted as a skip connection.
6. Discretizing SSMs
- Handling sequence data (e.g., word sequences) requires discretizing the continuous formulation
- Continuous time is split into K discrete intervals of equal integration area (zero-order hold, ZOH): the function value is assumed constant over each interval Δ = [t_{k−1}, t_k]
- SSMs after ZOH discretization (k: discrete time step):
  - State equation:  h_k = Ā h_{k−1} + B̄ x_k
  - Output equation: y_k = C̄ h_k
  - where Ā = exp(ΔA),  B̄ = (ΔA)^{−1} (exp(ΔA) − I) · ΔB
- The discretized SSM is a recurrence with an RNN-like structure
→ inference is more efficient than in Transformer-based models, which compute attention over all inputs
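The ZOH discretization and the recurrent scan above can be sketched in a few lines of numpy. As in S4/Mamba, A is taken diagonal so that exp(ΔA) is elementwise; all sizes and values here are illustrative, not the paper's.

```python
import numpy as np

N, L = 4, 10                        # state size, sequence length
rng = np.random.default_rng(0)
a = -np.abs(rng.standard_normal(N)) # diagonal of A (a < 0: stable)
B = rng.standard_normal(N)
C = rng.standard_normal(N)
delta = 0.1                         # step size Δ

A_bar = np.exp(delta * a)           # Ā = exp(ΔA), elementwise for diagonal A
B_bar = (A_bar - 1.0) / a * B       # B̄ = (ΔA)^-1 (exp(ΔA) - I) ΔB, diagonal case

x = rng.standard_normal(L)          # a 1D input sequence
h = np.zeros(N)
ys = []
for k in range(L):                  # h_k = Ā h_{k-1} + B̄ x_k ;  y_k = C h_k
    h = A_bar * h + B_bar * x[k]
    ys.append(C @ h)
```

Because a is negative, Ā lies in (0, 1) and the recurrence stays stable; the loop is exactly the RNN-like scan mentioned on the slide.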
7. Selective SSMs
- Introduces a selection mechanism into SSMs that adjusts the parameters dynamically according to the input
- The outputs of linear layers applied to the input x_t are used as the SSM parameters B_t, C_t, Δ_t
→ controls what information is retained depending on the input x_t
[Figure: selective state space model with hardware-aware state expansion (D channels, latent state N = 4, GPU SRAM/HBM); adapted from [Gu and Dao, arXiv'23].]
- Under the assumption N = 1, A = −1, B = 1 and g_t = σ(Linear(x_t)), the recurrence reduces to the gated form h_t = (1 − g_t) h_{t−1} + g_t x_t.
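The gated special case at the bottom of the slide can be checked numerically. The weight w and bias b standing in for the Linear layer are made-up values; a large input opens the gate and overwrites the state, while zero inputs let it decay.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 2.0, 0.0                       # hypothetical linear-layer parameters
x = np.array([0.0, 5.0, 0.0, 0.0])    # a "remember this token" input pattern
h = 0.0
hs = []
for xt in x:
    g = sigmoid(w * xt + b)           # input-dependent gate g_t
    h = (1 - g) * h + g * xt          # large g: overwrite state; small g: keep it
    hs.append(h)
```

After the spike at step 1 the state jumps to nearly 5, then halves on each zero input (g = 0.5 there), illustrating input-dependent retention.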
8. The effect of Mamba

Comparison of computational cost and speed (quoted from [Qu+, arXiv'24]):
The survey notes that the convolutional view lets SSMs achieve not only efficient inference but also parallel training, but that most conventional SSMs are time-invariant: their A, B, C, and Δ do not depend on the model input x. This limits context-aware modeling, leading to inferior performance on tasks such as selective copying.

Table 1. Pros and cons of three primary architectures (RNNs, Transformers, SSMs) in auto-regressive sequential modeling (L: sequence length, D: model dimension):

  Comparison           RNNs              Transformers           SSMs
  Training speed       Slow (recurrent)  Fast (parallel)        Fast (convolutional)
  Inference speed      Fast (recurrent)  Slow (quadratic-time)  Fast (recurrent)
  Complexity           O(LD^2)           O(L^2 D)               O(LD^2)
  Modeling capability  (hidden state)    (attention)            (time-invariance)

Zero-shot performance comparison (adapted from [Gu and Dao, arXiv'23]):
Table 3 of the Mamba paper reports zero-shot evaluations (Pile and LAMBADA perplexity; LAMBADA, HellaSwag, PIQA, ARC-E, ARC-C, and WinoGrande accuracy) against open-source LMs trained for up to 300B tokens: Pythia, RWKV, GPT-Neo, OPT, GPT-J, and Hybrid H3 (Mamba and Pythia use context length 2048, RWKV 1024). For every model size, Mamba is best-in-class on every evaluation and generally matches baselines at twice the model size; e.g., Mamba-130M averages 44.7 vs. 40.6 for Pythia-160M, and Mamba-1.4B averages 59.7, ahead of Pythia-2.8B (59.1) and RWKV-3B (59.6). [Full table abridged.]

→ SSMs (the base of Mamba) process sequences recurrently → inference speed and compute on par with RNNs
→ Higher performance than the Transformer at half the model size
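The complexity row of the comparison table can be made concrete with a little arithmetic. Sizes are illustrative and constant factors are ignored: attention scales as O(L²D), an SSM scan as O(LD²).

```python
# Rough operation counts for attention vs. an SSM-style scan
# (D: model width, L: sequence length; illustrative values only).
D = 1024
for L in (1_000, 10_000, 100_000):
    attn = L * L * D        # O(L^2 D): pairwise token interactions
    ssm = L * D * D         # O(L D^2): per-token state update
    print(L, attn / ssm)    # the ratio grows linearly with L
```

At L = 100,000 the attention estimate is already about 98x the SSM estimate, matching the "quadratic vs. linear in L" distinction on the slide.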
9. Mamba's expansion into the CV field
(Figure quoted from a SpeakerDeck tutorial deck on Mamba / Vision Mamba (Vim); a Dropbox Paper "Mamba paper list" is also linked; both URLs are garbled in this transcript.)

Timeline of Mamba's expansion into CV (as of August 2024), starting from Mamba [Gu and Dao, arXiv, Dec. 2023] and Mamba2 [Gu and Dao, ICML'24] in NLP:
- Backbones: Vision Mamba (Vim) [Zhu+, ICML'24], VMamba [Liu+, arXiv], Mamba-ND [Li+, ECCV'24], LocalMamba [Huang+, ECCVW], EfficientVMamba [Pei+, arXiv], PlainMamba [Yang+, BMVC], MultiScale VMamba [Shi+, NeurIPS], MambaVision [Hatamizadeh and Kautz, arXiv], GroupMamba [Shaker+, arXiv], Mamba-R [Wang+, arXiv], vHeat [Wang+, arXiv], Vision-RWKV [Duan+, arXiv]
- Analysis / visual explanation: Demystify Mamba [Han+, NeurIPS], Unified Implicit Attention Formulation [Zimerman+, arXiv], MambaLRP [Jafari+, NeurIPS]
- Self-supervised learning: Autoregressive Pretraining [Ren+, arXiv]
- Detection and segmentation: Mamba-YOLO [Wang+, arXiv], Fusion-Mamba [Dong+, arXiv], ReMamber [Yang+, ECCV'24], FER-YOLO-Mamba [Ma+, arXiv], SOAR [Verma+, arXiv]
- Multimodal: VL-Mamba [Qiao+, arXiv], ML-Mamba [Huang and Hu, arXiv], Mamba or RWKV [Yuan+, arXiv]
- BEV perception: OE-BevSeg [Sun+, arXiv], among others
(ECCV-accepted papers are placed by the date of their arXiv release.)
10. Mamba papers at ECCV 2024
- Number of papers: 8
  1. Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data (Oral Paper)
  2. ReMamber: Referring Image Segmentation with Mamba Twister
  3. VideoMamba: Spatio-Temporal Selective State Space Model
  4. VideoMamba: State Space Model for Efficient Video Understanding
  5. MambaIR: A Simple Baseline for Image Restoration with State-Space Model
  6. Motion Mamba: Efficient and Long Sequence Motion Generation
  7. ZigMa: A DiT-style Zigzag Mamba Diffusion Model
  8. MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
11. Mamba papers at ECCV 2024 (cont.)
(The same list of eight papers, shown again; Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data (Oral Paper) is the paper introduced in this talk.)
12. Mamba-ND [Li+, ECCV'24]
- Proposes Mamba-ND, which generalizes Mamba to arbitrary multi-dimensional data
  - Mamba: evaluated on text, a 1D sequence
  - Mamba-ND: evaluated on images (2D), video (3D), and scientific data
- Comprehensively evaluates the design choices for turning multi-dimensional data into a 1D sequence
[Figure: Mamba-ND architecture. A T × H × W input is patchified into an L × D sequence, passed through N stacked Mamba-ND blocks (BLK H+, H−, W+, W−, T+, T−), then unpatchified or fed to a classification head; each 1D-SSM block is Linear → Conv1D → SSM → Linear. Also shown: 2D sequence orderings W+, W−, H+, H−.]
13. Turning multi-dimensional data into a 1D sequence
- Designs combinations of sequence orderings for converting multi-dimensional data into a 1D sequence
  - Sequential feature extraction mixes information between two adjacent data points → a bidirectional or multi-directional design is needed
- Bidirectional and multi-directional information mixing are designed at the layer level and the block level
  - Layer level: a single layer extracts features along multiple sequence directions internally
  - Block level: each block extracts features along a different sequence direction
[Figure: layer-level designs (1D-SSM, Bi-SSM, N-SSM, Multi-Head-SSM blocks) and block-level arrangements (H+H−W+W−T+T−, [H+H−][W+W−][T+T−], [H+H−W+W−][T+T−]).]
14. Sequence orderings
- There are countless candidate orderings for serializing multi-dimensional data
  - For N-dimensional data of shape D_1 × D_2 × ⋯ × D_N, the sequence length is L = Π_{i=1}^{N} D_i → the total number of possible orderings is L!
- Only orderings that run along one of the N axes are used as candidates → the number of candidates becomes 2N
  - 2D data: W+, W−, H+, H−
  - 3D data: W+, W−, H+, H−, T+, T−
[Figure: the H+/H− and T+/T− scan orders over a patchified T × H × W input.]
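The 2N axis-aligned orderings for a 2D input can be sketched with numpy flattenings: row-major or column-major, forward or reversed. The mapping of the names H+/H−/W+/W− to these flattenings is an illustrative choice here, not necessarily the paper's exact convention.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)          # an H x W = 2 x 3 grid of token ids, L = 6

orderings = {
    "H+": x.flatten(order="C"),         # rows first, forward
    "H-": x.flatten(order="C")[::-1],   # rows first, reversed
    "W+": x.flatten(order="F"),         # columns first, forward
    "W-": x.flatten(order="F")[::-1],   # columns first, reversed
}
for name, seq in orderings.items():
    print(name, seq.tolist())
```

For N = 2 this yields 2N = 4 candidate sequences, versus L! = 720 arbitrary permutations of the six tokens.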
15. Layer-level designs
- A single SSM block (layer) considers multiple sequence orderings internally
- In addition to the standard SSM block that handles one ordering (1D-SSM), three block types are designed:
  - Bi-SSM block: applies SSMs in the forward (L+) and backward (L−) directions along one dimension
  - N-SSM block: applies SSMs in the forward and backward directions along every dimension
  - Multi-Head-SSM block: splits the feature vector into 2N parts and applies an SSM with a different sequence ordering to each part
[Figure: block diagrams of the 1D-SSM, Bi-SSM, N-SSM, and Multi-Head-SSM blocks.]
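A minimal sketch of the Multi-Head-SSM idea: split the D feature channels into 2N groups and scan each group under a different ordering. For brevity the "orderings" here only alternate the scan direction, and scan1d is a stand-in EMA recurrence rather than a real SSM; all names and sizes are illustrative.

```python
import numpy as np

def scan1d(seq, alpha=0.5):
    # stand-in for an SSM scan: a simple EMA recurrence over axis 0
    out, h = np.empty_like(seq), np.zeros_like(seq[0])
    for i, s in enumerate(seq):
        h = alpha * h + (1 - alpha) * s
        out[i] = h
    return out

L, D, n_dims = 16, 8, 2
x = np.random.default_rng(0).standard_normal((L, D))
heads = np.split(x, 2 * n_dims, axis=1)      # 2N groups of D/(2N) channels each

outs = []
for i, head in enumerate(heads):
    seq = head[::-1] if i % 2 else head      # alternate scan directions per head
    o = scan1d(seq)
    outs.append(o[::-1] if i % 2 else o)     # undo the reversal
y = np.concatenate(outs, axis=1)             # reassemble to an L x D output
```

Each group of channels thus sees the sequence under its own ordering, within one layer.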
16. Block-level designs
- Multiple sequence orderings are handled through how the SSM blocks are arranged; three arrangements are designed:
  - Alternating-directional (H+H−W+W−T+T−): each block handles one ordering; blocks are connected directly while alternating between the forward (+) and backward (−) directions of each dimension
  - Bidirectional ([H+H−][W+W−][T+T−]): each block handles both the forward and backward directions of one dimension (using Bi-SSM blocks)
  - Quad-directional ([H+H−W+W−][T+T−]): H and W are treated as spatial dimensions and T as the temporal dimension, handled in separate blocks
17. Final Mamba-ND model structure
- Adopts the simple design:
  - SSM block: the 1D-SSM block, a standard SSM block that handles a single ordering
  - Block arrangement: alternating-directional
- Input/output of Mamba-ND:
  - Input: the multi-dimensional data is patchified and converted into a 1D sequence
  - Output: the 1D sequence is reshaped back into the original multi-dimensional layout
[Figure: patchify → N × Mamba-ND block (BLK H+, H−, W+, W−, T+, T−) → unpatchify / classification head; 1D-SSM block = Linear → Conv1D → SSM → Linear.]
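The alternating-directional arrangement can be sketched as follows. scan1d is a stand-in for a real 1D-SSM block (a plain EMA so the example stays tiny), tokens are scalars, and the axis/direction naming is illustrative; what matters is that consecutive blocks simply cycle through the orderings H+, H−, W+, W−, T+, T−.

```python
import numpy as np
from itertools import cycle

ORDERINGS = ["H+", "H-", "W+", "W-", "T+", "T-"]
AXIS = {"T": 0, "H": 1, "W": 2}

def scan1d(seq, alpha=0.5):
    # stand-in for a 1D-SSM block: a simple EMA recurrence
    out, h = np.empty_like(seq), 0.0
    for i, s in enumerate(seq):
        h = alpha * h + (1 - alpha) * s
        out[i] = h
    return out

def mamba_nd_block(x, ordering):
    axis, sign = AXIS[ordering[0]], ordering[1]
    moved = np.moveaxis(x, axis, -1)        # chosen axis varies fastest
    seq = moved.reshape(-1)                 # flatten to one 1D sequence
    seq = seq[::-1] if sign == "-" else seq
    out = scan1d(seq)
    out = out[::-1] if sign == "-" else out
    return np.moveaxis(out.reshape(moved.shape), -1, axis)

x = np.random.default_rng(0).standard_normal((4, 8, 8))   # T x H x W tokens
for ordering, _ in zip(cycle(ORDERINGS), range(12)):      # 12 stacked blocks
    x = mamba_nd_block(x, ordering)
```

Each block is an ordinary 1D scan; the multi-directionality comes only from the cycling, which is the point of the alternating design.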
18. Performance by model structure

SSM block structure: Table 6 of the paper (top-1 accuracy on the ImageNet-1K validation set and HMDB-51):

  Design           Level        IN1K  HMDB-51
  Alt-Directional  Block-level  79.4  59.0
  Multi-Head-SSM   Layer-level  77.6  51.5
  ND-SSM           Layer-level  77.2  46.7
  1D-SSM           -            76.4  34.9
  Bi-SSM           Layer-level  74.6  32.1

From the paper's ablation (Sec. 5.5): compared with the 1D-Mamba baseline, which naively flattens the multi-dimensional input into a single sequence, the alternating-directional design gains +4.8 accuracy on ImageNet and +26.9 on HMDB-51, and most of the multi-directional designs outperform the naive and bidirectional ones.
→ The design that alternates 1D-SSM blocks (Alt-Directional) achieves the highest performance.

SSM block arrangement: Table 7(a) (top-1 accuracy on HMDB-51):

  Ordering             HMDB
  H+H-W+W-T+T-         59.0
  [H+H-W+W-T+T-]       38.3
  [H+H-][W+W-][T+T-]   49.8
  [H+H-W+W-][T+T-]     47.4

→ Handling many directions inside a single SSM block lowers performance
(the authors caution that the number of SSM blocks inside a Mamba-ND block differs across these settings)
19. Sequence factorization
- SSM compute grows linearly with sequence length (lower than the attention mechanism)
- In the standard approach, one input yields a single sequence → the sequence length depends on the length of every input dimension
- Factorization decomposes one input into multiple shorter sequences that are computed in parallel
  - More hidden states are kept per input, so memory usage increases
[Figure: 1 sequence vs. D sequences vs. D^2 sequences for a 3D input.]

Effect of factorization: Table 7(b) ('+' indicates layer factorization; the total number of layers is fixed; B: batch size, D: length of a single dimension):

  Factorization  HMDB  Mem   #Sequences
  1D+1D+1D       44.5  80GB  O(BD^2)
  2D+1D          55.8  77GB  O(BD^2)
  2D+3D          51.9  18GB  O(BD)
  3D             59.0  17GB  O(B)
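The factorization trade-off can be illustrated by comparing a single scan over the fully flattened volume (the "3D" policy) with many independent per-row scans (in the spirit of the factorized policies). scan1d is again a stand-in EMA, and the shapes are illustrative.

```python
import numpy as np

def scan1d(seq, alpha=0.5):
    # stand-in for an SSM scan: a simple EMA recurrence
    out, h = np.empty_like(seq), 0.0
    for i, s in enumerate(seq):
        h = alpha * h + (1 - alpha) * s
        out[i] = h
    return out

x = np.random.default_rng(0).standard_normal((4, 8, 8))   # T x H x W

# "3D" policy: one sequence of length L = T*H*W -> one hidden state per input
full = scan1d(x.reshape(-1)).reshape(x.shape)

# factorized: T*H independent length-W sequences -> T*H concurrent hidden states
rows = np.stack([scan1d(row) for row in x.reshape(-1, x.shape[-1])])
factored = rows.reshape(x.shape)
```

The short sequences could all run in parallel, but each carries its own hidden state, which is the memory growth shown in Table 7(b).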
20. Summary
- Mamba
  - A new model architecture in the NLP field based on state space models
  - Lower computational cost than the Transformer; comparable or better performance at half the parameter count
- Mamba-ND
  - Generalizes Mamba to arbitrary multi-dimensional data
  - Comprehensively evaluates the designs for the multi-dimensional extension → the simple design performs best
  - Achieves higher performance than the Transformer at smaller model sizes