[Journal club] PHyCLIP: ℓ1-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Semantic Machine Intelligence Lab., Keio Univ.
May 27, 2026
Transcript
PHyCLIP: ℓ1-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning (ICLR26)
Presenter: Komei Sugiura Laboratory, Keio University [name illegible]
Paper authors: Daiki Yoshikawa (1), Takashi Matsubara (1, 2); (1) Hokkaido University, (2) CyberAgent
Daiki Yoshikawa, et al. PHyCLIP: ℓ1-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning. ICLR2026.
2 Overview: PHyCLIP, an embedding into hyperbolic spaces that accounts for hierarchy and compositionality
▫ Background
  ▫ A VLM has to handle both hierarchy and compositionality
  ▫ CLIP [Radford+, ICML21] embeds into a single Euclidean space → it is hard to represent hierarchy and compositionality at the same time
  ▫ Hyperbolic space is well suited to representing hierarchy, but poorly suited to representing compositionality
▫ Proposed method: PHyCLIP
  ▫ Embeds into the ℓ1-product space of multiple hyperbolic factors
  ▫ Represents compositionality through the simultaneous activation of multiple factors
▫ Results
  ▫ Outperforms existing methods on zero-shot classification and retrieval
  ▫ Improves the representation of hierarchy and the understanding of compositionality
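3 Background (1/3): It is hard to represent hierarchy and compositionality at the same time
Two kinds of semantic structure that a VLM should handle:
▫ Hierarchy
  ▫ Linguistic concepts can be classified in a tree-like structure (e.g., WordNet [Miller, 95])
  ▫ Example: dog ⪯ mammal ⪯ animal
  ▫ The lower a concept sits in the hierarchy, the more specific it is
▫ Compositionality
  ▫ Example: "a dog in a car"
  ▫ Images and sentences are co-occurrences of multiple concepts
CLIP [Radford+, ICML21] represents each input as a single vector in a single Euclidean space
(-) Hierarchy and compositionality cannot both be represented in a single space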
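4 Background (2/3): Hyperbolic space represents hierarchy naturally
Poincaré Embeddings [Nickel+, NeurIPS17]
▫ Background
  ▫ Words and graphs contain latent hierarchies
  ▫ A low-dimensional Euclidean space cannot represent a deep hierarchy (∵ $\mathbb{R}^d$ grows polynomially, whereas a hierarchy grows exponentially)
▫ Proposal: embed into the Poincaré model
  ▫ In hyperbolic space, space expands exponentially → a hierarchy is represented naturally as a continuous tree
  ▫ The norm $\|u\|$ encodes the level in the hierarchy, and the distance $d(u, v)$ encodes similarity
▫ Results
  ▫ Embedding of large-scale taxonomies such as WordNet [Miller, 95]
  (+) Surpasses prior methods in both representation capacity and generalization performance
  (+) Keeps high accuracy even in low dimensions
[Figure: the mammals subtree of WordNet, trained in hyperbolic space ($d = 2$)]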
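5 Background (3/3): Compositionality has a Boolean-like structure
▫ Images and sentences can be expressed as co-occurrences of multiple concepts
  ▫ "a dog in a car" → {dog, car}
  ▫ "a cat and a bike" → {cat, bike}
(-) A single hyperbolic space that represents hierarchy cannot also represent the co-occurrence of multiple concepts
▫ Interpretation as a Boolean algebra
  ▫ Atomic concepts: $C = \{c_1, c_2, \dots, c_n\}$
  ▫ Composite concept: $x \subseteq C$
  ▫ Whether each atomic concept is included is captured by one bit → the distance between composite concepts $x$ and $y$ is the Hamming distance
  ▫ Example: $C = \{\text{dog}, \text{cat}, \text{car}, \text{bike}\}$, $x = \{\text{dog}, \text{car}\}$, $y = \{\text{dog}\}$, so $b(x) = (1,0,1,0)$, $b(y) = (1,0,0,0)$ and the Hamming distance is $d_{\mathrm{Ham}}(b(x), b(y)) = 1$
▫ ℓ1-product (the sum of the distances in each hyperbolic space):
$$d_1(x, y) = \sum_{i=1}^{k} d_{\mathbb{H}^d_i}\left(x^{(i)}, y^{(i)}\right)$$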
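The Hamming-distance example on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `to_bits` and `hamming` are hypothetical, and the concept list is the four-element example from the slide.

```python
# Atomic concepts from the slide's example; the order fixes the bit positions.
CONCEPTS = ["dog", "cat", "car", "bike"]

def to_bits(concept_set):
    """Encode a composite concept as one bit per atomic concept."""
    return [1 if c in concept_set else 0 for c in CONCEPTS]

def hamming(a, b):
    """Sum of per-coordinate differences, i.e. the l1 distance between bit vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))

x = to_bits({"dog", "car"})   # [1, 0, 1, 0]
y = to_bits({"dog"})          # [1, 0, 0, 0]
print(x, y, hamming(x, y))    # the two sets differ only in "car"
```

The Hamming distance is a sum of one distance per coordinate, which is the same shape as the ℓ1-product metric: one distance per hyperbolic factor, added up.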
6 Related work: Vision-Language Representation Learning
Method | Overview | Characteristics
CLIP [Radford+, ICML21] | Maps images and text into a single Euclidean space | (-) Does not explicitly handle hierarchy or compositionality
MERU [Desai+, ICML23] | Extends the CLIP embedding space to hyperbolic space | (+) Represents the tree structure of a hierarchy; (-) does not consider compositionality
HyCoCLIP [Pal+, ICLR25] | Introduces bounding-box supervision and hyperbolic entailment cones | (+) Explicitly learns object-level hierarchy; (-) handles compositionality only to a limited extent
[Figures: MERU, HyCoCLIP]
7 Proposed method (1/3): Overall picture of PHyCLIP
▫ Embeds into a space made of multiple hyperbolic factors
▫ $k$ hyperbolic spaces $\mathbb{H}^d_i$, each of dimension $d$ → $kd$ dimensions in total
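8 Proposed method (2/3): Embedding images and text into hyperbolic space
▫ Embedding into hyperbolic space
  ▫ Split the $kd$-dimensional feature into $k$ parts
  ▫ Map each part $v^{(i)}$ into a hyperbolic space: $v^{(i)} \in \mathbb{R}^d \to x^{(i)} \in \mathbb{H}^d_i$
▫ Definition of the distance (ℓ1-product metric)
$$d_1(X, Y) = \sum_{i=1}^{k} d_{\mathbb{H}^d_i}\left(x^{(i)}, y^{(i)}\right), \qquad d_{\mathrm{avg}}(X, Y) = \frac{1}{k}\, d_1(X, Y)$$
▫ Images and text cropped at the object level are used as well
  ▫ Inputs: $I$, $T$, $I^{\mathrm{box}}$, $T^{\mathrm{box}}$
  ▫ An image is more specific than its text
  ▫ The original image/text is more specific than the cropped one
▫ Entailment relations: $I \preceq T$, $I^{\mathrm{box}} \preceq T^{\mathrm{box}}$, $I \preceq I^{\mathrm{box}}$, $T \preceq T^{\mathrm{box}}$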
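As a rough sketch of the ℓ1-product metric on this slide, the following pure-Python code splits a flat $kd$-dimensional feature into $k$ chunks, lifts each chunk onto a hyperboloid with the exponential map at the origin, and sums the per-factor Lorentz distances. The function names are hypothetical, and a single shared `alpha` is assumed for all factors to keep the example short; it is not the authors' implementation.

```python
import math

def exp_map_origin(v, alpha=1.0):
    """Lift a Euclidean vector v onto the hyperboloid <x, x> = -1/alpha.

    Returns (x0, space): the time-like coordinate and the space-like part.
    """
    norm = math.sqrt(sum(c * c for c in v))
    root = math.sqrt(alpha)
    if norm == 0.0:
        return 1.0 / root, [0.0] * len(v)  # the origin of the hyperboloid
    s = root * norm
    return math.cosh(s) / root, [math.sinh(s) / s * c for c in v]

def lorentz_distance(p, q, alpha=1.0):
    """alpha^(-1/2) * arccosh(-alpha * <p, q>) with the Minkowski inner product."""
    (p0, ps), (q0, qs) = p, q
    inner = -p0 * q0 + sum(a * b for a, b in zip(ps, qs))
    # Clamp to 1 so rounding error cannot push acosh outside its domain.
    return math.acosh(max(1.0, -alpha * inner)) / math.sqrt(alpha)

def l1_product_distance(feat_x, feat_y, k, alpha=1.0):
    """Split two flat kd-dimensional features into k factors and sum the distances."""
    d = len(feat_x) // k
    total = 0.0
    for i in range(k):
        vx = feat_x[i * d:(i + 1) * d]
        vy = feat_y[i * d:(i + 1) * d]
        total += lorentz_distance(exp_map_origin(vx, alpha),
                                  exp_map_origin(vy, alpha), alpha)
    return total

# Two factors of dimension 2; only the first factor differs between the inputs.
d1 = l1_product_distance([0.3, 0.4, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], k=2)
d_avg = d1 / 2
print(d1, d_avg)
```

In this toy input only the first factor contributes, because the second chunk is identical in both features. That is the behaviour the ℓ1-product is meant to provide: each factor adds its own distance independently of the others.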
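9 Proposed method (3/3): The loss function learns correspondence and hierarchical relations at the same time
Loss function: $\mathcal{L}_{\mathrm{overall}} = \mathcal{L}_{\mathrm{cont}} + \gamma\, \mathcal{L}_{\mathrm{ent}}$
Contrastive Loss
▫ Standard InfoNCE
$$\mathcal{L}_{\mathrm{cont}}(\{X_i\}, \{Y_i\}) = -\sum_{i \in B} \log \frac{\exp\left(-d_{\mathrm{avg}}(X_i, Y_i)/\tau\right)}{\sum_{j \in B} \exp\left(-d_{\mathrm{avg}}(X_i, Y_j)/\tau\right)}$$
▫ Averaged over all pairings
$$\mathcal{L}_{\mathrm{cont}} = \frac{1}{4}\Big(\mathcal{L}_{\mathrm{cont}}(\{I_i\}, \{T_i\}) + \mathcal{L}_{\mathrm{cont}}(\{T_i\}, \{I_i\}) + \mathcal{L}_{\mathrm{cont}}(\{I_i^{\mathrm{box}}\}, \{T_i^{\mathrm{box}}\}) + \mathcal{L}_{\mathrm{cont}}(\{T_i^{\mathrm{box}}\}, \{I_i^{\mathrm{box}}\})\Big)$$
Entailment Loss
▫ An entailment cone represents the order relation: $x^{(i)} \in C(y^{(i)})$ corresponds to $x^{(i)} \preceq y^{(i)}$
▫ Leaving the entailment cone is penalized
$$\mathcal{L}_{\mathrm{ent},i}(X, Y) = \max\left(0,\ \phi(x^{(i)}, y^{(i)}) - \eta\, \omega(y^{(i)})\right)$$
$$\mathcal{L}_{\mathrm{ent}}(X, Y) = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}_{\mathrm{ent},i}(X, Y)$$
  ▫ $\phi(x^{(i)}, y^{(i)})$: the angle of $x$ as seen from $y$
  ▫ $\omega(y^{(i)})$: the half-aperture of the cone
  ▫ $\eta$: margin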
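The two loss terms on this slide can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: distances and angles are passed in as precomputed numbers, the function names are hypothetical, and the default `tau=0.07` and `eta=1.0` are placeholder values that the slides do not specify. Only `gamma=0.2` comes from the deck.

```python
import math

def info_nce(dist, tau=0.07):
    """InfoNCE with logits -d/tau; dist[i][j] = d_avg(X_i, Y_j).

    Matching pairs sit on the diagonal. The result is summed over the batch,
    as in the slide's formula.
    """
    loss = 0.0
    for i, row in enumerate(dist):
        logits = [-d / tau for d in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_z
    return loss

def entailment_loss(phi, omega, eta=1.0):
    """Mean over factors of max(0, phi_i - eta * omega_i)."""
    terms = [max(0.0, p - eta * w) for p, w in zip(phi, omega)]
    return sum(terms) / len(terms)

def overall_loss(l_cont, l_ent, gamma=0.2):
    """L_overall = L_cont + gamma * L_ent."""
    return l_cont + gamma * l_ent

# Matched pairs are close (0.1), mismatched pairs are far (2.0).
l_cont = info_nce([[0.1, 2.0], [2.0, 0.1]])
# Factor 1 lies outside its cone (angle 0.5 > aperture 0.2); factor 2 is inside.
l_ent = entailment_loss(phi=[0.5, 0.1], omega=[0.2, 0.3])
print(l_cont, l_ent, overall_loss(l_cont, l_ent))
```

Only the factor whose angle exceeds the scaled aperture contributes to the entailment term, which is what the `max(0, ...)` hinge expresses.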
10 Experimental setup: Training with GRIT
▫ Training dataset
  ▫ GRIT [Peng+, ICCV23]: automatically annotated image-text pairs with bounding boxes
  ▫ 14.0M image-text pairs / 26.6M box annotations
▫ PHyCLIP settings
  ▫ $k = 64$, $d = 8$ (512 dimensions in total)
  ▫ $\gamma = 0.2$
  ▫ Optimizer: AdamW
▫ Experimental environment
  ▫ GPU: A100 ×4
  ▫ Iterations: 500,000
  ▫ Batch size: 768
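For reference, the settings listed on this slide could be collected in a small configuration object. The class and field names below are hypothetical and chosen only for illustration; the values are the ones stated on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PHyCLIPConfig:
    """Hypothetical container for the hyperparameters listed on the slide."""
    num_factors: int = 64        # k: number of hyperbolic factors
    factor_dim: int = 8          # d: dimension of each factor
    entail_weight: float = 0.2   # gamma: weight of the entailment loss
    optimizer: str = "AdamW"
    num_gpus: int = 4            # A100 x4
    iterations: int = 500_000
    batch_size: int = 768

    @property
    def embed_dim(self) -> int:
        """Total embedding dimension k * d."""
        return self.num_factors * self.factor_dim

cfg = PHyCLIPConfig()
print(cfg.embed_dim)
```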
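11 Quantitative results (1/3): PHyCLIP outperforms existing methods on image classification
▫ Zero-shot Image Classification
  (+) PHyCLIP outperforms existing methods overall (the specialized datasets are out of distribution for GRIT)
  (+) Scores are especially high on the General datasets → understanding concept families through multiple hyperbolic spaces is effective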
12 Quantitative results (2/3): PHyCLIP outperforms existing methods on retrieval and hierarchical classification
▫ Zero-shot Retrieval & Hierarchical Classification
  (+) PHyCLIP outperforms existing methods on most retrieval metrics
  (+) It outperforms existing methods on every metric of Hierarchical Classification (how close the predicted label is to the ground truth on WordNet)
13 Quantitative results (3/3): PHyCLIP improves the understanding of compositionality
▫ Compositional Understanding
  ▫ Task: pick out the ground-truth caption from hard negatives in which part of the caption has been changed
  ▫ VL-CheckList-Object: an object in the caption is replaced with a different object
  ▫ SugarCrepe: replace/swap/add applied to object/attribute/relation
  (+) On VL-CheckList-Object, PHyCLIP outperforms existing methods on every subset → it represents the presence of objects robustly with respect to position and size
  (-) Performance drops on relation replacement and object swapping → the Boolean-like design is weak at understanding relations between objects
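14 Qualitative results (1/2): Individual factors make effective use of the embedding space
▫ The norms of image embeddings are larger than those of text embeddings and concentrate in a narrow range (∵ images are more specific than text: $I_i \preceq T_i$)
▫ Within an individual factor, the two norm distributions overlap and spread widely
(+) PHyCLIP makes use of a wide region of the embedding space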
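15 Qualitative results (2/2): Each hyperbolic factor represents the hierarchy of one group of concepts
▫ "dog" activates $\mathbb{H}^d_{39}$ and "car" activates $\mathbb{H}^d_{9}$
▫ "dog and car" activates both at the same time
▫ A hierarchy of mammals appears in $\mathbb{H}^d_{39}$, and a hierarchy of vehicles / everyday items appears in $\mathbb{H}^d_{9}$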
18 Summary: PHyCLIP, an embedding into hyperbolic spaces that accounts for hierarchy and compositionality
▫ Background
  ▫ A VLM has to handle both hierarchy and compositionality
  ▫ CLIP [Radford+, ICML21] embeds into a single Euclidean space → it is hard to represent hierarchy and compositionality at the same time
  ▫ Hyperbolic space is well suited to representing hierarchy, but poorly suited to representing compositionality
▫ Proposed method: PHyCLIP
  ▫ Embeds into the ℓ1-product space of multiple hyperbolic factors
  ▫ Represents compositionality through the simultaneous activation of multiple factors
▫ Results
  ▫ Outperforms existing methods on zero-shot classification and retrieval
  ▫ Improves the representation of hierarchy and the understanding of compositionality
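19 Appendix (1/4): Details of Poincaré Embeddings [Nickel+, NeurIPS17]
▫ Poincaré model
  ▫ Riemannian metric tensor ($g^E$: Euclidean metric tensor)
$$g_x = \left(\frac{2}{1 - \|x\|^2}\right)^2 g^E$$
  ▫ Distance between points $u, v \in \mathcal{B}^d$
$$d(u, v) = \operatorname{arcosh}\left(1 + 2\,\frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right)$$
▫ Optimization
$$\theta_{t+1} \leftarrow \operatorname{proj}\left(\theta_t - \eta_t\, \frac{(1 - \|\theta_t\|^2)^2}{4}\, \nabla_E\right)$$
▫ Loss
$$\mathcal{L}(\Theta) = \sum_{(u,v) \in \mathcal{D}} \log \frac{e^{-d(u, v)}}{\sum_{v' \in \mathcal{N}(u)} e^{-d(u, v')}}$$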
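The Poincaré distance above is easy to evaluate directly. The sketch below is a plain-Python transcription of that formula; the function name is hypothetical, and the inputs are assumed to lie strictly inside the unit ball.

```python
import math

def poincare_distance(u, v):
    """d(u, v) = arcosh(1 + 2 |u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))."""
    sq_norm = lambda w: sum(c * c for c in w)
    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * sq_norm(diff) / ((1.0 - sq_norm(u)) * (1.0 - sq_norm(v)))
    return math.acosh(arg)

# The same Euclidean gap of 0.09 costs far more near the boundary of the ball.
near_origin = poincare_distance([0.0, 0.0], [0.09, 0.0])
near_boundary = poincare_distance([0.90, 0.0], [0.99, 0.0])
print(near_origin, near_boundary)
```

The comparison illustrates why the norm can encode depth in a hierarchy: distances stretch as points approach the boundary, so there is exponentially more room for fine-grained concepts far from the origin.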
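20 Appendix (2/4): Implementation details of PHyCLIP
PHyCLIP implements each hyperbolic factor with the Lorentz model [Nickel+, ICML18] (the curvature of each factor is learnable)
▫ Minkowski inner product: only the time-like direction enters with a negative sign. With $\hat{x} = (x_0, x_1, \dots, x_d)$ and $x = (x_1, \dots, x_d) \in \mathbb{R}^d$:
$$\langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}} = -x_0 y_0 + \langle x, y \rangle_{\mathbb{R}^d}$$
▫ Hyperbolic space represented as a hyperboloid
$$\mathbb{L}^d_\alpha = \left\{ \hat{x} \in \mathbb{R}^{d,1} \;\middle|\; \langle \hat{x}, \hat{x} \rangle_{\mathbb{R}^{d,1}} = -\alpha^{-1},\ x_0 > 0 \right\}$$
▫ Lorentz distance
$$d_{\mathbb{L}^d_\alpha}(\hat{x}, \hat{y}) = \alpha^{-1/2} \operatorname{arccosh}\left(-\alpha \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}}\right)$$
▫ Exponential map: maps $v$ to a point on the hyperbolic space, where $\hat{o} = (\alpha^{-1/2}, 0, \dots, 0)$ is the origin of the hyperboloid and $v$ is embedded as $(0, v)$
$$\hat{x} = \exp^{\alpha}_{\hat{o}}(v) = \cosh\left(\sqrt{\alpha}\,\|v\|\right) \hat{o} + \frac{\sinh\left(\sqrt{\alpha}\,\|v\|\right)}{\sqrt{\alpha}\,\|v\|}\, v$$
▫ Entailment cones in the Lorentz model
$$\omega(y) = \sin^{-1}\left(\min\left(1, \frac{2K}{\sqrt{\alpha}\,\|y\|}\right)\right)$$
$$\phi(x, y) = \cos^{-1}\left(\frac{x_0 + \alpha \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}}\, y_0}{\|y\| \sqrt{\left(\alpha \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}}\right)^2 - 1}}\right)$$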
21 Appendix (3/4): Ablation Study
22 Appendix (4/4): Additional Visualizations