[Journal club] PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Semantic Machine Intelligence Lab., Keio Univ.
May 27, 2026
Transcript
PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning (ICLR 2026)
Keio University, Komei Sugiura Laboratory
Authors: Daiki Yoshikawa (1), Takashi Matsubara (1, 2)
(1) Hokkaido University, (2) CyberAgent
Daiki Yoshikawa, et al. PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning. ICLR 2026.
2 Overview
PHyCLIP: embedding into hyperbolic spaces with both hierarchy and compositionality in mind
▫ Background
  ▫ VLMs handle both hierarchy and compositionality
  ▫ CLIP [Radford+, ICML21] embeds into a single Euclidean space
    → hard to represent hierarchy and compositionality at the same time
  ▫ Hyperbolic space is well suited to hierarchy but poorly suited to compositionality
▫ Proposed method: PHyCLIP
  ▫ Embeds into the $\ell_1$-product space of multiple hyperbolic factors
  ▫ Represents compositionality through the simultaneous activation of multiple factors
▫ Results
  ▫ Outperforms existing methods on zero-shot classification and retrieval
  ▫ Improves the representation of hierarchy and the understanding of compositionality
3 Background (1/3)
Representing hierarchy and compositionality at the same time is difficult
Two kinds of semantic structure that a VLM should handle:
▫ Hierarchy
  ▫ Linguistic concepts can be classified in a tree-like structure (e.g., WordNet [Miller, 95])
  ▫ Example: dog ⪯ mammal ⪯ animal
  ▫ Lower-level concepts are more specific
▫ Compositionality
  ▫ Example: "a dog in a car"
  ▫ Images and sentences are co-occurrences of multiple concepts
CLIP [Radford+, ICML21] represents each input as one vector in a single Euclidean space
Limitation: hierarchy and compositionality cannot both be represented in that single space
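4 Background (2/3)
Hyperbolic space represents hierarchy naturally
Poincaré Embeddings [Nickel+, NeurIPS17]
▫ Background
  ▫ Words and graphs contain a latent hierarchy
  ▫ A low-dimensional Euclidean space cannot represent a deep hierarchy
    (∵ $\mathbb{R}^d$ grows polynomially, whereas a hierarchy grows exponentially)
▫ Proposal: embed into the Poincaré model
  ▫ Hyperbolic space expands exponentially
    → a hierarchy is represented naturally as a continuous tree
  ▫ The norm $\|u\|$ represents the level in the hierarchy; the distance $d(u, v)$ represents similarity
▫ Results
  ▫ Embedding of large taxonomies such as WordNet [Miller, 95]
    ► Surpasses conventional methods in both representation capacity and generalization performance
    ► Maintains high accuracy even in low dimensions
Figure: the mammal subtree of WordNet trained in hyperbolic space ($d = 2$)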
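The "polynomial versus exponential" argument on this slide can be made concrete with a standard comparison in two dimensions. This is a worked illustration, not taken from the slides; it assumes curvature $-1$, and the branching factor $b$ and depth $\ell$ are symbols introduced here.

```latex
\begin{align*}
\text{Euclidean plane:}\quad
  & C_{\mathbb{R}^2}(r) = 2\pi r \\
\text{Hyperbolic plane (curvature } -1\text{):}\quad
  & C_{\mathbb{H}^2}(r) = 2\pi \sinh r \approx \pi e^{r} \quad (r \gg 1) \\
\text{Tree with branching factor } b\text{:}\quad
  & N(\ell) = b^{\ell} = e^{\ell \ln b} \text{ nodes at depth } \ell
\end{align*}
```

If depth-$\ell$ nodes are placed on a circle of radius proportional to $\ell$, the hyperbolic circumference grows at the same exponential rate as the node count, while the Euclidean circumference grows only linearly.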
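5 Background (3/3)
Compositionality has a Boolean-like structure
▫ Images and sentences can be expressed as co-occurrences of multiple concepts
  ▫ "a dog in a car" → {dog, car}
  ▫ "a cat and a bike" → {cat, bike}
  Limitation: a single hyperbolic space that represents hierarchy cannot represent the co-occurrence of multiple concepts
▫ Interpretation as a Boolean algebra
  ▫ Atomic concepts: $C = \{c_1, c_2, \dots, c_n\}$
  ▫ Composite concept: $S \subseteq C$
  ▫ One bit per atomic concept records whether it is included
    → the distance between composite concepts $S$ and $T$ is the Hamming distance
Example: $C = \{\text{dog}, \text{cat}, \text{car}, \text{bike}\}$, $S = \{\text{dog}, \text{car}\}$, $T = \{\text{dog}\}$
  $b(S) = (1, 0, 1, 0)$, $b(T) = (1, 0, 0, 0)$
  Hamming distance: $d_{\mathrm{Ham}}(b(S), b(T)) = 1$
$\ell_1$-product (sum of the distances in the individual hyperbolic spaces):
  $$d_1(x, y) = \sum_{i=1}^{k} d_{\mathbb{H}^d_i}\left(x^{(i)}, y^{(i)}\right)$$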
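The bit-vector example on this slide can be reproduced in a few lines. This is a minimal sketch; the helper names `to_bits` and `hamming` are made up for illustration, and the vocabulary is the four-concept example from the slide.

```python
# Atomic concepts from the slide's example (order fixes the bit positions).
CONCEPTS = ["dog", "cat", "car", "bike"]


def to_bits(concept_set):
    """Encode a set of atomic concepts as a 0/1 membership vector."""
    return [1 if c in concept_set else 0 for c in CONCEPTS]


def hamming(a, b):
    """Hamming distance, i.e. the l1 distance between two 0/1 vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


s = to_bits({"dog", "car"})  # [1, 0, 1, 0]
t = to_bits({"dog"})         # [1, 0, 0, 0]
print(hamming(s, t))         # 1
```

Summing per-coordinate differences is the same operation the $\ell_1$-product performs over hyperbolic factors, with each bit replaced by a per-factor distance.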
6 Related work
Vision-Language Representation Learning
Method | Overview | Characteristics
CLIP [Radford+, ICML21] | Maps images and text into a single Euclidean space | Limitation: cannot handle hierarchy or compositionality explicitly
MERU [Desai+, ICML23] | Extends the CLIP embedding space to hyperbolic space | ► Represents the tree structure of a hierarchy. Limitation: does not consider compositionality
HyCoCLIP [Pal+, ICLR25] | Introduces bounding box supervision and hyperbolic entailment cones | ► Learns object-level hierarchy explicitly. Limitation: handles compositionality only to a limited extent
Figures: MERU, HyCoCLIP
7 Proposed method (1/3)
Overall picture of PHyCLIP
▫ Embeds into a space built from multiple hyperbolic factors
▫ $k$ hyperbolic spaces $\mathbb{H}^d_i$ of dimension $d$ each → $kd$ dimensions in total
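8 Proposed method (2/3)
Embedding images and text into hyperbolic space
▫ Embedding into hyperbolic space
  ▫ Split the $kd$-dimensional feature into $k$ parts
  ▫ Map each part $v^{(i)}$ into a hyperbolic space: $v^{(i)} \in \mathbb{R}^d \mapsto x^{(i)} \in \mathbb{H}^d_i$
▫ Definition of distance ($\ell_1$-product metric)
  $$d_1(X, Y) = \sum_{i=1}^{k} d_{\mathbb{H}^d_i}\left(x^{(i)}, y^{(i)}\right), \qquad d_{\mathrm{avg}}(X, Y) = \frac{1}{k}\, d_1(X, Y)$$
▫ Images and text cropped at the object level are also used
  ▫ Inputs: $I$, $T$, $I^{\mathrm{box}}$, $T^{\mathrm{box}}$
  ▫ An image is more specific than its text
  ▫ The original image/text is more specific than the cropped one
Entailment relations: $I \preceq T$, $I^{\mathrm{box}} \preceq T^{\mathrm{box}}$, $I \preceq I^{\mathrm{box}}$, $T \preceq T^{\mathrm{box}}$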
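The split, map, and sum steps on this slide can be sketched with the standard library alone. Several things here are assumptions rather than the authors' implementation: every factor shares one curvature parameter `alpha` (the appendix says the curvature is learnable per factor), points are stored as `(time, space)` pairs on the Lorentz hyperboloid, and all function names are invented for this sketch.

```python
import math


def exp_map_origin(v, alpha=1.0):
    """Map v in R^d to the hyperboloid of curvature -alpha via the origin's exp map.

    Returns (x0, x): the time component and the list of space components.
    """
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return 1.0 / math.sqrt(alpha), [0.0] * len(v)
    s = math.sqrt(alpha) * norm
    return math.cosh(s) / math.sqrt(alpha), [math.sinh(s) / s * c for c in v]


def lorentz_distance(p, q, alpha=1.0):
    """Geodesic distance between two hyperboloid points given as (x0, x)."""
    (x0, x), (y0, y) = p, q
    inner = -x0 * y0 + sum(a * b for a, b in zip(x, y))  # Minkowski inner product
    # Clamp to acosh's domain to absorb floating-point error.
    return math.acosh(max(1.0, -alpha * inner)) / math.sqrt(alpha)


def split(v, k):
    """Split a flat k*d feature vector into k chunks of length d."""
    d = len(v) // k
    return [v[i * d:(i + 1) * d] for i in range(k)]


def d1(u, v, k, alpha=1.0):
    """l1-product distance: sum of the per-factor hyperbolic distances."""
    return sum(
        lorentz_distance(exp_map_origin(a, alpha), exp_map_origin(b, alpha), alpha)
        for a, b in zip(split(u, k), split(v, k))
    )


def d_avg(u, v, k, alpha=1.0):
    """Average of the per-factor distances, d1 / k."""
    return d1(u, v, k, alpha) / k


# Two toy features with k = 2 factors of dimension d = 2.
# They agree in factor 1 and differ only in factor 2.
u = [0.3, 0.0, 0.0, 0.0]
v = [0.3, 0.0, 0.5, 0.0]
print(d1(u, v, k=2))     # approximately 0.5
print(d_avg(u, v, k=2))  # approximately 0.25
```

Only the factor in which the two inputs differ contributes to the total, which is the behaviour the Boolean-like reading of the $\ell_1$-product relies on.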
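9 Proposed method (3/3)
Loss function: learns correspondence and hierarchical relations at the same time
Loss: $\mathcal{L}_{\mathrm{overall}} = \mathcal{L}_{\mathrm{cont}} + \gamma\, \mathcal{L}_{\mathrm{ent}}$
Contrastive loss
▫ Standard InfoNCE
  $$\mathcal{L}_{\mathrm{cont}}(\{X_i\}, \{Y_i\}) = -\sum_{i \in B} \log \frac{\exp\left(-d_{\mathrm{avg}}(X_i, Y_i)/\tau\right)}{\sum_{j \in B} \exp\left(-d_{\mathrm{avg}}(X_i, Y_j)/\tau\right)}$$
▫ Averaged over all pairings
  $$\mathcal{L}_{\mathrm{cont}} = \frac{1}{4}\Big(\mathcal{L}_{\mathrm{cont}}(\{I_i\}, \{T_i\}) + \mathcal{L}_{\mathrm{cont}}(\{T_i\}, \{I_i\}) + \mathcal{L}_{\mathrm{cont}}(\{I_i^{\mathrm{box}}\}, \{T_i^{\mathrm{box}}\}) + \mathcal{L}_{\mathrm{cont}}(\{T_i^{\mathrm{box}}\}, \{I_i^{\mathrm{box}}\})\Big)$$
Entailment loss
▫ The entailment cone expresses the order relation: $x^{(i)} \in C(y^{(i)})$ ► $x^{(i)} \preceq y^{(i)}$
▫ Penalty when a point falls outside the entailment cone
  $$\mathcal{L}_{\mathrm{ent},i}(X, Y) = \max\left(0,\ \phi(x^{(i)}, y^{(i)}) - \eta\, \omega(y^{(i)})\right)$$
  $$\mathcal{L}_{\mathrm{ent}}(X, Y) = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}_{\mathrm{ent},i}(X, Y)$$
  $\phi(x, y)$: angle of $x$ as seen from $y$
  $\omega(y)$: half-aperture of the cone
  $\eta$: margin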
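The loss on this slide combines a distance-based InfoNCE term with a hinge on cone angles. The sketch below assumes the distances and angles have already been computed and are passed in as plain lists. The defaults `tau=0.07` and `eta=1.0` are placeholders, not values from the slides; `gamma=0.2` is the setting reported on the experimental setup slide. Function names are invented for illustration.

```python
import math


def info_nce(dist, tau=0.07):
    """InfoNCE over a square matrix dist[i][j] = d_avg(X_i, Y_j).

    Matching pairs sit on the diagonal, and logits are -distance / tau.
    Like the slide's formula, this sums (rather than averages) over the batch.
    """
    loss = 0.0
    for i, row in enumerate(dist):
        logits = [-d / tau for d in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss -= logits[i] - log_denom
    return loss


def entailment_loss(angles, apertures, eta=1.0):
    """Mean over factors of max(0, phi_i - eta * omega_i)."""
    terms = [max(0.0, phi - eta * om) for phi, om in zip(angles, apertures)]
    return sum(terms) / len(terms)


def overall_loss(l_cont, l_ent, gamma=0.2):
    """L_overall = L_cont + gamma * L_ent."""
    return l_cont + gamma * l_ent


# Toy batch of two pairs: matched pairs are close, mismatched pairs are far.
dist = [[0.1, 2.0],
        [2.0, 0.1]]
l_cont = info_nce(dist, tau=1.0)
# Two factors: the first violates its cone by 0.2 rad, the second is inside.
l_ent = entailment_loss(angles=[0.5, 0.1], apertures=[0.3, 0.3])
print(overall_loss(l_cont, l_ent))
```

The full contrastive term on the slide would call `info_nce` four times (image to text, text to image, and the same for the box crops) and average the results.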
10 Experimental setup
Training with GRIT
▫ Training dataset
  ▫ GRIT [Peng+, ICCV23]: automatically annotated image-text pairs with bounding boxes
  ▫ 14.0M image-text pairs / 26.6M box annotations
▫ PHyCLIP settings
  ▫ $k = 64$, $d = 8$ (512 dimensions in total)
  ▫ $\gamma = 0.2$
  ▫ Optimizer: AdamW
▫ Experimental environment
  ▫ GPU: A100 ×4
  ▫ Iterations: 500,000
  ▫ Batch size: 768
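11 Quantitative results (1/3)
PHyCLIP outperforms existing methods on image classification tasks
▫ Zero-shot Image Classification
  ► PHyCLIP outperforms existing methods overall (the specialized datasets are out of distribution for GRIT)
  ► Scores are especially high on General
    → understanding concept families through multiple hyperbolic spaces is effective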
12 Quantitative results (2/3)
PHyCLIP outperforms existing methods on retrieval and hierarchical classification
▫ Zero-shot Retrieval & Hierarchical Classification
  ► PHyCLIP outperforms existing methods on most retrieval metrics
  ► PHyCLIP outperforms existing methods on every Hierarchical Classification metric (how close the predicted label and the GT are on WordNet)
13 Quantitative results (3/3)
PHyCLIP improves the understanding of compositionality
▫ Compositional Understanding
  ▫ Task: pick out the GT caption from hard negatives in which part of the caption has been changed
  ▫ VL-CheckList-Object: an object in the caption is replaced with a different object
  ▫ SugarCrepe: replace/swap/add applied to object/attribute/relation
  ► On VL-CheckList-Object, PHyCLIP outperforms existing methods on every subset
    → it represents the presence of an object robustly with respect to position and size
  Limitation: performance drops on relation replacement and object swapping
    → the Boolean-like design is weak at understanding relations between objects
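14 Qualitative results (1/2)
Each individual factor makes effective use of the embedding space
▫ The norms of images are larger than the norms of text and concentrate in a narrow range
  (∵ an image is more specific than its text: $I_i \preceq T_i$)
▫ Within an individual factor, the two norm distributions overlap and spread widely
  ► PHyCLIP uses a wide region of the embedding space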
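15 Qualitative results (2/2)
Each hyperbolic factor represents the hierarchy of its own concepts
▫ "dog" activates $\mathbb{H}^d_{39}$ and "car" activates $\mathbb{H}^d_{9}$
▫ "dog and car" activates both at the same time
▫ A hierarchy of mammals appears in $\mathbb{H}^d_{39}$, and a hierarchy of vehicles and everyday items appears in $\mathbb{H}^d_{9}$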
18 Summary
PHyCLIP: embedding into hyperbolic spaces with both hierarchy and compositionality in mind
▫ Background
  ▫ VLMs handle both hierarchy and compositionality
  ▫ CLIP [Radford+, ICML21] embeds into a single Euclidean space
    → hard to represent hierarchy and compositionality at the same time
  ▫ Hyperbolic space is well suited to hierarchy but poorly suited to compositionality
▫ Proposed method: PHyCLIP
  ▫ Embeds into the $\ell_1$-product space of multiple hyperbolic factors
  ▫ Represents compositionality through the simultaneous activation of multiple factors
▫ Results
  ▫ Outperforms existing methods on zero-shot classification and retrieval
  ▫ Improves the representation of hierarchy and the understanding of compositionality
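19 Appendix (1/4)
Details of Poincaré Embeddings [Nickel+, NeurIPS17]
▫ Poincaré model
  ▫ Riemannian metric tensor ($g^E$: Euclidean metric tensor)
    $$g_x = \left(\frac{2}{1 - \|x\|^2}\right)^2 g^E$$
  ▫ Distance between points $u, v \in \mathcal{B}^d$
    $$d(u, v) = \operatorname{arcosh}\left(1 + 2\,\frac{\|u - v\|^2}{\left(1 - \|u\|^2\right)\left(1 - \|v\|^2\right)}\right)$$
▫ Optimization
  $$\theta_{t+1} \leftarrow \operatorname{proj}\left(\theta_t - \eta_t\, \frac{\left(1 - \|\theta_t\|^2\right)^2}{4}\, \nabla_E\right)$$
▫ Loss
  $$\mathcal{L}(\Theta) = \sum_{(u, v) \in \mathcal{D}} \log \frac{e^{-d(u, v)}}{\sum_{v' \in \mathcal{N}(u)} e^{-d(u, v')}}$$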
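The distance and update rule on this slide translate directly into code. This is a minimal sketch with invented function names. The projection step rescales a point that leaves the unit ball to norm `1 - eps`; `eps` and the learning rate `lr` are placeholder values, not settings from the slides.

```python
import math


def poincare_distance(u, v):
    """Distance in the Poincare ball between points with norm < 1."""
    def sq(w):
        return sum(c * c for c in w)
    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * sq(diff) / ((1.0 - sq(u)) * (1.0 - sq(v)))
    return math.acosh(arg)


def rsgd_step(theta, euclid_grad, lr=0.1, eps=1e-5):
    """One Riemannian SGD step: rescale the Euclidean gradient, then project."""
    n2 = sum(c * c for c in theta)
    scale = lr * (1.0 - n2) ** 2 / 4.0  # inverse of the metric's conformal factor
    new = [t - scale * g for t, g in zip(theta, euclid_grad)]
    norm = math.sqrt(sum(c * c for c in new))
    if norm >= 1.0:  # keep the point strictly inside the unit ball
        new = [c / norm * (1.0 - eps) for c in new]
    return new


origin = [0.0, 0.0]
print(poincare_distance(origin, [0.5, 0.0]))  # ln 3, about 1.0986
print(poincare_distance(origin, [0.9, 0.0]))  # ln 19, about 2.9444
print(rsgd_step(origin, [1.0, 0.0], lr=0.4))  # [-0.1, 0.0]
```

Moving from radius 0.5 to 0.9 nearly triples the distance from the origin, which illustrates how the norm can encode depth in a hierarchy.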
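20 Appendix (2/4)
Implementation details of PHyCLIP
PHyCLIP implements each hyperbolic factor with the Lorentz model [Nickel+, ICML18] (the curvature $-\alpha_i$ is learnable)
▫ Minkowski inner product: an inner product that is negative only in the time direction
  $$\hat{x} = (x_0, x_1, \dots, x_d), \quad x = (x_1, \dots, x_d) \in \mathbb{R}^d, \qquad \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}} = -x_0 y_0 + \langle x, y \rangle_{\mathbb{R}^d}$$
▫ Hyperbolic space represented as a hyperboloid
  $$\mathbb{L}^d_\alpha = \left\{ \hat{x} \in \mathbb{R}^{d,1} \;\middle|\; \langle \hat{x}, \hat{x} \rangle_{\mathbb{R}^{d,1}} = -\alpha^{-1},\ x_0 > 0 \right\}$$
▫ Lorentz distance
  $$d_{\mathbb{L}^d_\alpha}(\hat{x}, \hat{y}) = \alpha^{-1/2} \operatorname{arccosh}\left(-\alpha\, \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}}\right)$$
▫ Exponential map: maps $v$ to a point on the hyperbolic space (with $v$ embedded as the tangent vector $(0, v)$ at the origin $\hat{o}$)
  $$\hat{x} = \exp^{\alpha}_{\hat{o}}(v) = \cosh\left(\sqrt{\alpha}\, \|v\|\right) \hat{o} + \frac{\sinh\left(\sqrt{\alpha}\, \|v\|\right)}{\sqrt{\alpha}\, \|v\|}\, v$$
▫ Entailment cones in the Lorentz model
  $$\omega(y) = \sin^{-1}\left(\min\left(1,\ \frac{2K}{\sqrt{\alpha}\, \|y\|_{\mathbb{R}^d}}\right)\right)$$
  $$\phi(x, y) = \cos^{-1}\left(\frac{x_0 + \alpha\, \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}}\, y_0}{\|y\|_{\mathbb{R}^d} \sqrt{\left(\alpha\, \langle \hat{x}, \hat{y} \rangle_{\mathbb{R}^{d,1}}\right)^2 - 1}}\right)$$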
21 Appendix (3/4)
Ablation Study
22 Appendix (4/4)
Additional Visualizations