

PhD Defense 2025: Visual Understanding of Human Hands in Interactions


Take Ohkawa

July 14, 2025


Transcript

  1. PhD thesis defense Visual Understanding of Human Hands in Interactions

    2025 7.14 Takehiko Ohkawa (Advisor: Prof. Yoichi Sato) / The University of Tokyo
  2. 2 Explore Discover Love Emphasize Express Ubiquitous role of hands

    alongside life stage changes Dexterity Tool use Profession Recall, we've grown with our Hands...
  3. 3 Video understanding AR glasses VR game Photorealistic telepresence Sign

    language understanding Robot imitation This centrality of hands opens up broader applications
  4. Spectrum of understanding • Location, pixel-mask, keypoints, 3D pose, and

    shape • Richer is better due to the inclusion relation (e.g., 2D kpt ⊂ 3D kpt) • Data always limit the level of understanding 4 Fine & 3D Hand-object detection
 [D. Shan+, CVPR’20] Hand segmentation
 [G. Serra+, ACMMM’13] 2D pose (keypoints)
 [T. Simon+, CVPR’17] 3D pose
 [G. Moon+, ECCV’20] 3D shape
 [C. Zimmermann+, ICCV’19]
  5. Spectrum of data captures • Data vary from studio to

    in-the-wild settings (fidelity vs. diversity) 5 [1] T. Ohkawa, R. Furuta, and Y. Sato. Efficient annotation and learning for 3D hand pose estimation: A survey. IJCV, 2023 Internet videos
 (100DOH)
 [D. Shan+, CVPR’20] Multi-camera dome
 (InterHand2.6M)
 [G. Moon+, ECCV’20, 
 C. Wuu+, arXiv’22] Multi-camera desktop
 (DexYCB / HO3D)
 [Y-W. Chao+, CVPR’21,
 S. Hampali+, CVPR’20] Wild ego videos
 (Ego4D)
 [K. Grauman+, CVPR’22] Diversity Fidelity RGB-D camera
 (Dexter+Object / FPHA)
 [S. Sridhar+, ECCV’16,
 G. Garcia-Hernando+, CVPR’18]
  6. Research questions • What are challenging scenarios in data capture

    and perception under the status quo? • What are the learning issues under the tradeoff between data fidelity and diversity? • How are tracked hand states linked to semantics? 6
  7. Q1: Challenging scenarios • Contact involves heavy occlusion and ambiguity

    7 Object contact
 (chapter 3) Self-contact
 (chapter 4)
  8. Q1: Challenging scenarios • Contact involves heavy occlusion and ambiguity

    • Strong inductive cues of semantics 8 Object contact
 (chapter 3) Self-contact
 (chapter 4) Action intent Object affordance Emotion Psychological states
  9. Q2: Learning issues • Learning needs to accommodate two data

    sources • Limited size and variety of labeled data • Diverse in-the-wild data with limited labels and different views 9 Training data (studio) Support data or test data (in-the-wild) Prior modeling from large data (chapter 4,5) Adaptation to test domain (chapter 6)
  10. Q3: Semantic mapping • A geometric state can be interpreted

    in various ways (i.e., one-to-many) 10 “Grasp something” “Drink coffee” “Crack an egg” “Tighten bolt” Learn to map hand states to semantics dependent on the context (chapter 7)
  11. 11 Visual Understanding of Human Hands in Interactions Data 


    foundation Robust modeling for fine details Connecting geometry with semantics ––Precise tracking and interpretation of fine-grained hand interactions from real-world visual data––
  12. 12 Robust modeling for fine details Data foundation Connecting

    geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) Chapter 5: Pre-training from diverse images (ICLR’25) Chapter 6: Adaptation 
 in the wild (ECCV’22) Chapter 7: Video language description from hand tracklets (WACV’25) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose
  13. Table of contents I. Introduction II. Survey for 3D hand

    capture, annotation, and learning methods (IJCV’23) III. Egocentric hand pose estimation under object interactions (ECCV’24) IV. Hand self-contact benchmark and generative pose modeling (ICCV’25) V. 3D hand pose pre-training from in-the-wild images (ICLR’25) VI. Domain adaptive hand state estimation in the wild (ECCV'22) VII.Dense video captioning for egocentric hand activities (WACV’25) VIII.Conclusions and Future Work 13
  14. 14 Robust modeling for fine details Data foundation Connecting

    geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3
  15. Benchmark for egocentric hand pose • Facilitate analysis on natural

    and intricate object interactions 1. Propose a 3D hand pose benchmark using a multi-view egocentric headset 2. Extensive analysis of recent estimation methods at the ICCV 2023 competition 3. Identify insights and findings by aggregating community knowledge 15 [2] Z. Fan*, T. Ohkawa*+, Benchmarks and challenges in pose estimation for egocentric hand interactions with objects. In ECCV, 2024 Exo views Ego views
  16. Progress of 3D hand pose estimation 16 HANDS’17 
 [S.

    Yuan+, CVPR’18] HANDS’19
 [A. Armagan+, ECCV’20] • Depth images • Marker-based annotation
 →Bias in RGB images • Static RGB images • Object interactions • Markerless annotation
 →Less accurate in single-view setup • Simple and scripted actions ? HANDS’23 • Address object interactions from RGB images • Can we dive into more dynamic settings? • Egocentric cameras? • Unscripted actions? • Intricate contact? • How do we annotate accurately?
  17. AssemblyHands 17 [A] T. Ohkawa,+ AssemblyHands: Towards egocentric activity understanding

    via 3D hand pose estimation. In CVPR, 2023 • Unscripted object assembly captures with ego-exo cameras • Large and accurate 3D pose GTs with markerless annotation (3M images) • Egocentric capture with a multi-view headset (aligned with commercial AR/VR devices)
  18. Multi-view annotation • Multi-view triangulation with volumetric features for 3D

    pose annotation • Collaborated with Meta for data capture 18 [A] T. Ohkawa,+ AssemblyHands: Towards egocentric activity understanding via 3D hand pose estimation. In CVPR, 2023
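A minimal sketch of multi-view keypoint triangulation in the spirit of the annotation setup above; the actual AssemblyHands pipeline additionally uses volumetric features, which are omitted here, and the function and variable names are illustrative.

```python
# Minimal sketch (not the AssemblyHands pipeline): triangulating one hand keypoint
# from multiple calibrated views via the Direct Linear Transform (DLT).
# Assumes 3x4 projection matrices P_i and 2D detections (u_i, v_i) per view.
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """proj_mats: list of (3, 4) camera projection matrices.
    points_2d: list of (u, v) pixel detections of the same keypoint.
    Returns the 3D point minimizing the algebraic reprojection error."""
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous 3D point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                 # (2 * n_views, 4)
    _, _, vt = np.linalg.svd(A)        # least-squares solution = last right singular vector
    X = vt[-1]
    return X[:3] / X[3]                # dehomogenize

# Usage: X = triangulate_point([P_cam1, P_cam2, P_cam3], [(u1, v1), (u2, v2), (u3, v3)])
```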
  19. Task overview 19 Train: 3D hand pose estimation from egocentric

    views 3D hand joints (GT) SVEgoNet Test: Multi-view inference Multi-view fusion Exo images Ego images Multi-view annotation
  20. Competition results at ICCV 2023 20 Task 1 - Egocentric

    3D Hand Pose Estimation - HANDS@ICCV2023, Link: https://sites.google.com/view/hands2023/challenges/assemblyhands
 Method | Learning design | Architecture | Preprocessing & augmentation | Multi-view fusion | MPJPE↓
 Base | 2D heatmap and 3D location map | ResNet50 | - | Simple average | 20.69
 JHands | Regression | Hiera (ViT-based) | Warp perspective, color jitter, random mask | Adaptive view selection and average | 12.21
 PICO-AI | 2.5D heatmap, heatmap voting | ResNety320 | Scale, rotate, flip, translate | Adaptive view selection, FTL in training | 12.46
 FRDC | Regression, 2D heatmap | HandOccNet with ConvNeXt | Scale, rotate, color jitter | Weighted average | 16.48
 Phi-AI | 2D heatmap and 3D location map | ResNet50 | Scale, rotate, translate, color jitter, gaussian blur | Weighted average | 17.26
  21. Finding (i): Distortion from egocentric camera • Stretched hands near

    the image edge due to the fisheye distortion • Proposal: Perspective transformation to define less stretched hand images (JHands) 21 Set a virtual camera on the hand, as sketched below
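A minimal sketch, not the JHands implementation, of warping a crop as if a virtual pinhole camera were pointed at the hand center; intrinsics, the output size, and the focal length below are illustrative assumptions.

```python
# Re-render a hand region through a virtual camera aimed at the hand center,
# which reduces the stretching seen near the edges of a wide-FoV egocentric image.
import cv2
import numpy as np

def look_at_rotation(ray):
    """Rotation mapping the camera optical axis (0,0,1) onto the given viewing ray.
    Assumes the ray is not parallel to the image up-vector (0,1,0)."""
    z = ray / np.linalg.norm(ray)
    x = np.cross([0.0, 1.0, 0.0], z); x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z])  # rows are the new camera axes

def warp_to_virtual_camera(image, K, hand_center_px, out_size=256, out_focal=300.0):
    # Back-project the hand center to a viewing ray and rotate the camera toward it.
    ray = np.linalg.inv(K) @ np.array([*hand_center_px, 1.0])
    R = look_at_rotation(ray)
    K_virt = np.array([[out_focal, 0, out_size / 2],
                       [0, out_focal, out_size / 2],
                       [0, 0, 1.0]])
    # Homography from the original image plane to the virtual camera's image plane.
    H = K_virt @ R @ np.linalg.inv(K)
    return cv2.warpPerspective(image, H, (out_size, out_size))
```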
  22. Finding (ii): Bias of detected hands per view 22 •

    Proposal: Adaptive multi-view fusion, as sketched below • Weighted sum of views, computed from the validation set (FRDC & Phi-AI) • Confident view selection by filtering outlier views (JHands & PICO-AI) (Figure: frequency of detected hands per camera, Cam1 to Cam4)
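A minimal sketch of the two fusion strategies described above; the weights and the outlier threshold are illustrative, not the competition entries' actual values.

```python
# Adaptive multi-view fusion: (a) validation-derived view weighting, (b) outlier-view filtering.
import numpy as np

def weighted_fusion(poses_3d, view_weights):
    """poses_3d: (n_views, n_joints, 3) per-view predictions in a shared frame.
    view_weights: (n_views,) weights, e.g. estimated from per-view validation accuracy."""
    w = np.asarray(view_weights, dtype=float)
    w = w / w.sum()
    return np.einsum('v,vjk->jk', w, poses_3d)

def filtered_fusion(poses_3d, max_dev_mm=30.0):
    """Drop views whose prediction deviates too far from the median pose, then average."""
    median_pose = np.median(poses_3d, axis=0)                        # (n_joints, 3)
    dev = np.linalg.norm(poses_3d - median_pose, axis=-1).mean(-1)   # mean joint deviation per view
    keep = dev < max_dev_mm
    return poses_3d[keep].mean(axis=0) if keep.any() else median_pose
```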
  23. Summary • Accurate multi-view annotation in AssemblyHands (with Meta at

    Redmond) • Benchmark egocentric hand pose estimation from an AR/VR headset • Analyze winning methods of the ICCV 2023 competition • Address egocentric-specific challenges • Fisheye distortion • Adaptive multi-view fusion • Organize the HANDS workshop with a global consortium 
 (ICCV’23,25 & ECCV’24) • AssemblyHands-related works won 
 EgoVis Distinguished Paper Award (CVPR’25) 23
  24. 24 Robust modeling for fine details Data foundation Connecting

    geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose
  25. Self-contact data and generative modeling 25 [3] T. Ohkawa+, Generative

    modeling of shape-dependent self-contact human poses. In ICCV, 2025 Multi-camera dome capture Prior fitting to novel subjects Generative pose prior • Address complex self-contact pose modeling 1. Create a new hand self-contact dataset with varying poses and shapes 2. Learn the self-contact pose distribution with generative modeling 3. Apply the generative prior to single-view pose estimation
  26. Difficulty of estimating self-contact poses 26 Common failures

    in a single-view • Contact ambiguity • Depth ambiguity Contact prior is needed to correct poses in such failures
  27. Limitations of existing self-contact datasets 27 Human3DSC
 [M. Fieraru+, AAAI’21]

    MTP
 [L. Muller+, CVPR’21] • Mocap-based capture • Manual contact part annotation • Small contact data (1.0K) • Mocap-based capture & mimic the pose • Pseudo-GTs lacking accuracy • Small contact data (1.6K)
  28. Goliath-SC dataset • Scaling up the capture based on 200+

    RGB camera dome • Various self-contact interactions across face, body, and hands (383K poses) • Collaborated with Meta for data capture 28 Multi-camera dome
 (InterHand2.6M)
 [G. Moon+, ECCV’20, 
 C. Wuu+, arXiv’22]
  29. Self-contact pose modeling • Pose and body shape are closely

     correlated! → Model body-shape-dependent poses • Formulate a generative contact prior over self-contact poses conditioned on body shape, learned from dome captures where subjects are instructed to perform self-contact (e.g., “rubbing belly”) • Agnostic to environmental conditions, unlike direct regression from images 29
  30. Denoising diffusion network: Part-aware Pose Diffusion (PAPoseDiff) 30

     • Noised part-wise poses X_t (body θ_b, left hand θ_lh, right hand θ_rh, face θ_f) are denoised to X̂_0 by a part-aware self-attention transformer, conditioned on the (perturbed) body shape and the diffusion time step t • Latent embeddings are decoded through the SMPL-X layer into a mesh • Training loss: L_D = λ_θ L_θ + λ_v L_v + λ_col L_col (pose loss + vertex loss + collision loss), as sketched below
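A minimal PyTorch-style sketch of the combined objective L_D = λ_θ L_θ + λ_v L_v + λ_col L_col above; the collision term here is a generic penetration penalty placeholder, not the paper's exact formulation, and all names are illustrative.

```python
import torch

def diffusion_training_loss(pose_pred, pose_gt, verts_pred, verts_gt,
                            penetration_depth, w_pose=1.0, w_vert=1.0, w_col=0.1):
    """pose_pred/pose_gt: (B, n_params) denoised vs. ground-truth part-wise poses.
    verts_pred/verts_gt: (B, V, 3) SMPL-X vertices decoded from the poses.
    penetration_depth: (B, V) per-vertex self-penetration depth (>= 0), assumed precomputed."""
    loss_pose = torch.nn.functional.mse_loss(pose_pred, pose_gt)     # pose loss L_theta
    loss_vert = torch.nn.functional.l1_loss(verts_pred, verts_gt)    # vertex loss L_v
    loss_col = penetration_depth.clamp(min=0).mean()                 # collision loss L_col
    return w_pose * loss_pose + w_vert * loss_vert + w_col * loss_col
```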
  31. Prior fitting to single-view pose estimation • Given estimated

    2D keypoints and 3D pose, refine the 3D pose with denoising, as sketched below 31
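A minimal sketch of the prior-fitting idea, assuming a trained part-aware denoiser `denoise(pose, shape, t)` and a differentiable reprojection `project(pose, shape, cam)`; both names are hypothetical, and the alternation of a denoising pull with a 2D keypoint fitting term only illustrates the spirit of the refinement described above.

```python
import torch

def refine_with_prior(init_pose, shape, cam, kpts_2d, denoise, project,
                      n_steps=50, lr=0.01, prior_weight=0.1):
    pose = init_pose.clone().requires_grad_(True)
    optim = torch.optim.Adam([pose], lr=lr)
    for t in reversed(range(n_steps)):
        # (1) pull the current pose toward the learned self-contact pose distribution
        with torch.no_grad():
            pose_prior = denoise(pose.detach(), shape, t)
        # (2) keep the pose consistent with the observed 2D keypoints
        optim.zero_grad()
        loss = torch.nn.functional.mse_loss(project(pose, shape, cam), kpts_2d) \
             + prior_weight * torch.nn.functional.mse_loss(pose, pose_prior)
        loss.backward()
        optim.step()
    return pose.detach()
```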
  32. Results of single-view pose estimation • Pose metric: MPJPE↓ 32

    Method | Avg. | Hands | Body | Face
 SMPLer-X [Z. Cai+, NeurIPS’23] | 58.0 | 98.7 | 41.6 | 38.9
 (Fine-tuned) | 42.0 | 56.7 | 31.9 | 34.1
 + 2D fitting | 41.7 | 65.7 | 30.6 | 31.6
 + BUDDI [L. Muller+, CVPR’24] | 71.7 | 99.9 | 36.3 | 66.4
 + Ours | 31.8 | 54.6 | 24.7 | 19.2
 Evaluated on the Goliath-SC eval set (qualitative figure: SMPLer-X vs. +Ours)
  33. Summary • Construct Goliath-SC dataset for self-contact analysis (with Meta

    at Pittsburgh) • Generative contact prior for self-contact poses • Part-aware pose diffusion with latent embedding • Prior adaptation to single-view pose estimation • Outperform the fine-tuned SOTA regressor 33
  34. 34 Robust modeling for fine details Data foundation Connecting

    geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) Chapter 5: Pre-training from diverse images (ICLR’25) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose
  35. 3D hand pose pre-training in the wild • Leverage large-scale

    in-the-wild videos to build a prior for hand perception 1. Propose similar hands mining to create informative positive pairs 2. Propose contrastive learning with adaptive weighting based on similarity 3. Validate the effectiveness of the pre-trained prior on 3D HPE 35 [4] N. Lin*, T. Ohkawa*+, SiMHand: Mining similar hands for 3D hand pose pre-training. In ICLR, 2025
  36. Learning from in-the-wild videos • We have access to massive

    in-the-wild human videos but they are unlabeled • Hand perception prior with contrastive learning from diverse images 36 100DOH from YouTube
 [D. Shan+, CVPR’20] Ego4D: Worldwide first-person footage 
 [K. Grauman+, CVPR’22]
  37. How to assign positive pairs? • SimCLR / PeCLR: Positive

    samples originate from the same instance 
 [T. Chen+, ICML’20, A. Spurr+, ICCV’21] 37
  38. How to assign positive pairs? • SimCLR / PeCLR: Positive

    samples originate from the same instance 
 [T. Chen+, ICML’20, A. Spurr+, ICCV’21] • Time contrastive learning assigns temporally neighboring frames as positives
 [A. Ziani+, 3DV’22] 38 → Limited supervision related to interactions, background, and user identity
  39. • Can samples with similar pose provide a key to

    generalizable prior? Ideal positives 39 1) Different hand-held objects 2) Different backgrounds 3) Different user ID and appearance (a similarity-weighted contrastive loss over such mined positives is sketched below)
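A minimal PyTorch-style sketch of contrastive pre-training with mined similar-hand positives weighted by pose similarity; the similarity definition and temperature are illustrative, not SiMHand's exact choices.

```python
import torch
import torch.nn.functional as F

def pose_similarity(kpts_a, kpts_b):
    """kpts_*: (B, 21, 2) normalized 2D hand keypoints used as weak guidance.
    Returns a similarity in (0, 1]; higher means more similar poses."""
    dist = (kpts_a - kpts_b).norm(dim=-1).mean(dim=-1)   # mean per-joint distance
    return torch.exp(-dist)

def weighted_contrastive_loss(z_anchor, z_pos, sim, temperature=0.1):
    """z_anchor, z_pos: (B, D) embeddings of anchors and their mined similar-hand positives.
    sim: (B,) pose similarity used to adaptively weight each positive pair."""
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    logits = z_anchor @ z_pos.t() / temperature          # (B, B): diagonal entries are positives
    labels = torch.arange(z_anchor.size(0), device=z_anchor.device)
    per_pair = F.cross_entropy(logits, labels, reduction='none')
    return (sim * per_pair).mean()                        # down-weight less similar positives
```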
  40. Experimental results 42

     Method | FreiHand | DexYCB | AssemblyHands
 w/o pre-train | 19.21 | 19.36 | 19.17
 SimCLR [T. Chen+, ICML’20] | 20.07 | 21.09 | 21.24
 PeCLR [A. Spurr+, ICCV’21] | 18.19 | 18.06 | 18.88
 Ours (Similar hands + weighting) | 15.79 | 16.71 | 18.23
 • Pre-training set: Ego4D + 100DOH (2M images) with processed 2D poses • Fine-tune the pre-trained model on 3D HPE datasets (Metric: MPJPE↓)
  41. Summary • Learning a generalizable prior from in-the-wild similar hands

    • Leverage weak guidance from 2D pose in pre-training • Similar hands mining in the pre-training set construction • Propose contrastive learning from similar hands with adaptive weighting • Demonstrate its effectiveness across different 3D HPE datasets 44
  42. 45 Robust modeling for fine details Data foundation Connecting

    geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) Chapter 5: Pre-training from diverse images (ICLR’25) Chapter 6: Adaptation 
 in the wild (ECCV’22) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose
  43. Domain adaptive hand state estimation • Bridge domain gap to

    label-scarce domains 1. Propose a self-training method to adapt to unlabeled target data 2. Integrate confidence estimation into self-training for reliable modeling 3. Improve hand pose and mask estimation in novel domains 46 [5] T. Ohkawa+, Domain adaptive hand keypoint and pixel localization in the wild. In ECCV, 2022 DexYCB (studio)
 [Y-W. Chao+, CVPR’21] Ego4D (in-the-wild)
 [K. Grauman+, CVPR’22] Before After
  44. Transfer learning setup 47 Source data (w/ GTs) Target data

    (w/o GTs) • Large training images • High-fidelity annotation • Limited diversity (e.g., DexYCB [Y-W. Chao+, CVPR’21]) • Diverse images • Different capture environments • Lack annotations of pose and mask (e.g., Ego4D [K. Grauman+, CVPR’22])
  45. Mean teacher: Stable self-supervised learning • Consistency training (c.f. pseudo-labeling)

    by separate model design
 [A. Tarvainen+, ICLR’17] • Teacher: exponential moving average (EMA) of the student / Student: trained via a consistency loss 48 • The target data is fed to the teacher, and its augmented version (T_x) to the student; the teacher prediction p_t is mapped through the label transformation T_y to supervise the student prediction p_st • Consistency loss: L_gac = || p_st − T_y(p_t) ||_2 (student pred vs. teacher pred), as sketched below
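A minimal PyTorch-style sketch of mean-teacher consistency training as described above: the student is trained against the teacher under augmentation, and the teacher is an EMA of the student; module and function names are illustrative.

```python
import copy
import torch

def ema_update(teacher, student, momentum=0.999):
    """Teacher weights <- EMA of student weights (no gradients flow to the teacher)."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def consistency_step(student, teacher, target_batch, augment, transform_label, optimizer):
    """augment: data augmentation T_x applied to the input.
    transform_label: the matching label transformation T_y applied to teacher predictions."""
    with torch.no_grad():
        teacher_pred = teacher(target_batch)                     # p_t on the original target data
    student_pred = student(augment(target_batch))                # p_st on the augmented data
    loss = ((student_pred - transform_label(teacher_pred)) ** 2).mean()  # || p_st - T_y(p_t) ||^2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    ema_update(teacher, student)
    return loss.item()

# Usage: teacher = copy.deepcopy(student); teacher.requires_grad_(False)
```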
  46. Uncertainty of consistency training • The quality of the teacher supervision affects

     the training progress 49 • The consistency loss || p_st − T_y(p_t) ||_2 is sensitive to the quality of the teacher prediction p_t
  47. Confidence-aware consistency training • Evaluate confidence

     and weight the consistency loss 50 • Confidence-aware consistency loss: L_cgac = w_t || p_st − T_y(p_t) ||_2, where w_t is a per-sample confidence weight
  48. How to estimate the confidence weight? • Use

     the disagreement of two teacher networks with different parameters 51 • Teacher1 and Teacher2 predict p_t1 and p_t2 on the same target data • Their disagreement ℓ_disagree = || p_t1 − p_t2 ||_2 defines the confidence weight w_t = 2 (1 − sigmoid(|| p_t1 − p_t2 ||_2)), as sketched below
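A minimal sketch of the confidence weight defined above from the disagreement of the two teachers; tensor shapes are assumptions for illustration.

```python
import torch

def confidence_weight(pred_teacher1, pred_teacher2):
    """pred_teacher*: (B, ...) predictions of the two teachers on the same target sample.
    Returns a per-sample weight in (0, 1]: agreement -> close to 1, strong disagreement -> close to 0."""
    disagreement = (pred_teacher1 - pred_teacher2).flatten(1).norm(dim=1)  # ||p_t1 - p_t2||_2
    return 2.0 * (1.0 - torch.sigmoid(disagreement))

# The weight then scales the consistency loss: L_cgac = w_t * || p_st - T_y(p_t) ||_2.
```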
  49. Method overview • Student update from the dual-teacher ensemble and

     estimated confidence 52 • The ensemble teacher prediction p_ens = (p_t1 + p_t2) / 2 is transformed by T_y and supervises the student prediction p_st on the augmented target data (T_x) • Confidence-aware consistency loss: L_cgac = w_t || p_st − T_y(p_ens) ||_2
  50. Quantitative results 54 • Pose metric: PCK↑, Segmentation metric: IoU↑

     Method | DexYCB → HO3D (PCK / IoU / Avg.) | DexYCB → HanCo (PCK / IoU / Avg.) | DexYCB → FPHA (PCK / IoU / Avg.)
 Source only | 33.5 / 49.1 / 41.3 | 27.3 / 41.4 / 34.3 | 14.0 / 24.8 / 19.4
 DANN [Y. Ganin+, ICML’15] | 46.8 / 54.7 / 50.7 | 33.0 / 56.9 / 45.0 | 24.4 / 28.4 / 26.4
 RegDA [J. Jiang+, CVPR’21] | 48.2 / 55.3 / 51.7 | 33.6 / 58.4 / 46.0 | 23.7 / 41.7 / 32.7
 GAC | 47.4 / 56.9 / 52.1 | 37.1 / 58.8 / 47.9 | 37.2 / 33.3 / 35.3
 UMA [M. Cai+, CVPR’20] | 45.3 / 55.0 / 50.2 | 35.6 / 57.7 / 46.6 | 36.8 / 39.3 / 38.0
 Mean-Teacher [A. Tarvainen+, ICLR’17] | 44.4 / 52.3 / 48.3 | 33.8 / 55.1 / 44.4 | 31.3 / 38.4 / 34.9
 Ours (CGAC) | 51.1 / 60.3 / 55.7 | 39.9 / 58.6 / 49.2 | 37.2 / 37.7 / 37.4
  51. Summary • Propose a self-training domain adaptation method for estimating

    hand poses and pixel masks • Consistency training under data augmentation • Confidence estimation by using two networks • Teacher-student update • Inspired by my master's thesis on consensus pseudo-labeling [T. Ohkawa+, IEEE Access'21] • Patent submitted! • Work done during a research stay at CMU! 55
  52. 56 Robust modeling for fine details Data foundation Connecting

    geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) Chapter 5: Pre-training from diverse images (ICLR’25) Chapter 6: Adaptation 
 in the wild (ECCV’22) Chapter 7: Video language description from hand tracklets (WACV’25) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose
  53. Video captioning for hand activities • Describe the procedure of

    egocentric activities with natural language 1. Create a new dataset for egocentric dense video captioning 2. Video captioning model from hand-object tracklets 3. Cross-view transfer learning from web instructional videos 57 [6] T. Ohkawa+, Exo2EgoDVC: Dense video captioning of egocentric procedural activities using web instructional videos. In WACV, 2025
  54. Data collection • New EgoYC2 paired with the YouCook2 (YC2)

    • Re-record 11% of YC2 recipes from 44 users (~43h) • Follow YC2’s vocabulary list and caption granularity 58 YouCook2 (176h of web videos) 
 [L. Zhou+, AAAI’18] • Define video captioning tasks to enhance the expressiveness of actions
 (vs EPIC-KITCHENS [D. Damen+, IJCV’22] / FPHA [G. Garcia-Hernando+, CVPR’18])
  55. Challenges • Egocentric view is always shifting 59 • Web

    videos are rich sources but involve a view gap with various cuts → Identify interaction areas with hand-object tracking → Cross-view transfer learning to ego videos while finding ego-like views
  56. Hand-object tracklets for video representation 60 Hand detection
 (Coarse but

    temporally coherent) Hand-object segmentation
 (Fine but frame-by-frame)
  57. Exo-to-Ego transfer learning • Separate source views to exo and

    ego-like views • Gradual adaptation from exo to ego-like and finally to the ego view 61 Ego view (EgoYC2) Exo view (YC2) Ego-like view (YC2) Decompose
  58. Feature processing 62

     • Source view decomposition with face detection: web videos with scene cuts (YC2) are split into exo shots (face detected: positive) and ego-like shots (negative), alongside ego shots from EgoYC2 • Hand-object (HO) tracking produces video features of hands and objects to represent ego videos
  59. Feature processing 63

     • Source view decomposition with face detection, as in the previous slide • Hand-object tracking represents ego videos with tracklets of the hands, the 1st object, and the 2nd object
  60. View-invariant learning 64

     • Hand-object tracking features pass through a feature converter; a view classifier attached via a gradient reversal layer (GRL) predicts the view labels (exo / ego-like / ego), while a one-stage captioning head outputs segments and captions • Unified view-invariant learning with adversarial adaptation (exo→ego-like→ego) yields view-invariant video features, as sketched below
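A minimal PyTorch sketch of adversarial view-invariant learning with a gradient reversal layer, matching the figure above in spirit; the module names and the lambda value are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass,
    so the feature converter is trained to fool the view classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def view_adversarial_loss(features, view_labels, view_classifier, lam=0.1):
    """features: (B, D) hand-object tracklet features; view_labels: (B,) exo / ego-like / ego."""
    reversed_feat = GradReverse.apply(features, lam)
    logits = view_classifier(reversed_feat)
    return torch.nn.functional.cross_entropy(logits, view_labels)
```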
  61. Qualitative results • Base (w/o VI): Pre-trained on YC2 and

    fine-tuned on EgoYC2 without view-invariant learning • Ours (w/ VI): Base with view-invariant learning in both the pre-training and fine-tuning stages 65 Less irrelevant ingredients
  62. Quantitative results 66 • Metrics: BLEU4, METEOR (text eval), CIDEr (visual-text eval) for dvc_eval; METEOR, CIDEr, tIoU (segment eval) for SODA • EgoYC2 eval set

     Method | Ego video feat. | dvc_eval (BLEU4↑ / METEOR↑ / CIDEr↑) | SODA (METEOR↑ / CIDEr↑ / tIoU↑)
 Source only | Raw | 0.00 / 0.77 / 3.60 | 0.89 / 1.47 / 17.9
 Base (w/o VI) | Raw | 1.54 / 7.03 / 38.1 | 7.03 / 25.2 / 50.5
 Base (w/o VI) | Det | 1.97 / 8.20 / 46.3 | 8.04 / 32.3 / 55.0
 Base (w/o VI) | Det+Seg | 1.68 / 8.91 / 52.5 | 8.91 / 37.3 / 59.0
 +MMD [E. Tzeng+, arXiv’14] | Det+Seg | 1.74 / 8.86 / 50.9 | 8.86 / 37.5 / 58.8
 +DANN [Y. Ganin+, ICML’15] | Det+Seg | 2.05 / 9.01 / 53.1 | 8.97 / 39.1 / 58.6
 Ours (w/ VI) | Det+Seg | 2.66 / 9.19 / 59.0 | 9.27 / 45.2 / 58.1
 Rows compare the ego input representation (Raw / Det / Det+Seg), feature-alignment baselines (source → target), and the proposed adaptation (exo → ego-like → ego)
  63. Summary • Describe egocentric activities in natural language • Egocentric

    video captioning dataset (EgoYC2) aligned to web videos YC2 • Hybrid video representation from hand detection and segmentation • Cross-view knowledge transfer with gradual adaptation • Collaborated with OMRON SINIC X 67
  64. Thesis recap • The thesis establishes foundational understanding of hand

    interactions with precise tracking and interpretation • Highlight contact scenarios (i.e., object & self-contact) with dataset construction • Propose generalizable and adaptable methods that capture fine details
 (i.e., contrastive learning, generative prior, and self-training for adaptation) • Link geometry cues to semantic understanding with video captioning 68
  65. Remaining issues A. Data: What’s the next frontier beyond current

    data capture scenarios? B. Modeling: What properties are lacking in current inference models? C. External Prior: Can we leverage strong cues from recent large models? D. Applications: What new capabilities can emerging models unlock beyond current tasks? 69
  66. Future work i. Expanding data acquisition, sensors, and captured scenarios

    ii. Modeling for temporal context, human modalities, and real-time inference iii. Leveraging generative, foundation, and world models iv. Integrating with common-sense knowledge and reasoning v. Towards social and collaborative interactions vi. Physics-based simulation 70 (items map to A. Data, B. Modeling, C. External Prior, D. Applications)
  67. Publications covered by the thesis (*co-first authors) 1.

    T. Ohkawa, R. Furuta, and Y. Sato. 
 Efficient annotation and learning for 3D hand pose estimation: A survey. IJCV, 2023 2. Z. Fan*, T. Ohkawa*, L. Yang*, ... (20 authors), A. Yao. 
 Benchmarks and challenges in pose estimation for egocentric hand interactions with objects. In ECCV, 2024 3. T. Ohkawa, J. Lee, S. Saito, J. Saragih, F. Prada, Y. Xu, S. Yu, R. Furuta, Y. Sato, and T. Shiratori. 
 Generative modeling of shape-dependent self-contact human poses. In ICCV, 2025 4. N. Lin*, T. Ohkawa*, M. Zhang, Y. Huang, R. Furuta, and Y. Sato. 
 SiMHand: Mining of similar hands for large-scale 3D hand pose pre-training. In ICLR, 2025 5. T. Ohkawa, Y.-J. Li, Q. Fu, R. Furuta, K. M. Kitani, and Y. Sato. 
 Domain adaptive hand keypoint and pixel localization in the wild. In ECCV, 2022 6. T. Ohkawa, T. Yagi, T. Nishimura, R. Furuta, A. Hashimoto, Y. Ushiku, and Y. Sato. 
 Exo2EgoDVC: Dense video captioning of egocentric procedural activities using web instructional videos. In WACV, 2025
 Related publications not covered by the thesis A. T. Ohkawa, K. He, F. Sener, T. Hodan, L. Tran, and C. Keskin. 
 AssemblyHands: Towards egocentric activity understanding via 3D hand pose estimation. In CVPR, 2023 B. T. Banno, T. Ohkawa, R. Liu, R. Furuta, and Y. Sato. 
 AssemblyHands-X: 3D hand-body co-registration for understanding bi-manual human activities. In MIRU, 2025 C. R. Liu, T. Ohkawa, M. Zhang, and Y. Sato. 
 Single-to-dual-view adaptation for egocentric 3D hand pose estimation. In CVPR, 2024 D. T. Ohkawa, T. Yagi, A. Hashimoto, Y. Ushiku, and Y. Sato. 
 Foreground-aware stylization and consensus pseudo-labeling for domain adaptation of first-person hand segmentation. IEEE Access, 2021
 Awards / Fellowships • CVPR EgoVis Distinguished Paper Award’25, Google PhD Fellowship’24, ETH Zurich Leading House Asia’23, 
 MSRA D-CORE’23, JSPS DC1’22, JST ACT-X (’20-’22, Accel.’23) 71