Open-Vocabulary Object Detection

AI Apr. 2025 GO Inc. 本多 浩大 Overview of Open-Vocabulary Object Detection

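AI 2 Today's topic
Many Vision-Language (VL) models have appeared recently. Restricting attention to object detection, what tasks and methods exist? This talk mostly follows the TPAMI survey paper [9] and its repository.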

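AI 3 Object Detection datasets
Datasets keep growing in scale, but their classes are predefined, so detectors trained on them are limited to a closed vocabulary. Example: Open Images.

Dataset      Categories  Train images
Pascal VOC   20          1.5k
COCO         80          118k
LVIS         1203        100k
Objects365   365         608k
OpenImages   600         1.7M

*The images from this point on are the presenter's own.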

AI 4 Ordinary Object Detection
80 predefined classes ['car', 'person', …]
Detector → (x, y, w, h, conf, class=0)

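AI 5 Open-Vocabulary Object Detection
Target: classes that appear in the training images or captions but are absent from the bounding box annotations.
Caption: “A car is parked beside the traffic signs”
Detector → (x, y, w, h, conf, ‘car’), (x, y, w, h, conf, ‘traffic sign’)
Using captions or a VLM to bring in linguistic knowledge beyond the predefined class names makes vision-language alignment possible.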

AI 6 Vocabulary of caption data

Dataset      Categories      Train images
Pascal VOC   20              1.5k
COCO         80              118k
LVIS         1203            100k
Objects365   365             608k
OpenImages   600             1.7M
Flickr30k    44518 phrases   31k
VG caption   110689 phrases  108k

The vocabulary of caption data is orders of magnitude larger.

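AI 7 Vocabulary held by captions and VLMs
Conceptual picture: captions and VLMs bridge the base classes and the target classes.
- Base classes: included in the training data
- Target classes: unknown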

AI 8 Inference example: GLIP [3] pretrained on Objects365
Even a word that is not among the base classes can be detected if its concept is similar to one of them.
prompt text: 'red light . road sign . vehicle '

AI 9 Inference example: GLIP [3] pretrained on mixed datasets (incl. captions)
prompt text: 'people . building . road sign . red light . vehicle . skyscraper '

AI 10 Inference example: GLIP [3] pretrained on mixed datasets (incl. captions)
prompt text: 'traffic light . arrow . vehicle . traffic sign . house . tree'
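The prompts on the inference slides list candidate phrases separated by ' . '. The following is a minimal sketch of building and splitting such a prompt, assuming only that separator convention as read off the slides. `build_prompt` and `split_prompt` are hypothetical helper names, not functions from the GLIP code base.

```python
def build_prompt(phrases):
    """Join candidate phrases into a single ' . '-separated prompt string."""
    return " . ".join(phrases)


def split_prompt(prompt):
    """Recover the list of phrases from a ' . '-separated prompt string."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


# Prompt taken from the Objects365 inference example (note the trailing space).
phrases = split_prompt("red light . road sign . vehicle ")
print(phrases)  # ['red light', 'road sign', 'vehicle']
```

Because the vocabulary is just a string, changing what the detector looks for only requires editing the phrase list, with no retraining of a fixed class head.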

AI 11 Key point
The target of box classification changes from a predefined set of choices to a language representation space.
closed-vocabulary: bbox → 0: car, 1: person, 2: cow, ...
open-vocabulary: bbox → car, vehicle, toyota, animal, cow, ox, horse
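One common way to realize classification in a language representation space is to compare a region embedding with the embedding of each candidate phrase and pick the most similar one. The sketch below assumes cosine similarity and uses made-up 3-dimensional embeddings; in a real system both would come from trained vision and language encoders, and `classify_open_vocab` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_open_vocab(region_embedding, text_embeddings):
    """Return the best-matching phrase and the similarity score of every phrase."""
    scores = {
        phrase: cosine(region_embedding, emb)
        for phrase, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings, invented for illustration only.
TEXT_EMBEDDINGS = {
    "car": [1.0, 0.1, 0.0],
    "vehicle": [0.9, 0.3, 0.0],
    "cow": [0.0, 0.2, 1.0],
}
best, scores = classify_open_vocab([0.8, 0.4, 0.1], TEXT_EMBEDDINGS)
print(best)
```

Adding a new class here means adding one more entry to the phrase dictionary, which is the practical difference from an index-based class list.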

AI 12 Number of papers with “Open-vocabulary” in the title (counted by the presenter)
Annotations on the chart: CLIP, Dall-E

AI 13 List of methods (OVD)

Method             Category                Citations (as of 25/3/30)  Affiliation     Venue      arXiv
OVR-CNN [8]        Region-Aware Training   469                        Snap            CVPR21     https://arxiv.org/abs/2011.10678
MDETR [2]          Region-Aware Training   949                        Meta            ICCV21     https://arxiv.org/abs/2104.12763
OV-DETR            Region-Aware Training   227                        Univ. Nanyang   ECCV22     https://arxiv.org/abs/2203.11876
PB-OVD             Pseudo-Labeling         94                         Salesforce      ECCV22     https://arxiv.org/abs/2111.09452
Detic              Pseudo-Labeling         668                        Meta            ECCV22     https://arxiv.org/abs/2201.02605
OwlViT [6]         Transfer Learning       556                        Google          ECCV22     https://arxiv.org/abs/2205.06230
ViLD [1]           Knowledge Distillation  1033                       Google          ICLR22     https://arxiv.org/abs/2104.13921
DetPro             Knowledge Distillation  380                        Univ. Tsinghua  CVPR22     https://arxiv.org/abs/2203.14940
GLIP [3]           Pseudo-Labeling         1239                       Microsoft       CVPR22     https://arxiv.org/abs/2112.03857
GLIPv2             Pseudo-Labeling         317                        Microsoft       NeurIPS22  https://arxiv.org/abs/2206.05836
GroundingDINO [5]  Pseudo-Labeling         1879                       IDEA            arxiv23    https://arxiv.org/abs/2303.05499
VLDet [4]          Region-Aware Training   104                        ByteDance       ICLR23     https://arxiv.org/abs/2211.14843
F-VLM              Transfer Learning       216                        Google          ICLR23     https://arxiv.org/abs/2209.15639
BARON              Knowledge Distillation  124                        Univ. Nanyang   CVPR23     https://arxiv.org/abs/2302.13996
YOLO-world         Region-Aware Training   222                        Tencent         CVPR24     https://arxiv.org/abs/2401.17270

AI 14 (Closed-vocab) Object Detection (e.g. RetinaNet, YOLO)
Detector → visual features → cls head / box head → bounding boxes (x, y, w, h, conf, class)
No language model is used, so class names carry no linguistic meaning; the detector only classifies each box into one of N classes.
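To make the N-way classification concrete, here is a minimal sketch of what a closed-vocabulary cls head output amounts to. It assumes a softmax over N logits (some detectors use per-class sigmoid scores instead), and the three-entry class list is a hypothetical stand-in for a real one such as the 80 COCO classes.

```python
import math

# Hypothetical class list; the index is all the detector actually predicts.
CLASS_NAMES = ["car", "person", "cow"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_closed_vocab(cls_logits):
    """Return (class_index, confidence) for one box from its N class logits."""
    probs = softmax(cls_logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs[index]


index, conf = classify_closed_vocab([2.0, 0.5, -1.0])
print(index, CLASS_NAMES[index], round(conf, 3))
```

The name lookup happens only after prediction, so renaming "car" to "vehicle" changes nothing about the model, and a class outside the list cannot be predicted at all.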

AI 15 type A & B. Region-Aware Training, Pseudo-Labeling
phrases “white car . traffic sign” → Language Encoder ❄ (e.g. BERT, CLIP) → language features
image → Detector → vision features → box head → bounding boxes
- Aligning vision and language ties image regions to language.
- In some cases the model acts as its own teacher and attaches pseudo bboxes to a large number of images (e.g. MDETR [2], GLIP [3]).
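The pseudo-labeling step can be sketched as a confidence filter over teacher predictions. This is a simplified illustration under stated assumptions: the prediction format (dicts with `bbox`, `phrase`, `conf`) and the 0.5 threshold are hypothetical, and the actual pipelines of the cited methods may filter differently.

```python
def make_pseudo_labels(predictions, score_threshold=0.5):
    """Keep confident teacher predictions as pseudo box annotations.

    predictions: list of dicts with keys 'bbox' (x, y, w, h), 'phrase', 'conf'.
    Returns annotations without the confidence, ready to be used as targets.
    """
    return [
        {"bbox": pred["bbox"], "phrase": pred["phrase"]}
        for pred in predictions
        if pred["conf"] >= score_threshold
    ]


# Hypothetical teacher output for one unlabeled image.
teacher_predictions = [
    {"bbox": (182, 132, 175, 75), "phrase": "white car", "conf": 0.91},
    {"bbox": (344, 7, 155, 245), "phrase": "traffic sign", "conf": 0.32},
]
pseudo_labels = make_pseudo_labels(teacher_predictions)
print(pseudo_labels)
```

The threshold trades label noise against coverage: a higher value keeps fewer but cleaner boxes.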

AI 16 type C. Knowledge Distillation (e.g. ViLD [1])
ROI crop → ❄ VLM Image Encoder (e.g. CLIP, ALIGN) → (cropped) vision features → distillation → Detector vision features
phrases “white car . traffic sign” → Language Encoder ❄ → language features
Detector → box head → bounding boxes
- Knowledge distillation from a pretrained VLM
- Two-stage detection
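The distillation arrow can be read as a loss that pulls the detector's per-region features toward the frozen VLM's embeddings of the corresponding ROI crops. The sketch below assumes a mean L1 distance as that loss; the choice of L1, the function name, and the toy vectors are assumptions for illustration, since the slide only states that distillation is used.

```python
def l1_distillation_loss(student_embeddings, teacher_embeddings):
    """Mean absolute difference between student and teacher region embeddings.

    student_embeddings: per-region vectors from the detector (trainable).
    teacher_embeddings: per-crop vectors from the frozen VLM image encoder.
    Both are lists of equal-length lists of floats, one entry per region.
    """
    total = 0.0
    count = 0
    for student, teacher in zip(student_embeddings, teacher_embeddings):
        for s_value, t_value in zip(student, teacher):
            total += abs(s_value - t_value)
            count += 1
    return total / count


# Toy 2-d embeddings for two regions, invented for illustration.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0], [0.5, 0.5]]
print(l1_distillation_loss(student, teacher))  # 0.25
```

Only the student side would receive gradients in training; the teacher is frozen, which matches the ❄ mark on the VLM image encoder.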

AI 17 type D. Transfer Learning (e.g. Owl-ViT [6])
image → VLM Image Encoder (e.g. CLIP) → vision features → box head (e.g. DETR encoder) → bounding boxes
phrases “white car . traffic sign” → Language Encoder ❄ (e.g. CLIP) → language features
A VLM (e.g. CLIP) is fine-tuned as the backbone, and a detector head is attached to it.

AI 18 Focus: type A Region-Aware Training
How are image regions tied to language?

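AI 19 A-1: Image-level association (e.g. OVR-CNN [8])
caption “A truck is going through a yellow light” → Language Encoder ❄ → language features (“truck”, “yellow light”)
image → Detector → feature map → vision features
Supervision is weak and given only at the image level; statistically, related regions gradually become loosely tied to the words.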
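Image-level weak supervision can be sketched as follows: score an image-caption pair by letting each word pick its best-matching region, then train so that the true caption scores higher than other captions. This is a simplified variant under stated assumptions (max over regions, softmax cross-entropy over captions); the cited method's exact grounding loss may differ, and all names and vectors here are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def image_caption_score(region_feats, word_feats):
    """Average, over words, of each word's best similarity to any region."""
    best_per_word = [max(dot(word, region) for region in region_feats)
                     for word in word_feats]
    return sum(best_per_word) / len(best_per_word)


def contrastive_loss(scores, positive_index):
    """Softmax cross-entropy over candidate captions for one image."""
    peak = max(scores)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum_exp - scores[positive_index]


# Toy features: two image regions, one matching and one mismatching caption.
regions = [[1.0, 0.0], [0.0, 1.0]]
matching_caption = [[1.0, 0.0]]      # a word aligned with the first region
mismatched_caption = [[-1.0, 0.0]]   # a word aligned with no region
scores = [image_caption_score(regions, matching_caption),
          image_caption_score(regions, mismatched_caption)]
loss = contrastive_loss(scores, positive_index=0)
print(scores, loss)
```

No box-to-word annotation appears anywhere in this loss; the region that a word attends to emerges only because it helps separate matching from mismatching captions.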

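AI 20 A-2: Matching regions and phrases (e.g. VLDet [4])
caption “A truck is going through a yellow light” → Language Encoder ❄ → “truck”, “yellow light” → Hungarian matching with regions
Training proceeds while matching phrases to regions (similar to how DETR is trained). Similar image and language concepts are learned automatically.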
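The matching step is a one-to-one assignment that maximizes total phrase-region similarity. To stay dependency-free, the sketch below finds that optimum by brute force over permutations, which gives the same result as the Hungarian algorithm on small inputs but does not scale; in practice a solver such as `scipy.optimize.linear_sum_assignment` would be used. The similarity values are invented for illustration.

```python
from itertools import permutations


def match_phrases_to_regions(similarity):
    """Optimal one-to-one assignment of phrases to regions.

    similarity[i][j] is the score between phrase i and region j.
    Requires len(phrases) <= len(regions). Brute force, for small inputs only.
    Returns a list of (phrase_index, region_index) pairs.
    """
    n_phrases = len(similarity)
    n_regions = len(similarity[0])
    best_total = float("-inf")
    best_regions = None
    for regions in permutations(range(n_regions), n_phrases):
        total = sum(similarity[i][j] for i, j in enumerate(regions))
        if total > best_total:
            best_total, best_regions = total, regions
    return list(enumerate(best_regions))


# Rows: "truck", "yellow light". Columns: three candidate regions.
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
]
print(match_phrases_to_regions(similarity))  # [(0, 1), (1, 0)]
```

In this example a greedy choice would give region 0 to "truck" and leave "yellow light" with a poor match; the global assignment instead gives region 1 to "truck" so that "yellow light" can take region 0.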

AI 21 A-3: Using a Visual Grounding Dataset
GoldG dataset [2]: merges Visual Genome and Flickr datasets and reliably ties a phrase to each bbox at the annotation level. It is used by GLIP, Grounding DINO, YOLO-world, and others.
Annotation example:
'caption': 'a silver car on the road. second car in line waiting at traffic light. corner of rounded building. Are there any buses on the street? (…) darkness on night sky.',
box 0: {'area': 13125.0, 'iscrowd': 0, 'image_id': 486491, 'category_id': 1, 'id': 2497354, 'bbox': [182, 132, 175, 75], 'tokens_positive': [[0, 1], [2, 8], [9, 12]]},
box 1: {'area': 37975.0, 'iscrowd': 0, 'image_id': 486491, 'category_id': 1, 'id': 2497355, 'bbox': [344, 7, 155, 245], 'tokens_positive': [[26, 32], [33, 36]]}, …
Reference: GoldG dataset preparation
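In the annotation example, each `tokens_positive` entry reads as a [start, end) character span into the caption. A short sketch that recovers the phrase tied to each box, using only the first part of the caption and the two boxes shown on the slide; `phrase_for_box` is a hypothetical helper name, and the half-open span interpretation is inferred from the example values.

```python
def phrase_for_box(caption, tokens_positive):
    """Join the caption substrings referenced by [start, end) character spans."""
    return " ".join(caption[start:end] for start, end in tokens_positive)


# Beginning of the caption from the slide (the rest is omitted here).
caption = "a silver car on the road. second car in line"

boxes = [
    {"bbox": [182, 132, 175, 75], "tokens_positive": [[0, 1], [2, 8], [9, 12]]},
    {"bbox": [344, 7, 155, 245], "tokens_positive": [[26, 32], [33, 36]]},
]

for box in boxes:
    print(box["bbox"], "->", phrase_for_box(caption, box["tokens_positive"]))
# [182, 132, 175, 75] -> a silver car
# [344, 7, 155, 245] -> second car
```

Note that `category_id` is 1 for both boxes: the class information lives in the caption spans rather than in a fixed category list.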

AI 22 A-3: loss example (GLIP [3])
Language tokens “tree . white car .” → N token features
Visual tokens or flattened feature map → HW region features
Dot product → HW × N matrix → focal loss
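The loss on this slide can be sketched in two steps: build the HW × N alignment matrix as dot products between region features and token features, then apply a sigmoid focal loss against a binary target matrix that marks which token belongs to which location. The alpha and gamma values are the common focal loss defaults, assumed here rather than taken from the slide, and the features are toy values.

```python
import math


def alignment_logits(region_feats, token_feats):
    """HW x N matrix of dot products between region and token features."""
    return [[sum(r * t for r, t in zip(region, token)) for token in token_feats]
            for region in region_feats]


def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sum of the per-element sigmoid focal loss over the HW x N matrix."""
    total = 0.0
    for logit_row, target_row in zip(logits, targets):
        for x, y in zip(logit_row, target_row):
            p = 1.0 / (1.0 + math.exp(-x))
            p_t = p if y == 1 else 1.0 - p          # probability of the true label
            alpha_t = alpha if y == 1 else 1.0 - alpha
            total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total


# Toy case: HW = 2 locations, N = 2 tokens.
regions = [[3.0, 0.0], [0.0, 3.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
logits = alignment_logits(regions, tokens)

aligned_targets = [[1, 0], [0, 1]]   # each location paired with its own token
swapped_targets = [[0, 1], [1, 0]]   # deliberately wrong pairing
print(sigmoid_focal_loss(logits, aligned_targets),
      sigmoid_focal_loss(logits, swapped_targets))
```

The focal term `(1 - p_t) ** gamma` down-weights the many easy negatives in the matrix, which matters because most location-token pairs do not correspond.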

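AI 23 How to read a model zoo. Example: mmdet3.3 [link], Grounding DINO [5]
Datasets used for training:
- ordinary detection datasets
- datasets in which captions are tied to boxes
- datasets whose boxes were pseudo-labeled with GLIP
Evaluation settings: not fine-tuned on the target (COCO) / fine-tuned / not pretrained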

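AI 24 Summary
- Open-vocabulary object detection is the area of object detection that gives linguistic meaning to box classification.
- It leverages the knowledge in captions and pretrained VLMs.
- Tying language to image regions is the key. In some cases, datasets in which this association is already annotated are used.
- Future direction: continued use of multimodal LLMs (GPT, Qwen…) and foundation models (DINO, CLIP, SAM, diffusion models).
- What practical applications might there be?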

AI 25 References

AI 26 References
