

ICCV2023: Constructing Image Text Pair Dataset from Books

- Constructing Image Text Pair Dataset from Books
- DataComp Workshop: Towards the Next Generation of Computer Vision Datasets
- October 3, 2023 - ICCV, Paris
- Yamato OKAMOTO

Yamato.OKAMOTO

October 03, 2023

Transcript

  1. Constructing Image Text Pair Dataset from Books
     Yamato OKAMOTO (NAVER Cloud Corp., WORKS MOBILE JAPAN Corp.)
     Haruto Toyonaga (Doshisha University)
     Yoshihisa Ijiri (LINE Corporation)
     Hirokatsu Kataoka (LINE Corporation)
     DataComp Workshop: Towards the Next Generation of Computer Vision Datasets
     October 3, 2023 - ICCV, Paris
  2. Motivation: Utilize Digital Archiving
     • Books record historical, cultural, and customary activities.
     • To protect these valuable books, digital archiving is now expanding widely.
     • As the next step after archiving, we should discuss how to exploit the archives.
  3. Purpose: To Make AI Acquire Knowledge from Books
     • Digitally archived books can be considered multi-modal data.
     • As a novel way to utilize digital archives, we constructed an image-text pair dataset from them autonomously.
     • We then trained machine learning models on this dataset to acquire knowledge from books, just as humans read books.
  4. Developed: Dataset Construction Pipeline
     1. OCR (detect and recognize text)
     2. Layout analysis (extract only caption text)
     3. Object detection (detect illustration areas)
     4. Nearest-neighbor matching of captions and illustrations (see the sketch below)
     (※ Each model was trained on an annotated book-image dataset.)
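The slides do not include an implementation, but the final matching step can be sketched in a few lines. The following is a minimal sketch assuming captions and illustrations are given as bounding boxes; the function and variable names are hypothetical, not taken from the authors' code.

```python
import numpy as np

def box_center(box):
    # box = (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def match_captions_to_illustrations(captions, illustrations):
    """Pair each illustration with its nearest caption box.

    captions:      list of (text, box) tuples from OCR + layout analysis
    illustrations: list of boxes from the object detector
    Returns a list of (illustration_box, caption_text) pairs.
    """
    pairs = []
    for ill_box in illustrations:
        ill_center = box_center(ill_box)
        # nearest caption by Euclidean distance between box centers
        distances = [np.linalg.norm(box_center(b) - ill_center) for _, b in captions]
        if distances:
            text, _ = captions[int(np.argmin(distances))]
            pairs.append((ill_box, text))
    return pairs
```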
  5. Experiments: Dataset Construction
     • Applied our pipeline to old Japanese photo books:
       - from the period 1868 to 1945
       - 175 photo books (containing a total of 12,640 book images)
       - photographs of locations or buildings from almost every prefecture in Japan
     • Ultimately, we obtained 9,516 image-text pairs.
  6. Experiments: Image-Text Retrieval
     Setting
     • We constructed a cross-modal retrieval system using CLIP.
     • We initialized from ViT-B/32 weights and trained CLIP on the constructed dataset (see the retrieval sketch below).
     Result
     • Training enhanced its retrieval performance, especially in the old Japanese domain.
     • This suggests that digital archives provide CLIP with new domain-specific knowledge.
     • The trained CLIP retrieved items based on specific Japanese location or building names.
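As an illustration of the retrieval setting, here is a minimal text-to-image retrieval sketch built on the open_clip library with a ViT-B/32 backbone. The slide does not specify the training framework or data layout, so the image file names and the example query below are hypothetical placeholders, and the pretrained weights stand in for the authors' fine-tuned checkpoint.

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

# Encode a small gallery of book illustrations (hypothetical file names).
image_paths = ["page_001.png", "page_002.png"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)

with torch.no_grad():
    image_feats = model.encode_image(images)
    image_feats /= image_feats.norm(dim=-1, keepdim=True)

    # Query with a location or building name, as described on the slide.
    text = tokenizer(["A photograph of Himeji Castle"]).to(device)
    text_feats = model.encode_text(text)
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

# Cosine similarity ranks the gallery images for the text query.
similarity = (text_feats @ image_feats.T).squeeze(0)
best = similarity.argmax().item()
print(f"Best match: {image_paths[best]} (score={similarity[best]:.3f})")
```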
  7. Experiments: Insight Extraction
     Setting
     • Trained a city classification model on the constructed dataset.
     • Analyzing the model provides us with new insights.
     Result
     • t-SNE visualization showed which cities are unique and which are similar (a visualization sketch follows below).
     • Grad-CAM visualization showed which elements likely represent city identities.
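The following is a minimal sketch of the t-SNE analysis: project per-image features from the city classifier into 2-D and color them by city label. The feature extraction step and the file names are hypothetical; the slide does not specify how the features were obtained.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (N, D) penultimate-layer activations of the city classifier
# city_ids: (N,) integer city label per image (hypothetical saved arrays)
features = np.load("city_features.npy")
city_ids = np.load("city_labels.npy")

# Reduce the features to two dimensions for visualization.
embedding = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)

plt.figure(figsize=(6, 6))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=city_ids, s=5, cmap="tab20")
plt.colorbar(scatter, label="city id")
plt.title("t-SNE of city-classifier features")
plt.savefig("tsne_cities.png", dpi=150)
```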
  8. Conclusion
     • We proposed a new approach for leveraging digital archives by creating an image-text pair dataset.
     • We demonstrated the effectiveness of model training on this dataset.
     • This is the first step toward machine learning that acquires knowledge autonomously, just as humans read books.
     All book images presented in this document are reproduced from the NDL-DocL dataset. https://github.com/ndl-lab/layout-dataset