

ICCV2023: Constructing Image Text Pair Dataset from Books

- Constructing Image Text Pair Dataset from Books
- DataComp Workshop: Towards the Next Generation of Computer Vision Datasets
- October 3, 2023 - ICCV, Paris
- Yamato OKAMOTO

Yamato.OKAMOTO

October 03, 2023

Transcript

  1. Constructing Image Text Pair Dataset from Books
     Yamato OKAMOTO (NAVER Cloud Corp., WORKS MOBILE JAPAN Corp.)
     Haruto Toyonaga (Doshisha University)
     Yoshihisa Ijiri (LINE Corporation)
     Hirokatsu Kataoka (LINE Corporation)
     DataComp Workshop: Towards the Next Generation of Computer Vision Datasets
     October 3, 2023 - ICCV, Paris
  2. Motivation: Utilize Digital Archiving
     • Books record historical, cultural, and customary activities.
     • To protect these valuable books, digital archiving is now expanding widely.
     • As the next step after archiving, we should discuss how to exploit the archives.
  3. Purpose: To Make AI Acquire Knowledge from Books
     • Digitally archived books can be considered multi-modal data.
     • As a novel way to utilize digital archives, we constructed an image-text pair dataset from them autonomously.
     • We then trained machine learning models on this dataset to acquire knowledge from books, just as humans read books.
  4. Developed: Dataset Construction Pipeline
     1. OCR (detect and recognize text)
     2. Layout analysis (extract only caption text)
     3. Object detection (detect illustration areas)
     4. Nearest-neighbor matching of captions and illustrations (see the sketch below)
     (※ Each model was trained on an annotated book-image dataset.)
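The slides do not include an implementation, but the final matching step can be sketched in a few lines. The following is a minimal sketch assuming captions and illustrations are given as bounding boxes; the function and variable names are hypothetical, not taken from the authors' code.

```python
import numpy as np

def box_center(box):
    # box = (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def match_captions_to_illustrations(captions, illustrations):
    """Pair each illustration with its nearest caption box.

    captions:      list of (text, box) tuples from OCR + layout analysis
    illustrations: list of boxes from the object detector
    Returns a list of (illustration_box, caption_text) pairs.
    """
    pairs = []
    for ill_box in illustrations:
        ill_center = box_center(ill_box)
        # nearest caption by Euclidean distance between box centers
        distances = [np.linalg.norm(box_center(b) - ill_center) for _, b in captions]
        if distances:
            text, _ = captions[int(np.argmin(distances))]
            pairs.append((ill_box, text))
    return pairs
```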
  5. Experiments: Dataset Construction
     • Applied our pipeline to old Japanese photo books:
       - from the period 1868 to 1945
       - 175 photo books (containing a total of 12,640 book images)
       - photographs of locations or buildings from almost every prefecture in Japan
     • Ultimately, we obtained 9,516 image-text pairs.
  6. Experiments: Image-Text Retrieval
     Setting
     • We constructed a cross-modal retrieval system using CLIP.
     • We initialized from ViT-B/32 weights and trained CLIP on the constructed dataset (see the retrieval sketch below).
     Result
     • Training enhanced its retrieval performance, especially in the old Japanese domain.
     • This suggests that digital archives provide CLIP with new domain-specific knowledge.
     • The trained CLIP retrieved items based on specific Japanese location or building names.
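As an illustration of the retrieval setting, here is a minimal text-to-image retrieval sketch built on the open_clip library with a ViT-B/32 backbone. The slide does not specify the training framework or data layout, so the image file names and the example query below are hypothetical placeholders, and the pretrained weights stand in for the authors' fine-tuned checkpoint.

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

# Encode a small gallery of book illustrations (hypothetical file names).
image_paths = ["page_001.png", "page_002.png"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)

with torch.no_grad():
    image_feats = model.encode_image(images)
    image_feats /= image_feats.norm(dim=-1, keepdim=True)

    # Query with a location or building name, as described on the slide.
    text = tokenizer(["A photograph of Himeji Castle"]).to(device)
    text_feats = model.encode_text(text)
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

# Cosine similarity ranks the gallery images for the text query.
similarity = (text_feats @ image_feats.T).squeeze(0)
best = similarity.argmax().item()
print(f"Best match: {image_paths[best]} (score={similarity[best]:.3f})")
```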
  7. Experiments: Insight Extraction
     Setting
     • Trained a city classification model on the constructed dataset.
     • Analyzing the model provides us with new insights.
     Result
     • t-SNE visualization showed which cities are unique and which are similar (a visualization sketch follows below).
     • Grad-CAM visualization showed which elements likely represent city identities.
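The following is a minimal sketch of the t-SNE analysis: project per-image features from the city classifier into 2-D and color them by city label. The feature extraction step and the file names are hypothetical; the slide does not specify how the features were obtained.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (N, D) penultimate-layer activations of the city classifier
# city_ids: (N,) integer city label per image (hypothetical saved arrays)
features = np.load("city_features.npy")
city_ids = np.load("city_labels.npy")

# Reduce the features to two dimensions for visualization.
embedding = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)

plt.figure(figsize=(6, 6))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=city_ids, s=5, cmap="tab20")
plt.colorbar(scatter, label="city id")
plt.title("t-SNE of city-classifier features")
plt.savefig("tsne_cities.png", dpi=150)
```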
  8. Conclusion
     • We proposed a new approach for leveraging digital archives by creating an image-text pair dataset.
     • We demonstrated the effectiveness of model training on this dataset.
     • This is the first step toward machine learning that acquires knowledge autonomously, just as humans read books.
     All book images presented in this document are reproduced from the NDL-DocL dataset. https://github.com/ndl-lab/layout-dataset