Donut + SegFormer ~ Enhancing Donut for position prediction

MORI Masakazu
• Joined as an ML Engineer in 2022-03
• Kaggle Competitions Master
• https://kaggle.com/marisakamozz

AGENDA
• Introduction
• Proposed Method
• Experiment Settings
• Experiment Results
• Conclusion
• Appendix

Introduction

What is AI-OCR?
AI-OCR is the task of extracting information, such as billing amounts, from documents like invoices. Because these documents come in many different formats, the information cannot be extracted with a fixed set of rules.
Image cited from “DocILE Benchmark for Document Information Localization and Extraction” by Štěpán Šimsa et al., 3 May 2023. https://arxiv.org/abs/2302.05658 (accessed on 23 Aug 2024)

Representative Existing Methods
• OCR → Named Entity Recognition (NER)
  ◦ Convert the document to text using OCR
  ◦ Identify the target phrases within the text (NER)
• Object Detection → OCR
  ◦ Identify the position in the document where the target phrases are written (Object Detection)
  ◦ Convert the text at that position using OCR
• End-to-End
  ◦ Achieve this with a single model

Donut: Document Understanding Transformer
• An end-to-end model proposed by NAVER Corporation in 2021
• Performs image processing and OCR in a single pass, end to end
Image cited from “OCR-free Document Understanding Transformer” by Geewook Kim et al., 6 Oct 2022. https://arxiv.org/abs/2111.15664 (accessed on 23 Aug 2024)

Issues with Donut
Ground Truth: “1,000.00”
Which one is the correct “1,000.00”?
During training, Donut is given only the text to be extracted, not its position. Therefore, if the same text appears more than once in a document, Donut cannot learn which position it should read from.
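The ambiguity described on this slide can be illustrated with a small sketch. The `Word` class, the box coordinates, and the invoice contents below are hypothetical and are not taken from the deck. The point is only that a text-only label matches more than one location.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One word on the page with a hypothetical (left, top, right, bottom) box."""
    text: str
    box: Tuple[float, float, float, float]


def find_candidates(words: List[Word], target: str) -> List[Word]:
    """Return every word whose text equals the ground-truth string."""
    return [w for w in words if w.text == target]


# Hypothetical invoice: the same amount appears as the subtotal and as the total.
words = [
    Word("Subtotal", (0.10, 0.70, 0.25, 0.73)),
    Word("1,000.00", (0.80, 0.70, 0.92, 0.73)),
    Word("Tax", (0.10, 0.75, 0.16, 0.78)),
    Word("0.00", (0.85, 0.75, 0.92, 0.78)),
    Word("Total", (0.10, 0.80, 0.20, 0.83)),
    Word("1,000.00", (0.80, 0.80, 0.92, 0.83)),
]

candidates = find_candidates(words, "1,000.00")
for w in candidates:
    # The label "1,000.00" alone cannot say which of these boxes is the target.
    print(w.text, w.box)
```

With only the string as supervision, both boxes are equally valid matches. A position label would be needed to tell them apart.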

Hypothesis
Could Donut be improved by using position information?

Proposed Method

Donut + SegFormer
[Architecture diagram. Components: 🍩 Donut Encoder, 🍩 Donut Decoder, SegFormer Decoder. Labels: Image, Prompt, Extracted Text, Segmentation Map. Example token sequence: <parsing><date>2024-09-20</date></parsing>]

What is SegFormer?
“SegFormer is a model for semantic segmentation introduced by Xie et al. in 2021. It has a hierarchical Transformer encoder that doesn't use positional encodings (in contrast to ViT) and a simple multi-layer perceptron decoder. SegFormer achieves state-of-the-art performance on multiple common datasets.”
Quoted from “Fine-Tune a Semantic Segmentation Model with a Custom Dataset” by Tobias Cornille and Niels Rogge.

SegFormer Decoder
• Use only the decoder
• 2 MLP layers
Image cited from “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers” by Enze Xie et al., 28 Oct 2021. https://arxiv.org/abs/2105.15203 (accessed on 23 Aug 2024)
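The deck does not say how the training targets for the segmentation map are built. One plausible approach, shown here as an assumption only, is to rasterize each annotated bounding box into a binary mask. The relative `(left, top, right, bottom)` box format, the grid size, and the function name `box_to_mask` are all hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom), relative to page size (0..1).
Box = Tuple[float, float, float, float]


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Rasterize one relative bounding box into a height x width binary mask."""
    left, top, right, bottom = box
    # Convert relative coordinates to index ranges, clamped to the grid.
    x0 = max(0, math.floor(left * width))
    y0 = max(0, math.floor(top * height))
    x1 = min(width, math.ceil(right * width))
    y1 = min(height, math.ceil(bottom * height))
    mask = [[0] * width for _ in range(height)]
    for y in range(y0, y1):
        for x in range(x0, x1):
            mask[y][x] = 1
    return mask


# Tiny 4 x 8 grid so the result is easy to inspect.
mask = box_to_mask((0.5, 0.25, 1.0, 0.5), height=4, width=8)
for row in mask:
    print("".join(str(v) for v in row))
```

A mask like this could serve as the per-field target that a segmentation decoder is trained to reproduce. The actual target construction in the proposed method may differ.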

Experiment Settings

Dataset Used in This Experiment
• DocILE: Document Information Localization and Extraction
• https://docile.rossum.ai/
• The dataset used in a competition held at the ICDAR 2023 conference
• Includes images, the text contained in those images, and the positions of that text
Image cited from “DocILE Benchmark for Document Information Localization and Extraction” by Štěpán Šimsa et al., 3 May 2023. https://arxiv.org/abs/2302.05658 (accessed on 23 Aug 2024)

Preprocessing
Only the 8 most frequently appearing field types were used.
• Training Dataset: 5,181 images
• Evaluation Dataset: 501 images
Top 8 Field Types (chart): A single document can contain multiple field types, so the total number of fields exceeds the number of documents (5,181).
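The filtering step above can be sketched as a frequency count followed by a filter. The data layout (a list of `(field_type, text)` pairs per document), the sample values, and the name `rare_field` are hypothetical. The deck keeps the top 8 field types, while this toy example keeps the top 3 so the data stays small.

```python
from collections import Counter
from typing import List, Tuple

Field = Tuple[str, str]  # hypothetical layout: (field_type, text)


def top_k_field_types(documents: List[List[Field]], k: int) -> List[str]:
    """Return the k field types that occur most often across all documents."""
    counts = Counter(field_type for doc in documents for field_type, _ in doc)
    return [field_type for field_type, _ in counts.most_common(k)]


def keep_fields(documents: List[List[Field]], allowed: List[str]) -> List[List[Field]]:
    """Drop every field whose type is not in the allowed list."""
    allowed_set = set(allowed)
    return [[f for f in doc if f[0] in allowed_set] for doc in documents]


# Toy annotations; "rare_field" is a made-up infrequent field type.
documents = [
    [("vendor_name", "ACME"), ("date_issue", "2024-09-20"), ("amount_due", "1,000.00")],
    [("vendor_name", "Foo Inc."), ("date_issue", "2024-01-05"), ("rare_field", "x")],
    [("vendor_name", "Bar LLC"), ("amount_due", "250.00")],
]

top_types = top_k_field_types(documents, k=3)
filtered = keep_fields(documents, top_types)
print(top_types)
print(filtered[1])
```

Because one document contributes several fields, the per-type counts add up to more than the number of documents.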

Training Configurations
• No image augmentation
• Optimizer:
  ◦ AdamW(lr=1e-5)
• Scheduler:
  ◦ ReduceLROnPlateau(patience=5, factor=0.2)
• Batch size: 1
• Early Stopping:
  ◦ 5% (259 images) of the training dataset was used for validation
  ◦ Validate every 1000 steps
  ◦ patience=10
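The early stopping rule can be sketched in plain Python. The `EarlyStopping` class and the loss values are hypothetical, and treating validation loss as the monitored metric is an assumption, since the deck does not name the metric.

```python
from typing import List, Optional


class EarlyStopping:
    """Stop once the monitored value has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# 5% of the 5,181 training images are held out for validation.
n_val = round(5181 * 0.05)
print(n_val)

# Hypothetical validation losses, one per evaluation (every 1000 steps).
losses: List[float] = [1.0, 0.8, 0.7] + [0.7] * 20

stopper = EarlyStopping(patience=10)
stopped_at_step: Optional[int] = None
for eval_index, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at_step = eval_index * 1000
        break
print(stopped_at_step)
```

With validation every 1000 steps and patience=10, training continues for up to 10,000 steps after the last improvement before it stops.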

Comparison Target
• Existing Method
  ◦ Donut
  ◦ Uses the pre-trained model available on the Hugging Face Hub (naver-clova-ix/donut-base)
• Proposed Method
  ◦ Donut + SegFormer
  ◦ The Donut Encoder and Donut Decoder use the pre-trained model from the Hugging Face Hub (naver-clova-ix/donut-base)
  ◦ The SegFormer Decoder is trained from scratch

Machine Used
• AWS EC2 instance
  ◦ g4dn.xlarge
  ◦ 4 vCPU, 16GB memory
  ◦ NVIDIA T4
• Time to train models
  ◦ Existing method (Donut)
    ▪ Approx. 7.5 hours
  ◦ Proposed method (Donut + SegFormer)
    ▪ Approx. 8.5 hours

Experiment Results

Experiment Result Summary (All Fields)
The proposed method was confirmed to increase the F1 score. The improvement is particularly large when evaluated by string similarity.
similarity = 1 - normalized Levenshtein distance (higher is better)
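The similarity measure can be sketched as follows. The deck does not state how the Levenshtein distance is normalized, so dividing by the length of the longer string is an assumption here. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(prediction: str, truth: str) -> float:
    """1 - normalized Levenshtein distance (assumed: normalized by the longer length)."""
    longest = max(len(prediction), len(truth))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(prediction, truth) / longest


print(similarity("1,000.00", "1,000.00"))
print(similarity("1,000.80", "1,000.00"))
```

Unlike exact-match scoring, this measure gives partial credit to a prediction that differs from the ground truth by only a few characters.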

Experiment Results for Each Field (F1 score)
Donut achieves higher F1 scores for frequently appearing fields such as vendor names and for long strings such as addresses.

Experiment Results for Each Field (Similarity-Based F1 score)
When evaluated by string similarity, the proposed method achieved scores comparable to or much higher than the existing method for almost all fields.

Donut + SegFormer Position Predictions
[Figures: predicted positions for vendor_name, vendor_address, date_issue, customer_billing_name, document_id, amount_due, amount_total_gross, customer_billing_address]

Conclusion

Adding the SegFormer Decoder to Donut was confirmed to make effective use of position information.
The proposed method not only indicates where the extracted text is located but also improves accuracy.
On the other hand, it was found to be less effective when the position is easy to identify or when the text is long.
The proposed method is patent-pending.

Appendix

Appendix: Citation
Donut
Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park (2021). OCR-free Document Understanding Transformer. https://arxiv.org/abs/2111.15664
SegFormer
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo (2021). SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. https://arxiv.org/abs/2105.15203
DocILE
Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, Dimosthenis Karatzas (2023). DocILE Benchmark for Document Information Localization and Extraction. https://arxiv.org/abs/2302.05658
