
STMK24 NTCIR18 U4 Table QA Submission

eida
June 11, 2025

Slides used for the oral presentation at NTCIR-18
https://research.nii.ac.jp/ntcir/ntcir-18/program.html


Transcript

  1. Motivation
     • U4 Task: QA over clean HTML tables.
     • Real-world tables: often in image or PDF format.
     • Challenge: RAG needs robust table handling across formats.
     • Our approach: render HTML → solve as multimodal QA (a rendering sketch follows this item).
     • Goal: develop a practical method usable in business RAG.
     • Clean HTML structure (given in the U4 task) vs. real-world input format (our challenge), e.g.:
       <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td><td>第68期</td><td>第69期</td></tr><tr><td colspan="2">決算年月</td><td>2016年1月</td><td>2017年1月</td><td>2018年1月</td><td>2019年1月</td>…</table>
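     A minimal sketch of the HTML-to-image rendering step, assuming a headless browser via Playwright; the renderer choice and function name are ours, not part of the submission:

        # Render an HTML table string to a PNG so it can be treated as multimodal QA input.
        # Assumes Playwright with Chromium is installed; any HTML-to-image renderer would do.
        from playwright.sync_api import sync_playwright

        def render_table_to_png(table_html: str, out_path: str = "table.png") -> None:
            with sync_playwright() as p:
                browser = p.chromium.launch()
                page = browser.new_page()
                page.set_content(f"<html><body>{table_html}</body></html>")
                # Screenshot only the <table> element so the image is tightly cropped.
                page.locator("table").screenshot(path=out_path)
                browser.close()

        render_table_to_png('<table><tr><td colspan="2">回次</td><td>第65期</td></tr></table>')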
  2. Representation of Table Structures
     • We focused on PDF tables and used image, text, and layout to tackle Table QA.
     • Highly structured formats: HTML, JSON, Markdown (e.g., <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td>…</table>).
     • Low-structure / unstructured formats: image, or PDF / image + OCR; this covers most business documents used for RAG.
  3. Key Points of Our Method
     1. Our strategy: predicting the cell-ID, not the cell value.
     2. Our input: fusing three modalities for precision: image, text, and layout.
  4. Strategy 1: Cell-ID Embedding
     • We focus on predicting cell-IDs to bypass the LVLM's weakness in math.
     • Cell-IDs (e.g., "r3c1") are rendered directly onto the table image as visual objects (a sketch of this overlay follows this item).
     • This transforms the QA task into a more straightforward visual recognition problem.
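     An illustrative sketch of the Cell-ID overlay, assuming each cell's pixel bounding box is already known (e.g., from the renderer); the function and box format are hypothetical:

        # Draw cell-IDs such as "r3c1" onto the rendered table image so the LVLM can
        # answer with an ID instead of computing or transcribing a numeric value.
        from PIL import Image, ImageDraw

        def overlay_cell_ids(image_path, cell_boxes, out_path="table_with_ids.png"):
            img = Image.open(image_path).convert("RGB")
            draw = ImageDraw.Draw(img)
            for cell_id, (x0, y0, x1, y1) in cell_boxes.items():
                draw.rectangle((x0, y0, x1, y1), outline="red")   # mark the cell
                draw.text((x0 + 2, y0 + 2), cell_id, fill="red")  # e.g. "r3c1"
            img.save(out_path)

        overlay_cell_ids("table.png", {"r1c1": (0, 0, 120, 30), "r1c2": (120, 0, 240, 30)})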
  5. Strategy 2: Layout and Text Modalities
     • We use two additional modalities: text and layout.
     • Layout: the bounding-box coordinates for each text block (an example input record follows this item).
     • Benefit: avoids complex table-structure reconstruction.
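     A minimal sketch of how a text block and its layout might be paired, assuming bounding boxes normalized to [0, 1]; the field names are illustrative:

        # One record per text block: the text plus its (x0, y0, x1, y1) bounding box,
        # taken directly from the PDF/OCR output, with no table-structure reconstruction.
        text_blocks = [
            {"text": "回次",      "bbox": (0.05, 0.10, 0.20, 0.14)},
            {"text": "第65期",    "bbox": (0.22, 0.10, 0.35, 0.14)},
            {"text": "2016年1月", "bbox": (0.22, 0.16, 0.35, 0.20)},
        ]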
  6. Layout-Aware LVLM Architecture
     • Layout coordinates are encoded into features via an MLP.
     • Each text token is fused with its corresponding layout feature.
     • These text-layout pairs and the image features form the final input to the LLM (a fusion sketch follows this item).
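     A minimal PyTorch sketch of that fusion step, assuming 4-d normalized boxes and additive fusion with the token embeddings; the hidden size and module names are assumptions:

        import torch
        import torch.nn as nn

        class LayoutFusion(nn.Module):
            """Encode (x0, y0, x1, y1) boxes with an MLP and fuse them into text token embeddings."""
            def __init__(self, hidden_dim: int = 4096):
                super().__init__()
                self.layout_mlp = nn.Sequential(
                    nn.Linear(4, hidden_dim),
                    nn.GELU(),
                    nn.Linear(hidden_dim, hidden_dim),
                )

            def forward(self, token_embeds: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
                # token_embeds: (batch, seq_len, hidden_dim); bboxes: (batch, seq_len, 4)
                return token_embeds + self.layout_mlp(bboxes)

     The fused text-layout embeddings would then be concatenated with the image features before entering the language model, as described above.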
  7. Experiments
     1. General setup
        • Base model: LLaVA-OneVision-7B
        • Fine-tuning: all models were fine-tuned on the Table QA dataset for 3 epochs with a learning rate of 1e-5.
     2. Ablation study conditions: to analyze the impact of each modality, we compared the following four settings (a configuration sketch follows this item).
        • I+T+L: training with image, text, and layout
        • T+L: training with text and layout
        • I+T: training with image and text
        • I: training with image, w/o pre-training
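     An illustrative configuration grid for the four ablation settings listed above; the flag names are assumptions, not the actual training script:

        base_config = {
            "base_model": "LLaVA-OneVision-7B",  # as reported; the exact checkpoint id is an assumption
            "epochs": 3,
            "learning_rate": 1e-5,
        }
        ablations = {
            "I+T+L": {"image": True,  "text": True,  "layout": True},
            "T+L":   {"image": False, "text": True,  "layout": True},
            "I+T":   {"image": True,  "text": True,  "layout": False},
            "I":     {"image": True,  "text": False, "layout": False},  # per the slide: image only, w/o pre-training
        }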
  8. Results
     • Our full model (I+T+L) achieved the highest accuracy, confirming the effectiveness of the multimodal approach.
     • The removal of layout information (I+T) caused a performance drop, highlighting its critical role.
  9. Case Study: Why the Layout Modality Is Crucial
     • Problem: mismatch between a cell-id (r2c4) and its actual visual column.
     • Without layout: the model is misled by the inconsistent cell-id.
     • With layout: bounding-box coordinates reveal the true table structure (an illustrative sketch follows this item).
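     An illustrative sketch of why coordinates disambiguate: visual columns can be recovered by grouping cells on their x-centers instead of trusting the HTML cell-id (the function and tolerance are hypothetical, not the model's internal mechanism):

        # Group text blocks into visual columns by horizontal position, ignoring cell-ids.
        def visual_columns(blocks, tol=0.03):
            columns = []  # list of (x_center, [texts])
            for b in sorted(blocks, key=lambda b: (b["bbox"][0] + b["bbox"][2]) / 2):
                cx = (b["bbox"][0] + b["bbox"][2]) / 2
                if columns and abs(cx - columns[-1][0]) < tol:
                    columns[-1][1].append(b["text"])   # same visual column
                else:
                    columns.append((cx, [b["text"]]))  # start a new column
            return [texts for _, texts in columns]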
  10. Limitations / Future Work
      • Limitations
        • Reliance on cell-ids: not applicable to general real-world documents.
        • Assumption of clean text: not robust to noisy or handwritten tables with OCR errors.
      • Future work
        • Direct value prediction: to eliminate the dependency on cell-ids.
        • Robustness for noisy documents: by exploring end-to-end models or enhanced OCR.
  11. Conclusion
      • We proposed a multimodal approach for the Table QA task, integrating image, text, and layout information.
      • Our experiments showed this method is highly effective, with text and layout proving to be the most critical modalities for achieving high accuracy.
      • This study highlights that combining visual, textual, and spatial context is key to robustly understanding complex structured data.