
STMK24 NTCIR18 U4 Table QA Submission

eida
June 11, 2025

Slides used for the oral presentation at NTCIR-18
https://research.nii.ac.jp/ntcir/ntcir-18/program.html


Transcript

  1. Motivation
     • U4 Task: QA over clean HTML tables.
     • Real-world tables: often in image or PDF format.
     • Challenge: RAG needs robust table handling across formats.
     • Our approach: render HTML → solve as multimodal QA (a rendering sketch follows this item).
     • Goal: develop a practical method usable in business RAG.
     • Clean HTML structure (given in the U4 task) vs. real-world input format (our challenge), e.g.:
       <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td><td>第68期</td><td>第69期</td></tr><tr><td colspan="2">決算年月</td><td>2016年1月</td><td>2017年1月</td><td>2018年1月</td><td>2019年1月</td>…</table>
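     A minimal sketch of the HTML-to-image rendering step, assuming a headless browser via Playwright; the renderer choice and function name are ours, not part of the submission:

        # Render an HTML table string to a PNG so it can be treated as multimodal QA input.
        # Assumes Playwright with Chromium is installed; any HTML-to-image renderer would do.
        from playwright.sync_api import sync_playwright

        def render_table_to_png(table_html: str, out_path: str = "table.png") -> None:
            with sync_playwright() as p:
                browser = p.chromium.launch()
                page = browser.new_page()
                page.set_content(f"<html><body>{table_html}</body></html>")
                # Screenshot only the <table> element so the image is tightly cropped.
                page.locator("table").screenshot(path=out_path)
                browser.close()

        render_table_to_png('<table><tr><td colspan="2">回次</td><td>第65期</td></tr></table>')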
  2. Representation of Table Structures
     • We focused on PDF tables and used image, text, and layout to tackle Table QA.
     • Highly structured formats: HTML, JSON, Markdown (e.g., <table><tr><td colspan="2">回次</td><td>第65期</td><td>第66期</td><td>第67期</td>…</table>).
     • Low-structure / unstructured formats: image, or PDF / image + OCR; this covers most business documents used for RAG.
  3. Key Points of Our Method
     1. Our strategy: predicting the cell-ID, not the cell value.
     2. Our input: fusing three modalities for precision: image, text, and layout.
  4. Strategy 1: Cell-ID Embedding
     • We focus on predicting cell-IDs to bypass the LVLM's weakness in math.
     • Cell-IDs (e.g., "r3c1") are rendered directly onto the table image as visual objects (a sketch of this overlay follows this item).
     • This transforms the QA task into a more straightforward visual recognition problem.
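     An illustrative sketch of the Cell-ID overlay, assuming each cell's pixel bounding box is already known (e.g., from the renderer); the function and box format are hypothetical:

        # Draw cell-IDs such as "r3c1" onto the rendered table image so the LVLM can
        # answer with an ID instead of computing or transcribing a numeric value.
        from PIL import Image, ImageDraw

        def overlay_cell_ids(image_path, cell_boxes, out_path="table_with_ids.png"):
            img = Image.open(image_path).convert("RGB")
            draw = ImageDraw.Draw(img)
            for cell_id, (x0, y0, x1, y1) in cell_boxes.items():
                draw.rectangle((x0, y0, x1, y1), outline="red")   # mark the cell
                draw.text((x0 + 2, y0 + 2), cell_id, fill="red")  # e.g. "r3c1"
            img.save(out_path)

        overlay_cell_ids("table.png", {"r1c1": (0, 0, 120, 30), "r1c2": (120, 0, 240, 30)})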
  5. Strategy 2: Layout and Text Modalities
     • We use two additional modalities: text and layout.
     • Layout: the bounding-box coordinates for each text block (an example input record follows this item).
     • Benefit: avoids complex table-structure reconstruction.
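     A minimal sketch of how a text block and its layout might be paired, assuming bounding boxes normalized to [0, 1]; the field names are illustrative:

        # One record per text block: the text plus its (x0, y0, x1, y1) bounding box,
        # taken directly from the PDF/OCR output, with no table-structure reconstruction.
        text_blocks = [
            {"text": "回次",      "bbox": (0.05, 0.10, 0.20, 0.14)},
            {"text": "第65期",    "bbox": (0.22, 0.10, 0.35, 0.14)},
            {"text": "2016年1月", "bbox": (0.22, 0.16, 0.35, 0.20)},
        ]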
  6. Layout-Aware LVLM Architecture
     • Layout coordinates are encoded into features via an MLP.
     • Each text token is fused with its corresponding layout feature.
     • These text-layout pairs and the image features form the final input to the LLM (a fusion sketch follows this item).
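     A minimal PyTorch sketch of that fusion step, assuming 4-d normalized boxes and additive fusion with the token embeddings; the hidden size and module names are assumptions:

        import torch
        import torch.nn as nn

        class LayoutFusion(nn.Module):
            """Encode (x0, y0, x1, y1) boxes with an MLP and fuse them into text token embeddings."""
            def __init__(self, hidden_dim: int = 4096):
                super().__init__()
                self.layout_mlp = nn.Sequential(
                    nn.Linear(4, hidden_dim),
                    nn.GELU(),
                    nn.Linear(hidden_dim, hidden_dim),
                )

            def forward(self, token_embeds: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
                # token_embeds: (batch, seq_len, hidden_dim); bboxes: (batch, seq_len, 4)
                return token_embeds + self.layout_mlp(bboxes)

     The fused text-layout embeddings would then be concatenated with the image features before entering the language model, as described above.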
  7. Experiments
     1. General setup
        • Base model: LLaVA-OneVision-7B
        • Fine-tuning: all models were fine-tuned on the Table QA dataset for 3 epochs with a learning rate of 1e-5.
     2. Ablation study conditions: to analyze the impact of each modality, we compared the following four settings (a configuration sketch follows this item).
        • I+T+L: training with image, text, and layout
        • T+L: training with text and layout
        • I+T: training with image and text
        • I: training with image, w/o pre-training
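     An illustrative configuration grid for the four ablation settings listed above; the flag names are assumptions, not the actual training script:

        base_config = {
            "base_model": "LLaVA-OneVision-7B",  # as reported; the exact checkpoint id is an assumption
            "epochs": 3,
            "learning_rate": 1e-5,
        }
        ablations = {
            "I+T+L": {"image": True,  "text": True,  "layout": True},
            "T+L":   {"image": False, "text": True,  "layout": True},
            "I+T":   {"image": True,  "text": True,  "layout": False},
            "I":     {"image": True,  "text": False, "layout": False},  # per the slide: image only, w/o pre-training
        }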
  8. Results
     • Our full model (I+T+L) achieved the highest accuracy, confirming the effectiveness of the multimodal approach.
     • The removal of layout information (I+T) caused a performance drop, highlighting its critical role.
  9. Case Study: Why the Layout Modality Is Crucial
     • Problem: mismatch between a cell-id (r2c4) and its actual visual column.
     • Without layout: the model is misled by the inconsistent cell-id.
     • With layout: bounding-box coordinates reveal the true table structure (an illustrative sketch follows this item).
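     An illustrative sketch of why coordinates disambiguate: visual columns can be recovered by grouping cells on their x-centers instead of trusting the HTML cell-id (the function and tolerance are hypothetical, not the model's internal mechanism):

        # Group text blocks into visual columns by horizontal position, ignoring cell-ids.
        def visual_columns(blocks, tol=0.03):
            columns = []  # list of (x_center, [texts])
            for b in sorted(blocks, key=lambda b: (b["bbox"][0] + b["bbox"][2]) / 2):
                cx = (b["bbox"][0] + b["bbox"][2]) / 2
                if columns and abs(cx - columns[-1][0]) < tol:
                    columns[-1][1].append(b["text"])   # same visual column
                else:
                    columns.append((cx, [b["text"]]))  # start a new column
            return [texts for _, texts in columns]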
  10. Limitations / Future Work
      • Limitations
        • Reliance on cell-ids: not applicable to general real-world documents.
        • Assumption of clean text: not robust to noisy or handwritten tables with OCR errors.
      • Future work
        • Direct value prediction: to eliminate the dependency on cell-ids.
        • Robustness for noisy documents: by exploring end-to-end models or enhanced OCR.
  11. Conclusion
      • We proposed a multimodal approach for the Table QA task, integrating image, text, and layout information.
      • Our experiments showed this method is highly effective, with text and layout proving to be the most critical modalities for achieving high accuracy.
      • This study highlights that combining visual, textual, and spatial context is key to robustly understanding complex structured data.