
Document Analysis Layout Analysis, OCR and NLP

Mahdi Khashan
June 02, 2025

TU Wien 183.628 Document Analysis SS2025

Transcript

  1. Task B - OCR

    • PSM out of the box
    • Requires preprocessing
    • Ease of use
    • Probably better results

    Vision Language Models
    • Better accuracy in complex layouts
    • Different fonts and styles
    • No PSM
    • Challenging usage
  2. Task B - OCR - Full Image

    Mean WER = 0.984, Mean CER = 0.984
  3. Task B - OCR - Color Cropped

    Mean WER = 1.009, Mean CER = 0.817
  4. Task B - OCR - Binarized, Cropped with Line Detection

    Neunkirchner Alleerte -------------------------------- Aufn. Dr . Machura 1948. Film Nr . 17 . Neg. 17 /
    WER = 1.83, CER = 0.789

    A Gr. Riedenthal Tobacco Co. mit "Neun Mauna" und'e de NOMTAY JOURNAL, J. W. W. LIMA VV- 4 0.0 0.0 0.0 0.0 0 Aufn. Meisinger 1956. Rollfilm Nr. 39, Neg. Nr. 10,9.
    WER = 2.75, CER = 0.82
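Several of the WER values above are greater than 1. That is possible because WER divides the number of word-level edits by the length of the reference, and a noisy OCR output can need more edits than the reference has words. The slides do not say which tool computed the metrics, so the following is only a minimal sketch of the usual Levenshtein-based definition. The reference and hypothesis strings are made up for illustration and are not the ground truth used in the task.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example (not taken from the data set): the OCR output splits
# tokens and appends extra ones, so it needs more edits than the reference
# has words and the WER ends up above 1.
reference = "Aufn. Meisinger 1956"
hypothesis = "Aufn . Meisinger 1956 . Nr 10"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Whitespace tokenization is an assumption here. A different tokenizer or text normalization (case, punctuation) would change the numbers.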
  5. Task C - NLP (spaCy with “de_core_news_lg”)

    0001
    Neunkirchner Alleerte -------------------------------- Aufn. Dr . Machura 1948. Film Nr . 17 . Neg. 17 /
    • Neunkirchner Alleerte - MISC
    • Nr - MISC
    • Machura - PER
    • Neg - PER

    0005
    Massachusetts, Massachusetts Flugerdeverwenungen.co m CAUTH. Meisinger, J.W. Rollfilm 4x6. N. Sch. K. 258
    • Massachusetts - LOC
    • Massachusetts - LOC
    • Flugerdeverwenungen.co m\nCAUTH - MISC
    • Meisinger - PER
    • N. Sch - PER

    0030
    A Gr. Riedenthal Tobacco Co. mit "Neun Mauna" und'e de NOMTAY JOURNAL, J. W. W. LIMA VV- 4 0.0 0.0 0.0 0.0 0 Aufn. Meisinger 1956. Rollfilm Nr. 39, Neg. Nr. 10,9.
    • A Gr - MISC
    • Nr. 39 - MISC
    • Tobacco - ORG
    • LIMA VV- - ORG
    • Mauna - LOC
    • Neg - LOC
    • J. W. W. - PER
    • Meisinger - PER
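The entity lists above can be produced with a few lines of spaCy. This is a sketch and not the author's actual script. The helper names `load_german_pipeline` and `extract_entities` are hypothetical, and it assumes spaCy and the `de_core_news_lg` model are installed.

```python
def load_german_pipeline(name="de_core_news_lg"):
    """Load a spaCy pipeline; spaCy is imported lazily so the helper below
    can be used (and tested) without the model being installed."""
    import spacy
    return spacy.load(name)


def extract_entities(nlp, text):
    """Run the pipeline on OCR text and return (entity text, label) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]
```

A call such as `extract_entities(load_german_pipeline(), ocr_text)` would return pairs like `("Machura", "PER")`. The labels depend on the model, which was trained on news text rather than archival captions.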
  6. Task C - NLP (LLM)

    Models: Gemini 2.5 Pro, GPT o-4, gemma-3-4b-it (Local Inference)

    Prompt:
    you are an NER model, your task is to find following information on a german text: Location, Description, Date, Photographer, Film
    return the result as a JSON, use the following format:
    { "Location": "Helenental", "Description": "unerlaubte Rodung im Schutzgebiet", "Date": "April 1948", "Photographer": "Meisinger", "Film": "Neg.Nr. 3254/KIX/16, Film", }
    I will send text, are you ready?

    Prompt:
    can you act as NER and find information about this keys in the following text, the language of the text id german: 1. Location 2. Description 3. Date 4. Photographer 5. Film
    text: Regelsorunner Au.S.A.A. Aufn. Dr. Machura 1948 Film Nr. 19, Neg, Nr. 21. u. 22.
    your output should be a json like this:
    { "Location": "", "Description": "", "Date": "", "Photographer": "", "Film": "", }

    Responses:
    { "Location": "Neunkirchner Allee", "Description": "", "Date": "1948", "Photographer": "Dr. Machura", "Film": "Film Nr. 17, Neg. 17 /" }
    { "Location": "Neunkirchner Allee", "Description": "", "Date": "1948", "Photographer": "Dr. Machura", "Film": "Film Nr. 17, Neg. 17" }
    { "Location": "Aufn. Dr. Machura", "Description": "Regelsorunner Au.S.A.A.", "Date": "1948", "Photographer": null, "Film": "Film Nr. 19, Neg, Nr. 21. u. 22." }
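The format examples in the prompts end with a trailing comma, which strict JSON parsers reject, and one response returns `null` for a field. A tolerant parsing step helps before comparing model outputs. The following is only a sketch of one way to do that with the standard library. The function name `parse_llm_reply` and the choice to map missing or `null` fields to an empty string are assumptions, not part of the original pipeline.

```python
import json
import re

EXPECTED_KEYS = ("Location", "Description", "Date", "Photographer", "Film")


def parse_llm_reply(reply):
    """Extract the JSON object from an LLM reply and normalize its fields."""
    # Take everything from the first "{" to the last "}" so that surrounding
    # chatter or code fences in the reply are ignored.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    # Drop a comma that directly precedes a closing brace or bracket.
    # Caveat: this simple regex would also alter a string value containing ",}".
    raw = re.sub(r",\s*([}\]])", r"\1", match.group(0))
    data = json.loads(raw)
    # Missing keys and null values become empty strings (an assumption).
    return {key: data.get(key) or "" for key in EXPECTED_KEYS}


reply = ('{ "Location": "Aufn. Dr. Machura", "Description": "Regelsorunner Au.S.A.A.", '
         '"Date": "1948", "Photographer": null, "Film": "Film Nr. 19, Neg, Nr. 21. u. 22." }')
print(parse_llm_reply(reply))
```

Normalizing every reply to the same five keys makes it straightforward to compare models field by field.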
  7. Task C - NLP (Results)

    LLM (Agent) / spaCy
    • Great results on large models
    • Larger model, better context, better result
    • Trained on huge data
    • Cost (hardware and time)
    • Better on location labeling
    • Poor results
    • Unrelated dataset (news)
    • PCA helped, but still low variance (high overlap)
    • PER PCA (41% information for both PCs)
    • Better OCR, better NLP
    • Person names include "Dr. Machura", "Meisinger", "Kupelwieser".
    • Many location names appear to be Austrian or German ("Neunkirchner Allee", "Klamm bei Schottwien", "Schneeberg", "Lilienfeld-Klosteralm")
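The "41% information" figure reads like an explained variance ratio, which is the share of total variance captured by the leading principal components. The slides do not show how it was computed, so this is a sketch under that assumption, using NumPy's SVD rather than any particular PCA library. The function name `explained_variance_ratio` is hypothetical. It also fails loudly when all vectors are identical, which is what all-null embedding vectors would look like.

```python
import numpy as np


def explained_variance_ratio(vectors, n_components=2):
    """Share of total variance captured by each of the first principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                     # center the data
    singular_values = np.linalg.svd(X, compute_uv=False)
    variances = singular_values ** 2           # proportional to per-component variance
    total = variances.sum()
    if np.isclose(total, 0.0):
        raise ValueError("no variance: all vectors are identical (e.g. all zeros)")
    return variances[:n_components] / total


# Toy data: points lying almost on a line, so the first component dominates.
points = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9]]
ratios = explained_variance_ratio(points)
print(ratios, ratios.sum())
```

A low combined ratio for the first two components means a 2-D scatter plot discards most of the structure, which is consistent with the high overlap reported above.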
  8. Reflection

    • Some documents contained more than one printed text region
    • TrOCR accuracy decreased for later artifacts
    • Metadata caches everything; my disk filled up multiple times, causing the pipeline to fail
    • spaCy vectors were all null!
    • PSM is probably the bottleneck of my solution
    • A baseline OCR model can be helpful and is useful for validation
    • I don’t know how to debug OCR and NLP models and validate their results
    • t-SNE: 88266 segmentation fault python tsne.py