Conquering PDFs: document understanding beyond plain text

NLP and data science could be so easy if all of our data came as clean and plain text. But in practice, a lot of it is hidden away in PDFs, Word documents, scans and other formats that have been a nightmare to work with. In this talk, I'll present a new and modular approach for building robust document understanding systems, using state-of-the-art models and the awesome Python ecosystem. I'll show you how you can go from PDFs to structured data and even build fully custom information extraction pipelines for your specific use case.

For the practical examples, I'll be using spaCy and the new Docling library and layout analysis models. I'll also cover Optical Character Recognition (OCR) for image-based text, how to convert tabular data to pandas DataFrames, and strategies for creating training and evaluation data for information extraction tasks like text classification and entity recognition using PDFs and other documents as inputs.
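
One of the slides in the transcript mentions customizing how a table is represented in the document text, with the example "Table with columns: Names, Amount". The following is a minimal sketch of such a formatter, not the talk's actual code. It assumes the table arrives as an object with a `columns` attribute, as a pandas DataFrame has. The function name `display_table` and the stand-in table object are illustrative choices; check the spaCy Layout documentation for how a formatter like this is registered.

```python
from types import SimpleNamespace


def display_table(table) -> str:
    """Return a short text placeholder that lists the table's column names.

    `table` is assumed to expose a `columns` attribute, like a pandas
    DataFrame. Only the column names are used, not the cell values.
    """
    names = ", ".join(str(name) for name in table.columns)
    return f"Table with columns: {names}"


# Stand-in for a DataFrame so the sketch runs without pandas installed.
example_table = SimpleNamespace(columns=["Names", "Amount"])
print(display_table(example_table))
```

Keeping the placeholder short means downstream NLP components see a compact description of the table in the running text, while the full tabular data can be kept separately as a DataFrame.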

Blog post: https://explosion.ai/blog/pdfs-nlp-structured-data

Ines Montani

April 23, 2025

Resources

From PDFs to AI-ready structured data: a deep dive

https://explosion.ai/blog/pdfs-nlp-structured-data

The blog post this talk is based on, covering how to build end-to-end document understanding and information extraction pipelines for industry use cases.

Docling

https://docling-project.github.io/docling/

Open-source library and models for processing PDFs, Word documents and similar formats, including features for layout analysis, OCR and table structure recognition.

spaCy Layout

https://github.com/explosion/spacy-layout

Open-source library and plugin for processing PDFs, Word documents and more with spaCy, powered by Docling.

Prodigy PDF

https://prodi.gy/docs/plugins#pdf

Plugin for the Prodigy annotation tool, including recipes for image-based and text-based PDF annotation.

Docling Technical Report

https://arxiv.org/abs/2408.09869

Auer et al., 2024

TableFormer: Table Structure Understanding with Transformers

https://arxiv.org/abs/2203.01017

Nassar et al., 2022

A practical guide to human-in-the-loop distillation

https://explosion.ai/blog/human-in-the-loop-distillation

Practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

Transcript

  1. Modern scriptable annotation tool for machine learning developers prodigy.ai Alex Smith Developer Kim Miller Analyst GPT-4 API companies users
  2. “I have the data in a PDF.” “I have the data on my computer.” Documents
  3. explosion.ai/blog/human-in-the-loop-distillation At their core, many NLP systems consist of flat classifications. You can shove them into a single prompt, or you can decompose them into smaller pieces. Many classification tasks are straightforward to solve nowadays – but they become vastly more complicated if one model needs to do them all at once.
  4. docling-project.github.io/docling unified structured format DoclingDocument Open-source library and models for document processing Docling Technical Report (Auer et al., 2024) just like spaCy’s Doc object
  5. github.com/explosion/spacy-layout Doc object named entities, part-of-speech tags, dependencies, … transformer-based English pipeline PDF apply pipeline to Doc Doc object NLP pipeline Doc object processing
  6. find layout span containing an entity Apple Inc. ORG bounding box closest heading Company summary SECTION_HEADER
  7. TableFormer: Table Structure Understanding with Transformers (Nassar et al., 2022) bounding box github.com/explosion/spacy-layout pandas.DataFrame
  8. TableFormer: Table Structure Understanding with Transformers (Nassar et al., 2022) bounding box github.com/explosion/spacy-layout pandas.DataFrame customize table representation in text Table with columns: Names, Amount
  9. explosion.ai/blog/pdfs-nlp-structured-data Doc object processing PDF Doc object NLP pipeline annotation transfer learning Examine the role of layout features Does it actually matter for the task? What can we abstract away to generalize better?
  10. prodigy.ai/docs/plugins $ prodigy pdf.spans.manual papers blank:en ./pdfs --label EVENT,PLACE --focus text,list_item recipe input data sections saved to dataset
  11. $ prodigy train ./models --ner papers --eval-split 0.3 NLP model output dataset evaluation %
  12. $ prodigy train ./models --ner papers --eval-split 0.3 NLP model output dataset apply PDF processing evaluation %
  13. $ prodigy train ./models --ner papers --eval-split 0.3 NLP model output dataset apply PDF processing apply model evaluation %
  14. $ prodigy train ./models --ner papers --eval-split 0.3 NLP model output dataset apply PDF processing apply model process documents at scale ship & deploy evaluation %
  15. Work from a unified structured format and get your data out of PDFs as early as you can. PDFs are a bad source of truth
  16. Combine document processing with NLP components you can develop independently. Modularity is your superpower Work from a unified structured format and get your data out of PDFs as early as you can. PDFs are a bad source of truth
  17. Combine document processing with NLP components you can develop independently. Modularity is your superpower Layout analysis models are steadily getting better, faster and smaller! This is only the beginning Work from a unified structured format and get your data out of PDFs as early as you can. PDFs are a bad source of truth
  18. explosion.ai/blog/pdfs-nlp-structured-data Combine document processing with NLP components you can develop independently. Modularity is your superpower Layout analysis models are steadily getting better, faster and smaller! This is only the beginning Work from a unified structured format and get your data out of PDFs as early as you can. PDFs are a bad source of truth
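
Slide 6 describes finding the layout span that contains an entity and, from there, its bounding box and closest heading. The sketch below illustrates that lookup with plain Python data classes and character offsets. Everything in it (`LayoutSpan`, `find_layout_span`, `closest_heading`, the label strings and the box values) is hypothetical and stands in for whatever the real pipeline provides; it is not spaCy Layout's API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

HEADING_LABEL = "SECTION_HEADER"  # label name taken from the slide text


@dataclass
class LayoutSpan:
    """Hypothetical layout region, given as character offsets into the text."""
    start: int
    end: int
    label: str
    text: str
    box: Tuple[float, float, float, float]  # made-up x, y, width, height


def find_layout_span(
    spans: List[LayoutSpan], ent_start: int, ent_end: int
) -> Optional[LayoutSpan]:
    """Return the first layout span that fully contains the entity offsets."""
    for span in spans:
        if span.start <= ent_start and ent_end <= span.end:
            return span
    return None


def closest_heading(
    spans: List[LayoutSpan], target: LayoutSpan
) -> Optional[LayoutSpan]:
    """Return the last heading that starts before `target`.

    Assumes `spans` is sorted by start offset (reading order).
    """
    heading = None
    for span in spans:
        if span.start >= target.start:
            break
        if span.label == HEADING_LABEL:
            heading = span
    return heading


# Toy document: one heading followed by one paragraph.
heading_text = "Company summary"
body_text = "Apple Inc. designs consumer electronics."
text = heading_text + "\n" + body_text

spans = [
    LayoutSpan(0, len(heading_text), HEADING_LABEL, heading_text,
               (72.0, 90.0, 200.0, 18.0)),
    LayoutSpan(len(heading_text) + 1, len(text), "TEXT", body_text,
               (72.0, 120.0, 450.0, 14.0)),
]

# Entity offsets, as an entity recognizer might report them.
ent_start = text.index("Apple Inc.")
ent_end = ent_start + len("Apple Inc.")

container = find_layout_span(spans, ent_start, ent_end)
heading = closest_heading(spans, container) if container else None

print(container.label, container.box)
print(heading.text)
```

Working from character offsets keeps the lookup independent of the PDF itself: once the layout spans are extracted, the entity recognizer and the layout logic can be developed and tested separately.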