Conquering PDFs: document understanding beyond plain text

Ines Montani, Explosion

spacy.io
Open-source library for industrial-strength natural language processing
ChatGPT can write spaCy code!

prodigy.ai
Modern scriptable annotation tool for machine learning developers

Documents
Businesses want electronic copies that map 1:1 to paper documents.
“I have the data in a PDF.” “I have the data on my computer.”

explosion.ai/blog/human-in-the-loop-distillation
At their core, many NLP systems consist of flat classifications. You can shove them into a single prompt, or you can decompose them into smaller pieces. Many classification tasks are straightforward to solve nowadays – but they become vastly more complicated if one model needs to do them all at once.

docling-project.github.io/docling
Docling: open-source library and models for document processing
Unified structured format: DoclingDocument, just like spaCy’s Doc object
Docling Technical Report (Auer et al., 2024)

github.com/explosion/spacy-layout
Process a document and create a spaCy Doc that exposes the text-based contents, the document layout and the layout sections. Each layout section provides its content, tokens and offsets, its section type and its bounding box.

github.com/explosion/spacy-layout
Processing turns the PDF into a Doc object. An NLP pipeline (here, a transformer-based English pipeline) is then applied to that Doc, producing a Doc object with named entities, part-of-speech tags, dependencies, …

Find the layout span containing an entity: Apple Inc. (ORG)
Get the bounding box of that layout span
Get the closest heading: Company summary (SECTION_HEADER)
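The lookup on this slide can be sketched in plain Python. This is a minimal sketch, not the spacy-layout API: the `Span` dataclass, the helper names and all example values except "Apple Inc.", "ORG", "Company summary" and "SECTION_HEADER" are hypothetical stand-ins. It assumes that entities and layout sections share the same token offsets, so containment is a simple range check.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for a span over tokens (not the spaCy Span class)."""
    start: int  # index of the first token
    end: int    # index after the last token (exclusive)
    label: str
    text: str
    bbox: Optional[Tuple[float, float, float, float]] = None  # (x, y, width, height)


def containing_layout_span(ent: Span, layout: List[Span]) -> Optional[Span]:
    """Return the first layout span whose token range fully covers the entity."""
    for span in layout:
        if span.start <= ent.start and ent.end <= span.end:
            return span
    return None


def closest_heading(span: Span, layout: List[Span]) -> Optional[Span]:
    """Return the nearest SECTION_HEADER span that ends at or before the given span."""
    headings = [s for s in layout if s.label == "SECTION_HEADER" and s.end <= span.start]
    return max(headings, key=lambda s: s.end, default=None)


# Made-up layout spans and entity, for illustration only.
layout = [
    Span(0, 2, "SECTION_HEADER", "Company summary", (72.0, 90.0, 180.0, 18.0)),
    Span(2, 8, "TEXT", "Apple Inc. designs consumer electronics.", (72.0, 120.0, 450.0, 14.0)),
    Span(8, 9, "SECTION_HEADER", "Financials", (72.0, 160.0, 120.0, 18.0)),
    Span(9, 15, "TEXT", "Revenue grew year over year.", (72.0, 190.0, 450.0, 14.0)),
]
entity = Span(2, 4, "ORG", "Apple Inc.")

section = containing_layout_span(entity, layout)
heading = closest_heading(section, layout) if section else None
print(section.bbox if section else None)
print(heading.text if heading else None)
```

Because the entity and the layout sections live on the same token sequence, no coordinate matching is needed: the bounding box comes for free once the containing section is found.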

github.com/explosion/spacy-layout
Tables: bounding box, data as a pandas.DataFrame, and a customizable table representation in the text, e.g. "Table with columns: Names, Amount"
TableFormer: Table Structure Understanding with Transformers (Nassar et al., 2022)
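A custom text representation for a table could look roughly like the sketch below. The slide names pandas.DataFrame as the table type; to stay dependency-free, this sketch assumes a hypothetical list-of-dicts stand-in for the table, and the function name `display_table` is used here for illustration only. The row values are made up; only the column names come from the slide.

```python
from typing import Dict, List

Row = Dict[str, str]


def display_table(rows: List[Row]) -> str:
    """Return a short text placeholder that names the table's columns."""
    columns = list(rows[0].keys()) if rows else []
    return f"Table with columns: {', '.join(columns)}"


# Made-up table rows, for illustration only.
table = [
    {"Names": "A", "Amount": "10"},
    {"Names": "B", "Amount": "20"},
]
print(display_table(table))
```

A placeholder like this keeps the running text readable for downstream NLP components, while the full tabular data stays available separately.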

Layout analysis and natural language processing

explosion.ai/blog/pdfs-nlp-structured-data
PDF → processing → Doc object → NLP pipeline → Doc object → annotation

Transfer learning
Examine the role of layout features: Does it actually matter for the task? What can we abstract away to generalize better?

prodigy.ai/docs/plugins
$ prodigy pdf.spans.manual papers blank:en ./pdfs --label EVENT,PLACE --focus text,list_item
recipe: pdf.spans.manual
saved to dataset: papers
input data: ./pdfs
sections: text,list_item
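Conceptually, focusing on certain sections amounts to keeping only layout sections of the listed types. The following is a minimal sketch of that idea, not Prodigy's implementation: the `focus_sections` helper, the tuple representation and the example sections are hypothetical, and the section type names other than `text` and `list_item` are assumed for illustration.

```python
from typing import Iterable, List, Set, Tuple

Section = Tuple[str, str]  # (section type, text)


def focus_sections(sections: Iterable[Section], focus: Set[str]) -> List[Section]:
    """Keep only the sections whose type is in the focus set, preserving order."""
    return [section for section in sections if section[0] in focus]


# Made-up sections of a parsed page, for illustration only.
sections = [
    ("section_header", "Programme"),
    ("text", "The workshop takes place in Berlin."),
    ("list_item", "Keynote at 9:00"),
    ("page_footer", "Page 1"),
]
focused = focus_sections(sections, {"text", "list_item"})
for section_type, text in focused:
    print(section_type, text)
```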

$ prodigy train ./models --ner papers --eval-split 0.3
output: ./models
dataset: papers
evaluation %: 0.3
NLP model: apply PDF processing, apply model, process documents at scale, ship & deploy
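An evaluation split of 0.3 means that roughly 30% of the annotated examples are held back for evaluation instead of training. The sketch below illustrates that idea only; it is not how Prodigy implements the split. The `split_examples` helper, the fixed seed and the example data are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_examples(
    examples: Sequence[str], eval_split: float = 0.3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the examples and hold back a fraction of them for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_eval = round(len(shuffled) * eval_split)
    return shuffled[n_eval:], shuffled[:n_eval]


# Made-up annotated examples, for illustration only.
examples = [f"example {i}" for i in range(10)]
train_set, eval_set = split_examples(examples, eval_split=0.3)
print(len(train_set), len(eval_set))
```

Shuffling before splitting avoids an evaluation set that only contains examples from the last documents that were annotated.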

PDFs are a bad source of truth: Work from a unified structured format and get your data out of PDFs as early as you can.
Modularity is your superpower: Combine document processing with NLP components you can develop independently.
This is only the beginning: Layout analysis models are steadily getting better, faster and smaller!
explosion.ai/blog/pdfs-nlp-structured-data

Explosion: explosion.ai
spaCy: spacy.io
Prodigy: prodigy.ai
Mastodon: @[email protected]
Bluesky: @inesmontani.bsky.social
LinkedIn


Resources

From PDFs to AI-ready structured data: a deep dive

Docling

spaCy Layout

Prodigy PDF

Docling Technical Report

TableFormer: Table Structure Understanding with Transformers

A practical guide to human-in-the-loop distillation
