Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
0
150
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.5k
From idea to production with NLP, Scala and Spark
niekbartho
3
460
Going DevOps with BMC
niekbartho
0
180
Orchestration in meatspace
niekbartho
4
1.9k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
430
DevOps for Dinosaurs
niekbartho
12
2.9k
Other Decks in Technology
See All in Technology
経済メディア編集部の実務に小さく刺さるAI / small-ai-with-editorial
nkzn
2
240
製造業向けIoTソリューション提案資料.pdf
haruki_uiru
0
240
問 1:以下のコンパイラを証明せよ(予告編) #kernelvm / Kernel VM Study Kansai 11th
ytaka23
3
480
Как мы автоматизировали интеграционное тестирование с Gonkey и не пожалели. Паша Егорычев, Кирилл Поляков
lamodatech
0
2.1k
DynamoDB のデータを QuickSight で可視化する際につまづいたこと/stumbling-blocks-when-visualising-dynamodb-with-quicksight
emiki
0
140
ソフトウェアテスト 最初の一歩 〜テスト設計技法をワークで体験しながら学ぶ〜 #JaSSTTokyo / SoftwareTestingFirstStep
nihonbuson
PRO
1
140
TanStack Start 技術選定の裏側 / Findy-Lunch-LT-TanStack-Start
iktakahiro
0
110
使えるデータ基盤を作る技術選定の秘訣 / selecting-the-right-data-technology
pei0804
5
850
Kaigi Effect 2025 #rubykaigi2025_after
sue445
0
100
Datadog のトライアルを成功に導く技術 / Techniques for a successful Datadog trial
nulabinc
PRO
0
130
AIによるコードレビューで開発体験を向上させよう!
moongift
PRO
0
420
社会人力と研究力ー博士号をキャリアの武器にするー
kentaro
2
110
Featured
See All Featured
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
48
5.4k
We Have a Design System, Now What?
morganepeng
52
7.6k
How GitHub (no longer) Works
holman
314
140k
Scaling GitHub
holman
459
140k
Designing for Performance
lara
608
69k
4 Signs Your Business is Dying
shpigford
183
22k
Build The Right Thing And Hit Your Dates
maggiecrowley
35
2.7k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
Into the Great Unknown - MozCon
thekraken
38
1.8k
Unsuck your backbone
ammeep
671
58k
Java REST API Framework Comparison - PWX 2021
mraible
31
8.6k
[RailsConf 2023] Rails as a piece of cake
palkan
54
5.5k
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com