Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
0
160
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.5k
From idea to production with NLP, Scala and Spark
niekbartho
3
470
Going DevOps with BMC
niekbartho
0
190
Orchestration in meatspace
niekbartho
4
2k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
450
DevOps for Dinosaurs
niekbartho
12
3k
Other Decks in Technology
See All in Technology
Flutter向けPDFビューア、pdfrxのpdfium WASM対応について
espresso3389
0
120
モバイル界のMCPを考える
naoto33
0
410
OPENLOGI Company Profile
hr01
0
67k
Lambda Web Adapterについて自分なりに理解してみた
smt7174
6
160
20250705 Headlamp: 專注可擴展性的 Kubernetes 用戶界面
pichuang
0
190
Tokyo_reInforce_2025_recap_iam_access_analyzer
hiashisan
0
170
Backlog ユーザー棚卸しRTA、多分これが一番早いと思います
__allllllllez__
1
130
American airlines ®️ USA Contact Numbers: Complete 2025 Support Guide
airhelpsupport
0
200
Glacierだからってコストあきらめてない? / JAWS Meet Glacier Cost
taishin
1
140
Zero Data Loss Autonomous Recovery Service サービス概要
oracle4engineer
PRO
2
7.7k
「クラウドコスト絶対削減」を支える技術—FinOpsを超えた徹底的なクラウドコスト削減の実践論
delta_tech
4
130
2025-07-06 QGIS初級ハンズオン「はじめてのQGIS」
kou_kita
0
150
Featured
See All Featured
The Power of CSS Pseudo Elements
geoffreycrofte
77
5.8k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
29
1.8k
GraphQLの誤解/rethinking-graphql
sonatard
71
11k
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
26
2.9k
Practical Orchestrator
shlominoach
188
11k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
8
680
ReactJS: Keep Simple. Everything can be a component!
pedronauck
667
120k
Producing Creativity
orderedlist
PRO
346
40k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
26k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
15
1.5k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
138
34k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
8
810
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com