Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
0
170
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.5k
From idea to production with NLP, Scala and Spark
niekbartho
3
480
Going DevOps with BMC
niekbartho
0
200
Orchestration in meatspace
niekbartho
4
2k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
460
DevOps for Dinosaurs
niekbartho
12
3k
Other Decks in Technology
See All in Technology
20251014_Pythonを実務で徹底的に使いこなした話
ippei0923
0
210
Contract One Engineering Unit 紹介資料
sansan33
PRO
0
8.8k
Findy Team+ QAチーム これからのチャレンジ!
findy_eventslides
0
420
コンテキストエンジニアリング入門〜AI Coding Agent作りで学ぶ文脈設計〜
kworkdev
PRO
3
1.6k
難しいセキュリティ用語をわかりやすくしてみた
yuta3110
0
250
All About Sansan – for New Global Engineers
sansan33
PRO
1
1.2k
新規事業におけるGORM+SQLx併用アーキテクチャ
hacomono
PRO
0
320
WEBサービスを成り立たせるAWSサービス
takano0131
1
180
AI Agent Dojo #2 watsonx Orchestrateフローの作成
oniak3ibm
PRO
0
130
CoRL 2025 Survey
harukiabe
1
210
Adminaで実現するISMS/SOC2運用の効率化 〜 アカウント管理編 〜
shonansurvivors
4
450
能登半島災害現場エンジニアクロストーク 【JAWS FESTA 2025 in 金沢】
ditccsugii
0
880
Featured
See All Featured
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
285
14k
Building a Scalable Design System with Sketch
lauravandoore
463
33k
Testing 201, or: Great Expectations
jmmastey
45
7.7k
Making the Leap to Tech Lead
cromwellryan
135
9.6k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.2k
Visualization
eitanlees
149
16k
What’s in a name? Adding method to the madness
productmarketing
PRO
24
3.7k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
253
22k
Raft: Consensus for Rubyists
vanstee
140
7.1k
Navigating Team Friction
lara
190
15k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
115
20k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
34
2.3k
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com