Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
0
160
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.5k
From idea to production with NLP, Scala and Spark
niekbartho
3
470
Going DevOps with BMC
niekbartho
0
190
Orchestration in meatspace
niekbartho
4
2k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
450
DevOps for Dinosaurs
niekbartho
12
3k
Other Decks in Technology
See All in Technology
Snowflake Intelligenceという名のAI Agentが切り開くデータ活用の未来とその実現に必要なこと@SnowVillage『Data Management #1 Summit 2025 Recap!!』
ryo_suzuki
1
160
Amazon SNSサブスクリプションの誤解除を防ぐ
y_sakata
3
190
Data Engineering Study#30 LT資料
tetsuroito
1
180
Four Keysから始める信頼性の改善 - SRE NEXT 2025
ozakikota
0
410
「現場で活躍するAIエージェント」を実現するチームと開発プロセス
tkikuchi1002
3
300
「Chatwork」のEKS環境を支えるhelmfileを使用したマニフェスト管理術
hanayo04
1
400
助けて! XからWaylandに移行しないと新しいGNOMEが使えなくなっちゃう 2025-07-12
nobutomurata
2
200
cdk initで生成されるあのファイル達は何なのか/cdk-init-generated-files
tomoki10
1
670
研究開発部メンバーの働き⽅ / Sansan R&D Profile
sansan33
PRO
3
18k
名刺メーカーDevグループ 紹介資料
sansan33
PRO
0
820
shake-upを科学する
rsakata
7
1k
AWS CDK 入門ガイド これだけは知っておきたいヒント集
anank
5
750
Featured
See All Featured
Building a Modern Day E-commerce SEO Strategy
aleyda
42
7.4k
We Have a Design System, Now What?
morganepeng
53
7.7k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
Build The Right Thing And Hit Your Dates
maggiecrowley
37
2.8k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
35
2.4k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
32
2.4k
Docker and Python
trallard
45
3.5k
Faster Mobile Websites
deanohume
308
31k
The Power of CSS Pseudo Elements
geoffreycrofte
77
5.9k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
970
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
7
750
The Cost Of JavaScript in 2023
addyosmani
51
8.6k
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com