Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
190
0
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.7k
From idea to production with NLP, Scala and Spark
niekbartho
3
510
Going DevOps with BMC
niekbartho
0
220
Orchestration in meatspace
niekbartho
4
2.1k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
510
DevOps for Dinosaurs
niekbartho
12
3.1k
Other Decks in Technology
See All in Technology
AIの性能が向上しても未解決な組織の重大問題は何か?/An Unsolved Organizational Problem in the Age of AI
moriyuya
3
600
Building applications in the Gemini API family.
line_developers_tw
PRO
0
2.8k
SIer20年! 培ったスキルがスタートアップで輝く時
shucho0103
0
830
2026TECHFRESH畢業分享會 - 原生還是跨平台? App 開發踩坑實錄
line_developers_tw
PRO
0
690
「速く作る」から「正しく作る」へ ─ 生成AI時代の開発フロー改革の ロードマップと実行 ─
starfish719
0
9.7k
タクシーアプリ『GO』の実践的データ活用
mot_techtalk
3
190
データサイエンスを価値につなげるプロジェクト設計 〜 DS一年目が現場で得た気づき 〜
ysd113
1
120
新しいVibe Codingと”自走”について
watany
5
290
Oracle AI Database@AWS:サービス概要のご紹介
oracle4engineer
PRO
4
2.9k
価格.comをAI駆動で全面刷新する ー 30年分の技術的負債を返し、次の30年の土台をつくる ー / AI Engineering Summit Tokyo 2026
tkyowa
53
59k
2026TECHFRESH畢業分享會 - Lightning Talk - 打造精準高效的 MCP 設計模式與測試實務
line_developers_tw
PRO
0
680
Oracle AI Database@Azure:サービス概要のご紹介
oracle4engineer
PRO
6
1.9k
Featured
See All Featured
GitHub's CSS Performance
jonrohan
1033
470k
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
830
Test your architecture with Archunit
thirion
1
2.3k
Conquering PDFs: document understanding beyond plain text
inesmontani
PRO
4
2.8k
How to Grow Your eCommerce with AI & Automation
katarinadahlin
PRO
1
200
Building Applications with DynamoDB
mza
96
7.1k
GraphQLの誤解/rethinking-graphql
sonatard
75
12k
Automating Front-end Workflow
addyosmani
1370
210k
Connecting the Dots Between Site Speed, User Experience & Your Business [WebExpo 2025]
tammyeverts
11
940
State of Search Keynote: SEO is Dead Long Live SEO
ryanjones
0
200
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
25
1.9k
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
2
1.5k
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com