Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Niek Bartholomeus
October 02, 2019
Technology
190
0
Share
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.6k
From idea to production with NLP, Scala and Spark
niekbartho
3
510
Going DevOps with BMC
niekbartho
0
210
Orchestration in meatspace
niekbartho
4
2.1k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
500
DevOps for Dinosaurs
niekbartho
12
3.1k
Other Decks in Technology
See All in Technology
まだ道半ば、AI-DLCを歩み始めている話
news_it_enj
2
170
サプライチェーン攻撃への備えについて考えている #湘なんか
stefafafan
3
2.4k
コーディングエージェントはTypeScriptの 型エラーをどう自己修正しているのか
melonps
4
480
TypeScriptとAngular Signal で実現する保守性の高いアプリケーション設計 - 3層アーキテクチャによる責務分離の実践(たつかわ) https://2026.tskaigi.org/talks/10
nealle
1
340
AI時代に求められる思考のパラダイムシフト
nrinetcom
PRO
1
140
AI時代の私の技術インプットとアウトプット術
tonkotsuboy_com
3
1.3k
老舗OCIクラウドインテグレーターが語る-現場で培ったクラウドリフトのリアルと成功のカギ
shinpy
0
120
TSKaigi 2026 - Auth.jsからBetter Authへの 移行に見る「型とランタイム」の 設計思想の変化
teamlab
PRO
1
260
データ基盤構築・運用の現場から 〜 Snowflake Intelligence 導入で変わった、データ活用の未来 〜
wonohe
0
180
GitHub Copilot CLI の Rubber Duck 機能を使ってコーディングの品質をあげよう #techbaton_findy
stefafafan
2
1k
[みん強]AIの価値を最大化するデータ基盤戦略:Self-Service型Data Meshへの転換とAgentic AI Meshに向けた取り組み with Snowflake他
y_matsubara
1
180
類似画像検索モデルの開発ノウハウ
lycorptech_jp
PRO
3
800
Featured
See All Featured
Have SEOs Ruined the Internet? - User Awareness of SEO in 2025
akashhashmi
0
350
Exploring the relationship between traditional SERPs and Gen AI search
raygrieselhuber
PRO
2
4k
Why Mistakes Are the Best Teachers: Turning Failure into a Pathway for Growth
auna
0
140
DBのスキルで生き残る技術 - AI時代におけるテーブル設計の勘所
soudai
PRO
65
54k
Applied NLP in the Age of Generative AI
inesmontani
PRO
4
2.3k
Speed Design
sergeychernyshev
33
1.7k
Utilizing Notion as your number one productivity tool
mfonobong
4
310
GraphQLの誤解/rethinking-graphql
sonatard
75
12k
Jamie Indigo - Trashchat’s Guide to Black Boxes: Technical SEO Tactics for LLMs
techseoconnect
PRO
0
140
The Pragmatic Product Professional
lauravandoore
37
7.3k
Agile that works and the tools we love
rasmusluckow
331
21k
Unlocking the hidden potential of vector embeddings in international SEO
frankvandijk
0
810
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com