Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
0
150
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.5k
From idea to production with NLP, Scala and Spark
niekbartho
3
460
Going DevOps with BMC
niekbartho
0
180
Orchestration in meatspace
niekbartho
4
1.9k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
420
DevOps for Dinosaurs
niekbartho
12
2.9k
Other Decks in Technology
See All in Technology
ISUCONにPHPで挑み続けてできるようになっ(てき)たこと / phperkaigi2025
blue_goheimochi
0
120
RF問の対策をした話
bata_24
0
140
ClineにNext.jsのプロジェクト改善をお願いしてみた / 20250321_reacttokyo_LT
optim
1
990
Redefine_Possible
upsider_tech
0
130
職種に名前が付く、ということ/The fact that a job title has a name
bitkey
1
180
AWS のポリシー言語 Cedar を活用した高速かつスケーラブルな認可技術の探求 #phperkaigi / PHPerKaigi 2025
ytaka23
7
1.3k
ランチの間に GitHub Copilot Agent が仕事を終わらせてくれた話
bicstone
5
670
IAMのマニアックな話 2025 ~40分バージョン ~
nrinetcom
PRO
4
370
Oracle Cloud Infrastructure:2025年3月度サービス・アップデート
oracle4engineer
PRO
0
280
目次機能実装から理解するLexical Editor
wtdlee
0
120
コンテナ上シェル悪用の話とPure Bashでcurlが作れた話
ryotosaito
2
390
【Oracle Cloud ウェビナー】VMware環境を短期間でクラウド化!ベネッセ様事例に学ぶ仮想環境クラウド移行のリアル
oracle4engineer
PRO
2
110
Featured
See All Featured
Music & Morning Musume
bryan
46
6.4k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
30
1.1k
Building a Scalable Design System with Sketch
lauravandoore
462
33k
Learning to Love Humans: Emotional Interface Design
aarron
273
40k
Java REST API Framework Comparison - PWX 2021
mraible
29
8.5k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.5k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
3.8k
Practical Orchestrator
shlominoach
186
10k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
2.1k
Embracing the Ebb and Flow
colly
84
4.6k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
160
15k
VelocityConf: Rendering Performance Case Studies
addyosmani
328
24k
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com