Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
openthebox.be - smart publications
Search
Niek Bartholomeus
October 02, 2019
Technology
0
150
openthebox.be - smart publications
Extracting deep insights from boring documents: a real-life story
Niek Bartholomeus
October 02, 2019
Tweet
Share
More Decks by Niek Bartholomeus
See All by Niek Bartholomeus
openthebox.be
niekbartho
1
2.5k
From idea to production with NLP, Scala and Spark
niekbartho
3
450
Going DevOps with BMC
niekbartho
0
180
Orchestration in meatspace
niekbartho
4
1.9k
Self-organization vs. global optimization - a comparison between traditional and modern organizations
niekbartho
2
420
DevOps for Dinosaurs
niekbartho
12
2.9k
Other Decks in Technology
See All in Technology
困難を「一般解」で解く
fujiwara3
8
2.5k
AIエージェント時代のエンジニアになろう #jawsug #jawsdays2025 / 20250301 Agentic AI Engineering
yoshidashingo
9
4.4k
IAMのマニアックな話2025
nrinetcom
PRO
6
1.6k
Global Databaseで実現するマルチリージョン自動切替とBlue/Greenデプロイ
j2yano
0
200
OCI Success Journey OCIの何が評価されてる?疑問に答える事例セミナー(2025年2月実施)
oracle4engineer
PRO
2
270
OSSの実装を参考にBedrockエージェントを作る
moritalous
2
300
アウトカムを最大化させるプロダクトエンジニアの動き
hacomono
PRO
0
110
完璧を捨てろ! “攻め”のQAがもたらすスピードと革新/20250306 Hiroki Hachisuka
shift_evolve
0
160
エンジニア主導の企画立案を可能にする組織とは?
recruitengineers
PRO
1
350
AIエージェント開発のノウハウと課題
pharma_x_tech
9
5.6k
生成AIがローコードツールになる時代の エンジニアの役割を考える
khwada
0
350
貧民的プログラミングのすすめ
kakehashi
PRO
2
300
Featured
See All Featured
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
44
7.1k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
30
4.6k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
45
9.4k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
160
15k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
13
1k
Fontdeck: Realign not Redesign
paulrobertlloyd
83
5.4k
Done Done
chrislema
182
16k
Navigating Team Friction
lara
183
15k
Side Projects
sachag
452
42k
Speed Design
sergeychernyshev
28
820
The Invisible Side of Design
smashingmag
299
50k
StorybookのUI Testing Handbookを読んだ
zakiyama
28
5.5k
Transcript
openthebox.be Extracting deep insights from 'boring' documents: a real-life story
Me Niek Bartholomeus @niekbartho • Background as a software developer
• Switched to data science and natural language processing in 2016 • Founded openthebox.be in 2017
openthebox.be
openthebox.be Open data KBO NBB Belgian Official Gazette http://kbopub.economie.fgov.be/kbopub https://cri.nbb.be/bc9/web/catalog
http://www.ejustice.just.fgov.be/ tsv/tsvn.htm
knowledge graph Visualization Analytics Machine learning Knowledge graph Structured data
Unstructured data KBO NBB Belgian Official Gazette
Unstructured data - pipeline
Unstructured data - pipeline steps 1] OCR 2] NER 4]
Entity linking 3] Relation extraction
Unstructured data - pipeline steps 1] OCR
Unstructured data - pipeline steps 2] NER
Unstructured data - pipeline steps 2] NER Pre-processing rules: [“1.Jan”,
“Janssens”] 1.Jan Janssens [“Marktstraat”, “54,8450”, “Bredene”] Marktstraat 54,8450 Bredene
Unstructured data - pipeline steps 2] NER Post-processing rules: +
= General rules Legal rules Historic probabilities Faulty publication Context Improved publication
Unstructured data - pipeline steps 2] NER Organization Person Inheritance:
Notary Owner Representative Proxy holder Administrator Author : “is a” relationship Base labels Subclass labels
Unstructured data - pipeline steps 2] NER Gentstraat 69 Niek
Roger Camiel Bartholomeus Sub entity extraction: First name: Niek Middle names: Roger, Camiel Last name: Bartholomeus 9170 Sint-Pauwels Street: Gentstraat Number: 69 Zip code: 9170 City: Sint-Pauwels
Unstructured data - pipeline steps 3] Relation extraction
Unstructured data - pipeline steps 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Niek
Bartholomeus N. Bartholomeus Bartholomeus } Niek Roger Camiel Bartholomeus Deduplication: 4] Entity linking
Unstructured data - pipeline steps Niek Roger Camiel Bartholomeus Link
with knowledge graph: Gentstraat 69 9170 Sint-Pauwels 4] Entity linking
openthebox.be
openthebox.be Bigger picture
openthebox.be http://wpmlabs.com/ Academia Industry https://www.filter-concept.com/ +
openthebox.be https://opensenselabs.com