Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Friedrich Lindenberg
July 16, 2014
Technology
2
290
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
430
Getting started with OCCRP Data
pudo
0
1.6k
#nr16: Recherche-Tools
pudo
1
120
data.occrp.org
pudo
0
170
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
250
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
300
Dr. Freezefile
pudo
2
440
Intro presentation for Naivasha
pudo
1
180
Other Decks in Technology
See All in Technology
visionOS 開発向けの MCP / Skills をつくり続けることで XR の探究と学習を最大化
karad
1
1.2k
GitHub Copilot CLI で Azure Portal to Bicep
tsubakimoto_s
0
180
2026年もソフトウェアサプライチェーンのリスクに立ち向かうために / Product Security Square #3
flatt_security
1
750
めちゃくちゃ開発するQAエンジニアになって感じたメリットとこれからの課題感
ryuhei0000yamamoto
0
270
エンジニアリングマネージャーの仕事
yuheinakasaka
0
130
DDD×仕様駆動で回す高品質開発のプロセス設計
littlehands
5
2.2k
Goのerror型がシンプルであることの恩恵について理解する
yamatai1212
1
290
中央集権型を脱却した話 分散型をやめて、連邦型にたどり着くまで
sansantech
PRO
1
200
Bill One 開発エンジニア 紹介資料
sansan33
PRO
5
18k
Astro Islandsの 内部実装を 「日本で一番わかりやすく」 ざっくり解説!
knj
0
220
俺の/私の最強アーキテクチャ決定戦開催 ― チームで新しいアーキテクチャに適合していくために / 20260322 Naoki Takahashi
shift_evolve
PRO
1
410
スピンアウト講座04_ルーティン処理
overflowinc
0
980
Featured
See All Featured
Designing for humans not robots
tammielis
254
26k
HDC tutorial
michielstock
1
580
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.7k
The untapped power of vector embeddings
frankvandijk
2
1.6k
The Language of Interfaces
destraynor
162
26k
Believing is Seeing
oripsolob
1
93
Navigating the Design Leadership Dip - Product Design Week Design Leaders+ Conference 2024
apolaine
0
240
svc-hook: hooking system calls on ARM64 by binary rewriting
retrage
2
180
The Hidden Cost of Media on the Web [PixelPalooza 2025]
tammyeverts
2
250
Bootstrapping a Software Product
garrettdimon
PRO
307
120k
Producing Creativity
orderedlist
PRO
348
40k
The Cost Of JavaScript in 2023
addyosmani
55
9.8k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None