Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Technology
Transcript
Data doesn’t grow in tables: Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
– An investigative reporter
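The kind of request quoted here (search a pile of documents for keywords, then tag the matches) can be sketched in a few lines of Python. This is only a minimal illustration, not the approach taken by any tool named in the talk. It assumes the documents have already been converted to plain-text `.txt` files (for example by OCR), and the function names `find_keywords` and `tag_documents` are hypothetical.

```python
import re
from pathlib import Path


def find_keywords(text, keywords):
    """Return the sorted list of keywords that occur in text (case-insensitive)."""
    found = []
    for kw in keywords:
        # \b keeps "bank" from matching inside "embankment"; this assumes
        # each keyword starts and ends with a letter or digit.
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        if pattern.search(text):
            found.append(kw)
    return sorted(found)


def tag_documents(doc_dir, keywords):
    """Map each .txt file name in doc_dir to the keywords found in it.

    Files without any match are left out of the result.
    """
    tags = {}
    for path in sorted(Path(doc_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        found = find_keywords(text, keywords)
        if found:
            tags[path.name] = found
    return tags
```

Calling `tag_documents("extracted_text/", ["offshore", "invoice"])` (a hypothetical directory and keyword list) would return a dictionary of file names and the keywords each contains. A scan like this re-reads every file on each query, so at 40 GB a search index would be the more practical choice.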
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
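As a small illustration of the last term, regular expressions can pull structured values such as dates and amounts out of free text. The patterns, the sample sentence and the function names below are assumptions made for this sketch, not material from the slides. They cover only `DD.MM.YYYY` or `DD/MM/YYYY` dates and amounts prefixed with `USD` or `EUR`.

```python
import re

# Dates written as DD.MM.YYYY or DD/MM/YYYY (illustrative pattern only).
DATE_RE = re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{4}\b")

# Amounts such as "USD 1,250,000.00" or "EUR 500" (illustrative pattern only).
AMOUNT_RE = re.compile(r"(?:USD|EUR)\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")


def extract_dates(text):
    """Return all date-like strings found in text."""
    return DATE_RE.findall(text)


def extract_amounts(text):
    """Return all currency amounts found in text."""
    return AMOUNT_RE.findall(text)


sample = "Payment of USD 1,250,000.00 was made on 16.07.2014 to the contractor."
print(extract_dates(sample))    # ['16.07.2014']
print(extract_amounts(sample))  # ['USD 1,250,000.00']
```

Patterns like these are brittle on OCR output, where characters are often misread, so they usually need loosening for real document sets.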
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise: http://tabula.nerdpower.org
Docs in a cloud: http://documentcloud.org
Clustering, tagging, mining: http://overview.ap.org
Let them eat PDF: https://github.com/CrowData
All the visuals: Jigsaw
Spoken word magic: http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo