Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
270
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
420
Getting started with OCCRP Data
pudo
0
1.6k
#nr16: Recherche-Tools
pudo
1
100
data.occrp.org
pudo
0
160
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
240
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
280
Dr. Freezefile
pudo
2
410
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
AI Agentと MCP Serverで実現する iOSアプリの 自動テスト作成の効率化
spiderplus_cb
0
510
BtoBプロダクト開発の深層
16bitidol
0
360
職種別ミートアップで社内から盛り上げる アウトプット文化の醸成と関係強化/ #DevRelKaigi
nishiuma
2
140
GC25 Recap+: Advancing Go Garbage Collection with Green Tea
logica0419
1
420
バイブコーディングと継続的デプロイメント
nwiizo
2
440
リーダーになったら未来を語れるようになろう/Speak the Future
sanogemaru
0
290
SOC2取得の全体像
shonansurvivors
1
400
about #74462 go/token#FileSet
tomtwinkle
1
400
GA technologiesでのAI-Readyの取り組み@DataOps Night
yuto16
0
270
小学4年生夏休みの自由研究「ぼくと Copilot エージェント」
taichinakamura
0
400
OCI Network Firewall 概要
oracle4engineer
PRO
1
7.8k
Modern_Data_Stack最新動向クイズ_買収_AI_激動の2025年_.pdf
sagara
0
220
Featured
See All Featured
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.5k
Stop Working from a Prison Cell
hatefulcrawdad
271
21k
Optimising Largest Contentful Paint
csswizardry
37
3.4k
Building Better People: How to give real-time feedback that sticks.
wjessup
368
20k
Designing for Performance
lara
610
69k
Done Done
chrislema
185
16k
GraphQLの誤解/rethinking-graphql
sonatard
73
11k
The Cult of Friendly URLs
andyhume
79
6.6k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
9
850
A Modern Web Designer's Workflow
chriscoyier
697
190k
Testing 201, or: Great Expectations
jmmastey
45
7.7k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None