Upgrade to PRO for Only $50/Year—Limited-Time Offer! 🔥
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
270
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
420
Getting started with OCCRP Data
pudo
0
1.6k
#nr16: Recherche-Tools
pudo
1
110
data.occrp.org
pudo
0
160
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
250
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
290
Dr. Freezefile
pudo
2
420
Intro presentation for Naivasha
pudo
1
170
Other Decks in Technology
See All in Technology
Digitization部 紹介資料
sansan33
PRO
1
6.1k
TypeScript×CASLでつくるSaaSの認可 / Authz with CASL
saka2jp
2
180
なぜフロントエンド技術を追うのか?なぜカンファレンスに参加するのか?
sakito
8
1.9k
"'TSのAPI型安全”の対価は誰が払う?不公平なスキーマ駆動に終止符を打つハイブリッド戦略
hal_spidernight
0
220
Oracle Database@Google Cloud:サービス概要のご紹介
oracle4engineer
PRO
0
620
.NET 10 のパフォーマンス改善
nenonaninu
2
4.3k
命名から始めるSpec Driven
kuruwic
3
790
Sansan Engineering Unit 紹介資料
sansan33
PRO
1
3.3k
都市スケールAR制作で気をつけること
segur
0
220
AI エージェント活用のベストプラクティスと今後の課題
asei
2
450
あなたの知らないDateのひみつ / The Secret of "Date" You Haven't known #tqrk16
expajp
0
110
小規模チームによる衛星管制システムの開発とスケーラビリティの実現
sankichi92
0
180
Featured
See All Featured
Automating Front-end Workflow
addyosmani
1371
200k
Producing Creativity
orderedlist
PRO
348
40k
Writing Fast Ruby
sferik
630
62k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
666
130k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
249
1.3M
YesSQL, Process and Tooling at Scale
rocio
174
15k
Docker and Python
trallard
46
3.7k
The Straight Up "How To Draw Better" Workshop
denniskardys
239
140k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
31
3k
How STYLIGHT went responsive
nonsquared
100
5.9k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
25
1.6k
Measuring & Analyzing Core Web Vitals
bluesmoon
9
690
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None