Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
270
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
410
Getting started with OCCRP Data
pudo
0
1.5k
#nr16: Recherche-Tools
pudo
1
100
data.occrp.org
pudo
0
150
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
240
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
280
Dr. Freezefile
pudo
2
400
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
「Chatwork」の認証基盤の移行とログ活用によるプロダクト改善
kubell_hr
1
100
成立するElixirの再束縛(再代入)可という選択
kubell_hr
0
950
本当に使える?AutoUpgrade の新機能を実践検証してみた
oracle4engineer
PRO
1
120
Prox Industries株式会社 会社紹介資料
proxindustries
0
210
rubygem開発で鍛える設計力
joker1007
1
130
“社内”だけで完結していた私が、AWS Community Builder になるまで
nagisa53
1
230
Welcome to the LLM Club
koic
0
140
変化する開発、進化する体系時代に適応するソフトウェアエンジニアの知識と考え方(JaSST'25 Kansai)
mizunori
0
140
Clineを含めたAIエージェントを 大規模組織に導入し、投資対効果を考える / Introducing AI agents into your organization
i35_267
4
1.4k
Amazon Bedrockで実現する 新たな学習体験
kzkmaeda
1
400
エンジニア向け技術スタック情報
kauche
0
110
【TiDB GAME DAY 2025】Shadowverse: Worlds Beyond にみる TiDB 活用術
cygames
0
890
Featured
See All Featured
Code Review Best Practice
trishagee
68
18k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
233
17k
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
Rails Girls Zürich Keynote
gr2m
94
14k
Mobile First: as difficult as doing things right
swwweet
223
9.7k
The Invisible Side of Design
smashingmag
299
51k
The Language of Interfaces
destraynor
158
25k
The World Runs on Bad Software
bkeepers
PRO
69
11k
Embracing the Ebb and Flow
colly
86
4.7k
Designing for Performance
lara
609
69k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
281
13k
Making Projects Easy
brettharned
116
6.3k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None