Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
260
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
400
Getting started with OCCRP Data
pudo
0
1.5k
#nr16: Recherche-Tools
pudo
1
94
data.occrp.org
pudo
0
140
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
230
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
270
Dr. Freezefile
pudo
2
380
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
AWS re:Invent 2024で発表された コードを書く開発者向け機能について
maruto
0
200
How to be an AWS Community Builder | 君もAWS Community Builderになろう!〜2024 冬 CB募集直前対策編?!〜
coosuke
PRO
2
2.8k
どちらを使う?GitHub or Azure DevOps Ver. 24H2
kkamegawa
0
860
PHPerのための計算量入門/Complexity101 for PHPer
hanhan1978
5
200
生成AIのガバナンスの全体像と現実解
fnifni
1
190
KubeCon NA 2024 Recap: How to Move from Ingress to Gateway API with Minimal Hassle
ysakotch
0
210
KnowledgeBaseDocuments APIでベクトルインデックス管理を自動化する
iidaxs
1
270
多領域インシデントマネジメントへの挑戦:ハードウェアとソフトウェアの融合が生む課題/Challenge to multidisciplinary incident management: Issues created by the fusion of hardware and software
bitkey
PRO
2
110
大幅アップデートされたRagas v0.2をキャッチアップ
os1ma
2
540
Wantedly での Datadog 活用事例
bgpat
1
520
社外コミュニティで学び社内に活かす共に学ぶプロジェクトの実践/backlogworld2024
nishiuma
0
270
KubeCon NA 2024 Recap / Running WebAssembly (Wasm) Workloads Side-by-Side with Container Workloads
z63d
1
250
Featured
See All Featured
Building a Scalable Design System with Sketch
lauravandoore
460
33k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
665
120k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
132
33k
jQuery: Nuts, Bolts and Bling
dougneiner
61
7.5k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
2
290
Agile that works and the tools we love
rasmusluckow
328
21k
Mobile First: as difficult as doing things right
swwweet
222
9k
Building Adaptive Systems
keathley
38
2.3k
Visualization
eitanlees
146
15k
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
169
50k
Scaling GitHub
holman
458
140k
GraphQLの誤解/rethinking-graphql
sonatard
67
10k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None