Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
– An investigative reporter
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
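To make the last term concrete, here is a minimal sketch of how regular expressions could cover the reporter's request: search documents for keywords and tag the ones that match. The keyword list, tag names, and sample documents are all hypothetical and do not come from the talk. It also assumes the documents are already plain text, for example after OCR.

```python
import re

# Hypothetical keyword-to-tag mapping; nothing here comes from the talk.
KEYWORDS = {
    "offshore": "finance",
    "shell company": "finance",
    "minister": "politics",
}


def compile_patterns(keywords):
    """Build one case-insensitive, whole-word regex per keyword."""
    return {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }


def tag_document(text, keywords=KEYWORDS):
    """Return the sorted tags whose keywords occur in the text."""
    patterns = compile_patterns(keywords)
    return sorted({keywords[kw] for kw, pat in patterns.items() if pat.search(text)})


# Hypothetical sample documents, assumed to be plain text already.
docs = {
    "memo.txt": "The Minister met the directors of an offshore fund.",
    "notes.txt": "Nothing relevant in this file.",
}
tags = {name: tag_document(text) for name, text in docs.items()}
```

Plain keyword matching like this only finds exact strings. Spotting names of people or companies that are not known in advance is the job of NER.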
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
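As a rough illustration of what "configurable pipelines" might mean in practice, the sketch below runs a document through processing steps chosen by a list of names. This is an assumption about the idea, not a design from the talk. All step names and functions are hypothetical, and the text-extraction step is a stand-in for a real OCR or PDF-extraction tool.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Step = Callable[[Document], Document]


def extract_text(doc: Document) -> Document:
    # Stand-in for OCR or PDF text extraction: the raw text is reused as-is.
    return {**doc, "text": str(doc.get("raw", ""))}


def normalize(doc: Document) -> Document:
    # Collapse runs of whitespace so later steps see clean text.
    return {**doc, "text": " ".join(str(doc["text"]).split())}


def count_words(doc: Document) -> Document:
    # Record a simple statistic alongside the text.
    return {**doc, "word_count": len(str(doc["text"]).split())}


# Registry of available steps; a configuration picks from these by name.
STEPS: Dict[str, Step] = {
    "extract_text": extract_text,
    "normalize": normalize,
    "count_words": count_words,
}


def run_pipeline(doc: Document, config: List[str]) -> Document:
    """Apply the configured steps to the document, in order."""
    for name in config:
        doc = STEPS[name](doc)
    return doc


result = run_pipeline(
    {"raw": "Data  doesn't grow\nin tables"},
    ["extract_text", "normalize", "count_words"],
)
```

Because the configuration is just a list of step names, steps can be reordered or swapped without changing the code that runs them.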
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo