Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
An investigative reporter:
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
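To make the last item concrete, and to connect it to the reporter's request above (keyword search plus tagging), here is a minimal sketch in Python using only the standard `re` module. The sample documents, the tag names, and the `tag_documents` helper are all hypothetical and chosen for illustration; they are not taken from the talk. In practice the text would come from OCR output rather than from literal strings.

```python
import re

# Hypothetical sample documents standing in for OCR'd text.
documents = {
    "doc1.txt": "Payment of 5,000 USD was wired to Acme Holdings Ltd.",
    "doc2.txt": "Minutes of the annual meeting. No payments discussed.",
    "doc3.txt": "ACME HOLDINGS received a second payment in March.",
}

# Hypothetical mapping of tag name -> regular expression.
# \b marks a word boundary, \s+ allows any run of whitespace,
# and IGNORECASE makes the match case-insensitive.
TAG_PATTERNS = {
    "acme": re.compile(r"\bacme\s+holdings\b", re.IGNORECASE),
    "payment": re.compile(r"\bpayments?\b", re.IGNORECASE),
}


def tag_documents(docs, patterns):
    """Return {document name: sorted list of tags whose pattern matches}."""
    return {
        name: sorted(
            tag for tag, pattern in patterns.items() if pattern.search(text)
        )
        for name, text in docs.items()
    }


tags = tag_documents(documents, TAG_PATTERNS)
for name, found in tags.items():
    print(name, found)
```

Regular expressions like these only match surface text; recognizing that two differently spelled names refer to the same company is the kind of problem NER and related NLP techniques are aimed at.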
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
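The slide does not say what a configurable pipeline would look like. One possible reading, sketched below under that assumption, is a sequence of text-processing stages that is chosen by a configuration list rather than hard-coded. Every stage name and function here is hypothetical; a real pipeline would include steps such as OCR and indexing, which are left out to keep the sketch self-contained.

```python
import re


def strip_page_numbers(text):
    """Remove markers such as 'Page 3' (an assumed artifact format)."""
    return re.sub(r"\bpage \d+\b", "", text, flags=re.IGNORECASE)


def normalize_whitespace(text):
    """Collapse runs of spaces and newlines into single spaces."""
    return " ".join(text.split())


def lowercase(text):
    """Lowercase the text so later keyword matching is case-insensitive."""
    return text.lower()


# Registry of available stages, keyed by the name used in the configuration.
STAGES = {
    "strip_page_numbers": strip_page_numbers,
    "normalize_whitespace": normalize_whitespace,
    "lowercase": lowercase,
}


def build_pipeline(stage_names):
    """Turn a list of stage names into one function that applies them in order."""
    stages = [STAGES[name] for name in stage_names]

    def run(text):
        for stage in stages:
            text = stage(text)
        return text

    return run


# The configuration is plain data, so it could equally be loaded from a file.
config = ["strip_page_numbers", "normalize_whitespace", "lowercase"]
pipeline = build_pipeline(config)
result = pipeline("Page 3   Annual   REPORT\n of Acme")
print(result)
```

Keeping the configuration as a plain list of names means stages can be reordered or dropped per document set without changing code.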
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo