Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables: Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
– An investigative reporter
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
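To make the reporter's request concrete, here is a minimal sketch of keyword search and tagging with regular expressions. It assumes the documents have already been converted to plain text (for scans, that would be the OCR step); the function name `tag_documents`, the sample file names, and the sample keywords are all hypothetical and not taken from the talk.

```python
import re


def tag_documents(docs, keywords):
    """Map each document name to the sorted list of keywords found in its text.

    `docs` is a dict of {name: plain text}; documents with no hits are omitted.
    """
    # One case-insensitive, whole-word pattern per keyword.
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    tags = {}
    for name, text in docs.items():
        found = sorted(kw for kw, pat in patterns.items() if pat.search(text))
        if found:
            tags[name] = found
    return tags


# Hypothetical sample data, for illustration only.
docs = {
    "memo.txt": "Payment routed through the offshore account.",
    "letter.txt": "Nothing of note here.",
    "contract.txt": "The Offshore entity signed the contract; payment is due.",
}
tags = tag_documents(docs, ["offshore", "payment"])
```

Holding 40 GB of text in a Python dict is not realistic; at that scale a search index would likely replace the in-memory loop, but the matching and tagging idea stays the same.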
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo