Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables
Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
– An investigative reporter
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
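The reporter's request (search the documents for keywords, then tag them) can be sketched with regular expressions, the last term in the lingo list. This is a minimal illustration, not a tool from the talk: the document names, sample text, tag names, and the `tag_documents` helper are all hypothetical, and it assumes the documents have already been converted to plain text (for example by OCR).

```python
import re

# Hypothetical corpus: in practice this text would come from OCR or PDF extraction.
documents = {
    "doc-001.txt": "Payment of 40,000 EUR was sent to Acme Holdings Ltd. on 12 March.",
    "doc-002.txt": "Minutes of the annual meeting. No transfers were discussed.",
    "doc-003.txt": "ACME HOLDINGS received a second payment via an offshore account.",
}

# Hypothetical tags, each mapped to a case-insensitive regular expression.
tag_patterns = {
    "acme": re.compile(r"\bacme\s+holdings\b", re.IGNORECASE),
    "payment": re.compile(r"\bpayments?\b", re.IGNORECASE),
    "offshore": re.compile(r"\boffshore\b", re.IGNORECASE),
}


def tag_documents(docs, patterns):
    """Map each document name to the sorted list of tags whose pattern matches its text."""
    tagged = {}
    for name, text in docs.items():
        tagged[name] = sorted(
            tag for tag, pattern in patterns.items() if pattern.search(text)
        )
    return tagged


tags = tag_documents(documents, tag_patterns)
for name, found in tags.items():
    print(name, found)
```

A loop like this reads every document on every query, so it only suits small collections. For something on the scale of the 40 GB in the quote, a search index is the more realistic choice.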
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo