Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
250
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
400
Getting started with OCCRP Data
pudo
0
1.4k
#nr16: Recherche-Tools
pudo
1
93
data.occrp.org
pudo
0
140
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
230
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
270
Dr. Freezefile
pudo
2
370
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
DynamoDB でスロットリングが発生したとき_大盛りver/when_throttling_occurs_in_dynamodb_long
emiki
1
430
CysharpのOSS群から見るModern C#の現在地
neuecc
2
3.5k
『Firebase Dynamic Links終了に備える』 FlutterアプリでのAdjust導入とDeeplink最適化
techiro
0
130
第1回 国土交通省 データコンペ参加者向け勉強会③- Snowflake x estie編 -
estie
0
130
OCI Security サービス 概要
oracle4engineer
PRO
0
6.5k
B2B SaaSから見た最近のC#/.NETの進化
sansantech
PRO
0
890
Adopting Jetpack Compose in Your Existing Project - GDG DevFest Bangkok 2024
akexorcist
0
110
OCI Vault 概要
oracle4engineer
PRO
0
9.7k
リンクアンドモチベーション ソフトウェアエンジニア向け紹介資料 / Introduction to Link and Motivation for Software Engineers
lmi
4
300k
Taming you application's environments
salaboy
0
190
AI前提のサービス運用ってなんだろう?
ryuichi1208
8
1.4k
Exadata Database Service on Dedicated Infrastructure(ExaDB-D) UI スクリーン・キャプチャ集
oracle4engineer
PRO
2
3.2k
Featured
See All Featured
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
16
2.1k
How to Think Like a Performance Engineer
csswizardry
20
1.1k
Gamification - CAS2011
davidbonilla
80
5k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
44
2.2k
How GitHub (no longer) Works
holman
310
140k
Faster Mobile Websites
deanohume
305
30k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
131
33k
RailsConf 2023
tenderlove
29
900
Facilitating Awesome Meetings
lara
50
6.1k
What’s in a name? Adding method to the madness
productmarketing
PRO
22
3.1k
Bash Introduction
62gerente
608
210k
Java REST API Framework Comparison - PWX 2021
mraible
PRO
28
8.2k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None