Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
260
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
400
Getting started with OCCRP Data
pudo
0
1.5k
#nr16: Recherche-Tools
pudo
1
96
data.occrp.org
pudo
0
150
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
240
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
280
Dr. Freezefile
pudo
2
390
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
開発組織のための セキュアコーディング研修の始め方
flatt_security
3
2.4k
Nekko Cloud、 これまでとこれから ~学生サークルが作る、 小さなクラウド
logica0419
2
980
ホワイトボードチャレンジ 説明&実行資料
ichimichi
0
130
「海外登壇」という 選択肢を与えるために 〜Gophers EX
logica0419
0
710
2/18/25: Java meets AI: Build LLM-Powered Apps with LangChain4j
edeandrea
PRO
0
130
偶然 × 行動で人生の可能性を広げよう / Serendipity × Action: Discover Your Possibilities
ar_tama
1
1.1k
リーダブルテストコード 〜メンテナンスしやすい テストコードを作成する方法を考える〜 #DevSumi #DevSumiB / Readable test code
nihonbuson
11
7.3k
JEDAI Meetup! Databricks AI/BI概要
databricksjapan
0
150
エンジニアが加速させるプロダクトディスカバリー 〜最速で価値ある機能を見つける方法〜 / product discovery accelerated by engineers
rince
4
380
PHPで印刷所に入稿できる名札データを作る / Generating Print-Ready Name Tag Data with PHP
tomzoh
0
110
クラウドサービス事業者におけるOSS
tagomoris
2
860
2.5Dモデルのすべて
yu4u
2
880
Featured
See All Featured
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
33
2.8k
[RailsConf 2023] Rails as a piece of cake
palkan
53
5.2k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
226
22k
VelocityConf: Rendering Performance Case Studies
addyosmani
328
24k
BBQ
matthewcrist
87
9.5k
Practical Orchestrator
shlominoach
186
10k
Scaling GitHub
holman
459
140k
4 Signs Your Business is Dying
shpigford
182
22k
How to train your dragon (web standard)
notwaldorf
91
5.8k
Into the Great Unknown - MozCon
thekraken
35
1.6k
GitHub's CSS Performance
jonrohan
1030
460k
YesSQL, Process and Tooling at Scale
rocio
172
14k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None