Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Transcript
Data doesn’t grow in tables: Dealing with large sets of documents
“We're working with 40 GB of XXX and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs.”
- An investigative reporter
Some lingo
• OCR (Optical Character Recognition)
• NLP (Natural Language Processing)
• NER (Named Entity Recognition)
• Regular Expressions
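As a rough illustration of how regular expressions cover the reporter's request (keyword search plus tagging), here is a minimal Python sketch. It is not from the talk: the function name `tag_documents`, the sample file names, and the keywords are all hypothetical, and it assumes the documents have already been converted to plain text (for scans, via OCR).

```python
import re

def tag_documents(docs, keywords):
    """Return {doc_id: sorted list of keywords found in that document}."""
    # One case-insensitive pattern per keyword, matched on word boundaries.
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    tags = {}
    for doc_id, text in docs.items():
        found = sorted(kw for kw, pat in patterns.items() if pat.search(text))
        if found:
            tags[doc_id] = found
    return tags

# Hypothetical sample data standing in for extracted document text.
docs = {
    "memo-1.txt": "The contract was signed by Acme Holdings in March.",
    "memo-2.txt": "No relevant parties are mentioned here.",
    "memo-3.txt": "Payment from ACME HOLDINGS routed via an offshore account.",
}
tags = tag_documents(docs, ["Acme Holdings", "offshore"])
```

Documents with no matching keyword are left out of the result, so the returned mapping doubles as a shortlist of files worth reading. At 40 GB a search index would likely be a better fit than scanning every file per query; this sketch only shows the matching and tagging logic.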
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise: http://tabula.nerdpower.org
Docs in a cloud: http://documentcloud.org
Clustering, tagging, mining: http://overview.ap.org
Let them eat PDF: https://github.com/CrowData
All the visuals: Jigsaw
Spoken word magic: http://sayit.mysociety.org/
What's missing?
• Easy-to-use ElasticSearch
• Commercial-grade OCR
• Configurable pipelines
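To make "configurable pipelines" concrete, here is one possible reading, sketched in Python. It is an assumption, not something described in the talk: each processing step is a small function, and a list of step names (which could come from a config file) decides which steps run and in what order. All step names and the sample text are hypothetical.

```python
import re

def normalize_whitespace(doc):
    # Collapse runs of spaces and newlines left over from text extraction.
    doc["text"] = " ".join(doc["text"].split())
    return doc

def extract_emails(doc):
    # Deliberately simple pattern; real-world address matching is messier.
    doc["emails"] = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", doc["text"])
    return doc

def count_words(doc):
    doc["word_count"] = len(doc["text"].split())
    return doc

# Registry of available steps, keyed by the names a config would use.
STEPS = {
    "normalize_whitespace": normalize_whitespace,
    "extract_emails": extract_emails,
    "count_words": count_words,
}

def run_pipeline(doc, step_names):
    """Apply the named steps in order to a copy of the document record."""
    doc = dict(doc)
    for name in step_names:
        doc = STEPS[name](doc)
    return doc

config = ["normalize_whitespace", "extract_emails", "count_words"]
result = run_pipeline({"text": "Contact   jane@example.org\nfor details."}, config)
```

Changing the pipeline then means editing the `config` list rather than the code, which is the property the slide asks for.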
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer
Friedrich Lindenberg, codeforafrica.org, @pudo