Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data doesn't grow in tables
Search
Friedrich Lindenberg
July 16, 2014
Technology
2
260
Data doesn't grow in tables
Friedrich Lindenberg
July 16, 2014
Tweet
Share
More Decks by Friedrich Lindenberg
See All by Friedrich Lindenberg
Introducción a OCCRP Data
pudo
0
400
Getting started with OCCRP Data
pudo
0
1.5k
#nr16: Recherche-Tools
pudo
1
94
data.occrp.org
pudo
0
140
Tools for Data Journalism | MediaLab Prado DDJ Workshop
pudo
0
240
Digitial Research Tools for Investigative Reporters
pudo
0
11k
Grano: A Python tool for investigating influence
pudo
1
270
Dr. Freezefile
pudo
2
380
Intro presentation for Naivasha
pudo
1
160
Other Decks in Technology
See All in Technology
Accessibility Inspectorを活用した アプリのアクセシビリティ向上方法
hinakko
0
180
WantedlyでのKotlin Multiplatformの導入と課題 / Kotlin Multiplatform Implementation and Challenges at Wantedly
kubode
0
250
ドメイン駆動設計の実践により事業の成長スピードと保守性を両立するショッピングクーポン
lycorptech_jp
PRO
13
2.2k
[IBM TechXchange Dojo]Watson Discoveryとwatsonx.aiでRAGを実現!座学①
siyuanzh09
0
110
20250116_自部署内でAmazon Nova体験会をやってみた話
riz3f7
1
100
Evolving Architecture
rainerhahnekamp
3
260
なぜfreeeはハブ・アンド・スポーク型の データメッシュアーキテクチャにチャレンジするのか?
shinichiro_joya
2
490
Oracle Exadata Database Service(Dedicated Infrastructure):サービス概要のご紹介
oracle4engineer
PRO
0
12k
Oracle Base Database Service:サービス概要のご紹介
oracle4engineer
PRO
1
16k
2024AWSで個人的にアツかったアップデート
nagisa53
1
110
.NET AspireでAzure Functionsやクラウドリソースを統合する
tsubakimoto_s
0
190
AWS re:Invent 2024 re:Cap Taipei (for Developer): New Launches that facilitate Developer Workflow and Continuous Innovation
dwchiang
0
170
Featured
See All Featured
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
Reflections from 52 weeks, 52 projects
jeffersonlam
348
20k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
232
17k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
2k
Raft: Consensus for Rubyists
vanstee
137
6.7k
Making the Leap to Tech Lead
cromwellryan
133
9k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
Writing Fast Ruby
sferik
628
61k
Thoughts on Productivity
jonyablonski
68
4.4k
Java REST API Framework Comparison - PWX 2021
mraible
28
8.3k
Art, The Web, and Tiny UX
lynnandtonic
298
20k
GraphQLの誤解/rethinking-graphql
sonatard
68
10k
Transcript
Data doesn’t grow in tables Dealing with large sets of
documents
–An investigative reporter “We're working with 40 GB of XXX
and would like to search within the documents for certain keywords (like XXX) so we can identify XXX. Ideally we should be able to tag the docs..”
Some lingo • OCR (Optical Character Recognition) • NLP (Natural
Language Processing) • NER (Named Entity Recognition) • Regular Expressions
Cases
Exhibit A
Exhibit B
Exhibit C
Exhibit D
Tools
Tables in disguise http://tabula.nerdpower.org
Docs in a cloud http://documentcloud.org
Clustering, tagging, mining http://overview.ap.org
Let them eat PDF https://github.com/CrowData
All the visuals Jigsaw
Spoken word magic http://sayit.mysociety.org/
Whats missing? Easy-to-use ElasticSearch Commercial-grade OCR Configurable pipelines
Stefan Wehrmeyer, correctiv.org, @stefanwehrmeyer ! ! ! ! ! !
! Friedrich Lindenberg, codeforafrica.org, @pudo
None
None