Cassandra for Data Analytics Backends
αλεx π
September 24, 2015
Transcript
@ifesdjeen
Cassandra Monitoring
Precision
is not the same as
Semantics
is not the same as
Anomaly detection
Do you see the elephant being swallowed by the snake?
Agenda
Ad-hoc queries
Aggregations
Fast
Machine Learning
Step 1: parallel queries
+---------------+---------------+
| timestamp     | sequenceId    |
+---------------+---------------+
Sequence ID
- Used to avoid timestamp resolution collisions
- To ensure sub-resolution order
- Snapshot the data on overflow or timeout
- Ensures idempotence
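The (timestamp, sequenceId) pair above can be sketched as a composite key whose ordering breaks ties within one timestamp tick. This is a minimal sketch, not the author's implementation: the names `EventKey` and `assignKeys`, the field widths, and the assumption that events arrive in timestamp order are all hypothetical.

```haskell
import Data.Word (Word32, Word64)

-- | Composite key from the slide: a timestamp plus a sequence ID.
-- Field names and widths are assumptions made for this sketch.
data EventKey = EventKey
  { timestamp  :: !Word64  -- e.g. milliseconds since the epoch
  , sequenceId :: !Word32  -- tie-breaker within one timestamp tick
  } deriving (Eq, Ord, Show)

-- | Assign keys to events that arrive in timestamp order, bumping the
-- sequence ID whenever consecutive events share the same timestamp.
assignKeys :: [Word64] -> [EventKey]
assignKeys = go Nothing
  where
    go _ [] = []
    go prev (t : ts) =
      let s = case prev of
                Just (EventKey t' s') | t' == t -> s' + 1
                _                               -> 0
          k = EventKey t s
      in k : go (Just k) ts
```

The derived `Ord` instance compares the timestamp first and the sequence ID second, so two events in the same tick still sort in arrival order. Loading the file in GHCi and evaluating `assignKeys [10, 10, 11]` is one way to try it.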
Fighting Dispersion
[Diagram: Range Tables, rows ts1 to ts13]
[Diagram: Full Table Scan, Start and End markers over rows ts1 to ts13]
[Diagram: Open Range, Start and End markers over rows ts1 to ts13]
[Diagram: “Between” Range, Start and End markers over rows ts1 to ts13]
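The three access patterns in the diagrams can be sketched as a small range type over ordered keys. This is a sketch under stated assumptions: the names `Range`, `inRange` and `select` are hypothetical, an in-memory list stands in for a table, "open range" is read as a range with only one bound, and the bounds of a "between" range are treated as inclusive.

```haskell
-- | Access patterns from the slides, over an ordered key type.
data Range k
  = FullScan      -- every row
  | OpenFrom k    -- start bound only (assumed meaning of "open range")
  | OpenUntil k   -- end bound only
  | Between k k   -- both bounds, inclusive by assumption
  deriving (Show)

-- | Does a key fall inside the range?
inRange :: Ord k => Range k -> k -> Bool
inRange FullScan      _ = True
inRange (OpenFrom s)  k = k >= s
inRange (OpenUntil e) k = k <= e
inRange (Between s e) k = k >= s && k <= e

-- | Select the rows covered by a range from an in-memory stand-in
-- for a table of (key, value) rows.
select :: Ord k => Range k -> [(k, v)] -> [(k, v)]
select r = filter (inRange r . fst)
```

For example, with rows keyed 1 to 13, `select (Between 3 5)` keeps the rows keyed 3, 4 and 5, while `select FullScan` keeps all of them.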
Step 2: add some algebra (rich query API)
Stream Fusion for rich ad-hoc queries
What even is Stream Fusion?
map filter reduce
single step mapFilterReduce
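    -- 'data' is a Haskell keyword, so the element type is named 'a' here.
    data Step a cursor = Yield a !cursor
                       | Skip !cursor
                       | Done

    -- The slide's "∃s." is written 'forall' (ExistentialQuantification).
    data Stream a = forall cursor. Stream (cursor -> Step a cursor) cursor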
Stream Beginning: reading from the DB
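map
    maps :: (a -> b) -> Stream a -> Stream b
    maps f (Stream next0 c0) = Stream next c0
      where
        next cursor = case next0 cursor of
          Yield x cursor' -> Yield (f x) cursor'  -- apply f to the element, not the cursor
          Skip cursor'    -> Skip cursor'
          Done            -> Done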
filter
    filters :: (a -> Bool) -> Stream a -> Stream a
    filters p (Stream next0 c0) = Stream next c0
      where
        next cursor = case next0 cursor of
          Yield x cursor' | p x       -> Yield x cursor'
                          | otherwise -> Skip cursor'
          Skip cursor'                -> Skip cursor'
          Done                        -> Done
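reduce/fold
    foldls :: (Monoid acc) => (acc -> a -> acc) -> acc -> Stream a -> acc
    foldls f z0 (Stream next c0) = loop z0 c0
      where
        loop z cursor = case next cursor of
          Yield x cursor' -> loop (f z x) cursor'
          Skip cursor'    -> loop z cursor'
          Done            -> z  -- return the accumulated value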
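Put together, the three combinators above compose into a single traversal. The following is a minimal, self-contained sketch of that idea rather than the deck's actual code: `fromList` and `sumOfDoubledEvens` are hypothetical names, a plain list stands in for the database cursor, and the `Monoid` constraint from the slide is dropped because the fold itself does not need it.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- | One step of a stream: produce a value, skip, or finish.
data Step a s = Yield a !s | Skip !s | Done

-- | A stream is a step function plus a hidden cursor.
data Stream a = forall s. Stream (s -> Step a s) s

-- | Stream beginning. A list stands in for the database cursor here.
fromList :: [a] -> Stream a
fromList = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs

maps :: (a -> b) -> Stream a -> Stream b
maps f (Stream next0 s0) = Stream next s0
  where
    next s = case next0 s of
      Yield x s' -> Yield (f x) s'
      Skip s'    -> Skip s'
      Done       -> Done

filters :: (a -> Bool) -> Stream a -> Stream a
filters p (Stream next0 s0) = Stream next s0
  where
    next s = case next0 s of
      Yield x s' | p x       -> Yield x s'
                 | otherwise -> Skip s'
      Skip s'                -> Skip s'
      Done                   -> Done

-- | Strict left fold that drives the stream to completion.
foldls :: (acc -> a -> acc) -> acc -> Stream a -> acc
foldls f z0 (Stream next s0) = loop z0 s0
  where
    loop z s = case next s of
      Yield x s' -> let z' = f z x in z' `seq` loop z' s'
      Skip s'    -> loop z s'
      Done       -> z

-- | map, filter and reduce composed; the source is traversed once and
-- no intermediate list is built between the stages.
sumOfDoubledEvens :: [Int] -> Int
sumOfDoubledEvens = foldls (+) 0 . maps (* 2) . filters even . fromList
```

None of the stages is recursive on its own; only `foldls` loops, which is what lets the stages collapse into the "single step mapFilterReduce" named earlier.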
Append
    class Monoid a where
      mempty  :: a            -- ^ Identity of 'mappend'
      mappend :: a -> a -> a  -- ^ An associative operation
Combine
    class (Monoid intermediate) => Aggregate intermediate end where
      combine :: intermediate -> end
Count Example
    data Count = Count Int

    instance Monoid Count where
      mempty = Count 0
      mappend (Count a) (Count b) = Count $ a + b

    instance Aggregate Count Int where
      combine (Count a) = a
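One reason to split an aggregate into a monoidal intermediate state and a final `combine` step is that partial results, for instance one per parallel range query, can be merged in any grouping. The sketch below illustrates this under assumptions: the `Mean` aggregate and the `mergePartials` helper are hypothetical additions, and the `Semigroup` instances are there because GHC 8.4 and later require them for every `Monoid`.

```haskell
{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE MultiParamTypeClasses #-}

-- | Intermediate state is a Monoid; 'combine' turns it into the end result.
class Monoid intermediate => Aggregate intermediate end where
  combine :: intermediate -> end

newtype Count = Count Int deriving (Eq, Show)

instance Semigroup Count where
  Count a <> Count b = Count (a + b)

instance Monoid Count where
  mempty = Count 0

instance Aggregate Count Int where
  combine (Count a) = a

-- | A mean needs a richer intermediate state than its end result:
-- a running sum and a running count.
data Mean = Mean !Double !Int deriving (Eq, Show)

instance Semigroup Mean where
  Mean s1 n1 <> Mean s2 n2 = Mean (s1 + s2) (n1 + n2)

instance Monoid Mean where
  mempty = Mean 0 0

instance Aggregate Mean Double where
  combine (Mean _ 0) = 0                   -- empty input: report 0 by convention
  combine (Mean s n) = s / fromIntegral n

-- | Merge partial results (e.g. one per parallel query) and finish.
mergePartials :: Aggregate i e => [i] -> e
mergePartials = combine . mconcat
```

Because the result type is not determined by the intermediate type, call sites need an annotation, for example `mergePartials [Count 3, Count 4] :: Int`.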
Step 3: add some ML
Storing Models
Support Vector Machines
Hyperplane α·x - φ = 1
[ α1 α2 α3 ... αn ] ρ
Option 1: list<double>
    CREATE TABLE support_vectors(
      path varchar,
      alpha list<double>,
      phi int,
      PRIMARY KEY(path))
Problems
- High deserialisation overhead
- Need to add PK specifiers for multiple SVs
Alternative: blob & byte buffers
Vector Representation
byte address
    0    8    16   24   32   40        n*8
    +----+----+----+----+----+----+----+
    | α0 | α1 | α2 | α3 | α4 | ...| αn |
    +----+----+----+----+----+----+----+
points
Matrix Representation
row 0, base offset 0:
    0     8     16    24    32    40          n*8
    +-----+-----+-----+-----+-----+-----+-----+
    | α00 | α01 | α02 | α03 | α04 | ... | α0n |
    +-----+-----+-----+-----+-----+-----+-----+
row 1, base offset n*8:
    0     8     16    24    32    40          n*8
    +-----+-----+-----+-----+-----+-----+-----+
    | α10 | α11 | α12 | α13 | α14 | ... | α1n |
    +-----+-----+-----+-----+-----+-----+-----+
row m, base offset m*n*8:
    0     8     16    24    32    40          n*8
    +-----+-----+-----+-----+-----+-----+-----+
    | αm0 | αm1 | αm2 | αm3 | αm4 | ... | αmn |
    +-----+-----+-----+-----+-----+-----+-----+
Advantages
- No serialisation overhead
- Fast relative access
- Easy to go multi-dimensional
- Easy to implement atomic in-memory operations
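The blob layout can be sketched with a strict `ByteString`: element i of a vector sits at byte offset i * 8, and element (row, col) of a matrix at (row * nCols + col) * 8. This is an illustrative sketch, not the deck's code: all function names are hypothetical, big-endian byte order is an assumption, and a real client would read from the driver's byte buffer instead of packing a list.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word64, Word8)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- | Encode one double as 8 bytes, most significant byte first.
encodeDouble :: Double -> [Word8]
encodeDouble d = [fromIntegral (w `shiftR` (8 * i)) | i <- [7, 6 .. 0]]
  where
    w = castDoubleToWord64 d

-- | Pack a vector of doubles into a blob: element i lives at byte i * 8.
packVector :: [Double] -> BS.ByteString
packVector = BS.pack . concatMap encodeDouble

-- | Read the double stored at a byte offset without decoding the rest.
readDoubleAt :: BS.ByteString -> Int -> Double
readDoubleAt blob off = castWord64ToDouble (foldl step 0 [0 .. 7])
  where
    step :: Word64 -> Int -> Word64
    step acc i = (acc `shiftL` 8) .|. fromIntegral (BS.index blob (off + i))

-- | Element i of a packed vector.
vectorAt :: BS.ByteString -> Int -> Double
vectorAt blob i = readDoubleAt blob (i * 8)

-- | Pack a matrix row by row; all rows are assumed to have equal length.
packMatrix :: [[Double]] -> BS.ByteString
packMatrix = packVector . concat

-- | Element (row, col) of a packed matrix with nCols columns per row.
matrixAt :: Int -> BS.ByteString -> Int -> Int -> Double
matrixAt nCols blob row col = readDoubleAt blob ((row * nCols + col) * 8)
```

Reading a single coefficient touches only its 8 bytes, which is the "fast relative access" advantage listed above; going from one dimension to two only changes the offset arithmetic.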
Bayesian Classifiers
P(X | blue) = (Number of blue near X) / (Total number of blue)
P(X | red) = (Number of red near X) / (Total number of red)
[[Mean(x1), Var(x1)] [Mean(x2), Var(x2)] ... [Mean(xn), Var(xn)]]
byte address
    0          8          16
    +----------+----------+
    | Mean(x0) | Var(x0)  |
    +----------+----------+
    16         24         32
    +----------+----------+
    | Mean(x1) | Var(x1)  |
    +----------+----------+
    2n*8       (2n+1)*8
    +----------+----------+
    | Mean(xn) | Var(xn)  |
    +----------+----------+
payloads
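Storing a mean and a variance per feature is enough to evaluate a Gaussian naive Bayes classifier. The sketch below assumes that this is the intended model, which the slides do not state outright; the names are hypothetical, class priors are taken to be equal, and a list of pairs stands in for values read from the blob at the offsets shown above.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- | Byte offsets of feature i in the flat [mean, variance] layout.
meanOffset, varOffset :: Int -> Int
meanOffset i = 2 * i * 8
varOffset  i = (2 * i + 1) * 8

-- | Per-feature (mean, variance) pairs for one class.
type Params = [(Double, Double)]

-- | Log-density of a normal distribution with the given mean and variance.
logGaussian :: Double -> Double -> Double -> Double
logGaussian mean var x =
  negate (log (2 * pi * var)) / 2 - (x - mean) ^ (2 :: Int) / (2 * var)

-- | Log-likelihood of a feature vector under one class, treating
-- features as independent.
logLikelihood :: Params -> [Double] -> Double
logLikelihood params xs =
  sum (zipWith (\(m, v) x -> logGaussian m v x) params xs)

-- | Pick the class with the highest log-likelihood (equal priors assumed).
-- The list of classes must not be empty.
classify :: [(label, Params)] -> [Double] -> label
classify classes xs =
  fst (maximumBy (comparing (\(_, ps) -> logLikelihood ps xs)) classes)
```

With one feature, a "blue" class at mean 0 and a "red" class at mean 5 (both with variance 1), a point at 4.2 is classified as "red".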
Step 4: make it rocket-fast
Approximate Data Structures
Bloom Filters are basically long arrays / vectors
BitSet
bit address
    0                               8
    +---+---+---+---+---+---+---+---+
    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    +---+---+---+---+---+---+---+---+
    8                               16
    +---+---+---+---+---+---+---+---+
    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    +---+---+---+---+---+---+---+---+
    16                              24
    +---+---+---+---+---+---+---+---+
    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    +---+---+---+---+---+---+---+---+
    24                              32
    +---+---+---+---+---+---+---+---+
    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    +---+---+---+---+---+---+---+---+
Advantages
- 64 bits per 8-byte Long
- Easy to represent as a long array using offsets, bit shifts and masks
- Easy to implement atomic in-memory operations
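The "offsets, bit shifts and masks" addressing can be sketched as follows: bit i lives in word i / 64 (a shift by 6) at position i mod 64 (a mask with 63). This is a sketch only: a list of `Word64` stands in for the long array, the names are hypothetical, and the two hash functions are toy examples chosen for illustration, not ones a production Bloom filter should use.

```haskell
import Data.Bits (setBit, shiftR, testBit, (.&.))
import Data.Word (Word64)

-- | A bit set backed by 64-bit words; a list stands in for a long array.
newtype BitSet = BitSet [Word64] deriving (Eq, Show)

-- | An all-zero bit set with room for at least nBits bits.
emptyBitSet :: Int -> BitSet
emptyBitSet nBits = BitSet (replicate ((nBits + 63) `div` 64) 0)

-- | Word index and bit position: i / 64 via a shift, i mod 64 via a mask.
locate :: Int -> (Int, Int)
locate i = (i `shiftR` 6, i .&. 63)

-- | Set bit i. Indices beyond the backing words are silently ignored.
insertBit :: Int -> BitSet -> BitSet
insertBit i (BitSet ws) =
  BitSet [if j == w then setBit x b else x | (j, x) <- zip [0 ..] ws]
  where
    (w, b) = locate i

-- | Test bit i. The index must fall inside the backing words.
memberBit :: Int -> BitSet -> Bool
memberBit i (BitSet ws) = testBit (ws !! w) b
  where
    (w, b) = locate i

-- | Two toy hash functions over Int keys; purely illustrative.
hashes :: Int -> Int -> [Int]
hashes nBits x = [(x * 31 + 7) `mod` nBits, (x * 131 + 13) `mod` nBits]

-- | Bloom filter insert and lookup on top of the bit set.
bloomInsert :: Int -> Int -> BitSet -> BitSet
bloomInsert nBits x bs = foldr insertBit bs (hashes nBits x)

bloomMember :: Int -> Int -> BitSet -> Bool
bloomMember nBits x bs = all (`memberBit` bs) (hashes nBits x)
```

A lookup can report a key that was never inserted (a false positive) when other keys have set the same bits, but it never misses a key that was inserted.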
Count-min sketches are basically int matrices
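A count-min sketch as an integer matrix can be sketched like this: each row has its own hash function, an update increments one counter per row, and the estimate is the minimum across rows. The names are hypothetical, nested lists stand in for a flat int matrix, and the per-row hash is a toy function used only for illustration.

```haskell
-- | A sketch with a fixed width and one list of counters per row.
data Sketch = Sketch Int [[Int]] deriving (Eq, Show)

-- | An all-zero sketch with the given depth (rows) and width (columns).
emptySketch :: Int -> Int -> Sketch
emptySketch depth w = Sketch w (replicate depth (replicate w 0))

-- | One toy hash per row, derived from the row index; illustrative only.
column :: Int -> Int -> Int -> Int
column w row x = (x * (2 * row + 3) + 17 * row + 1) `mod` w

-- | Record one occurrence of x: bump one counter in every row.
add :: Int -> Sketch -> Sketch
add x (Sketch w rs) = Sketch w (zipWith bump [0 ..] rs)
  where
    bump row counters =
      [if c == column w row x then n + 1 else n | (c, n) <- zip [0 ..] counters]

-- | Estimated count of x: the minimum over the rows. Collisions can only
-- inflate a counter, so the estimate never undercounts.
-- The sketch must have at least one row.
estimate :: Int -> Sketch -> Int
estimate x (Sketch w rs) =
  minimum (zipWith (\row counters -> counters !! column w row x) [0 ..] rs)
```

More rows lower the chance that every counter for a key is inflated by collisions; a wider matrix lowers the chance of any single collision.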
Histograms are basically long vectors
Conclusions
- Ad-hoc queries
- Parallelism
- Lightweight DSs representation
- Optimisations and good API fits
@ifesdjeen