It's 10pm: Do You Know Where Your Writes Are?
Jeremy Mikola
October 12, 2017
Presented October 12, 2017 at MongoDB.local San Francisco.
Transcript
It’s 10PM: Do you know where your writes are? Jeremy Mikola (jmikola)
It’s 11AM: Do you know where your writes are? Jeremy Mikola (jmikola)
On the roadmap: Retryable writes; Zombie cursor cleanup; Cluster-wide killOp
Retryable Writes
You’re updating a document: db.coll.updateOne( { _id: 16 }, { $inc: { count: 1 } } );
Murphy’s Law kicks in
Is it safe to retry the update?
Did our message never make it to the server?
Did we lose the server’s reply?
Did something else happen?
There’s no way to retrieve an operation’s state
Let’s review some best practices: How To Write Resilient MongoDB Applications
This was our update… db.coll.updateOne( { _id: 16 }, { $inc: { count: 1 } } );
Errors and retry strategies:
                Transient network error   Persistent outage   Command error
Never retry     May undercount            OK                  OK
Always retry    May overcount             Wastes time         Wastes time
Retry once      May overcount             OK                  OK
There’s no good solution for transient network errors
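The "may overcount" outcome from the retry-strategy slides can be reproduced with a small simulation. This is only a sketch: `FakeServer`, `incCount`, `dropNextReply`, and `incWithOneRetry` are hypothetical names invented for illustration, not MongoDB APIs. It models the case where the write is applied but the reply is lost, so a blind retry applies the increment a second time.

```javascript
// Hypothetical in-memory stand-in for a server; not a MongoDB API.
class FakeServer {
  constructor() {
    this.docs = new Map([[16, { _id: 16, count: 0 }]]);
    // When true, the next write is applied but its reply is "lost".
    this.dropNextReply = false;
  }

  // Models { $inc: { count: n } } on the document with the given _id.
  incCount(id, n) {
    this.docs.get(id).count += n;
    if (this.dropNextReply) {
      this.dropNextReply = false;
      throw new Error("network error: reply lost");
    }
    return { matchedCount: 1, modifiedCount: 1 };
  }
}

// The "retry once" strategy: one blind retry after any error.
function incWithOneRetry(server, id, n) {
  try {
    return server.incCount(id, n);
  } catch (err) {
    return server.incCount(id, n);
  }
}

const server = new FakeServer();
server.dropNextReply = true; // the first reply is lost, but the write was applied
incWithOneRetry(server, 16, 1);
console.log(server.docs.get(16).count); // 2, although one increment was intended
```

The client cannot tell this case apart from a request that never reached the server, which is why neither "never retry" nor "retry once" is safe for `$inc`.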
We can safely retry idempotent operations:
                Transient network error   Persistent outage   Command error
Retry once      OK                        OK                  OK
Safe-to-retry inserts db.coll.insertOne( { _id: 18, name: "Alice" } );
Safe-to-retry deletes db.coll.deleteOne( { _id: 20 } ); db.coll.deleteMany( { status: "inactive" } );
Safe-to-retry updates db.coll.updateOne( { _id: 22 }, { $set: { status: "active" } } );
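What makes these operations safe to retry is that applying them twice leaves the same state as applying them once. The following sketch checks that property for the `$set` update above and for the earlier `$inc` update. `applyUpdate` and `isIdempotentFor` are hypothetical helpers that handle only top-level `$set` and `$inc` fields, which is far less than the server's real update semantics.

```javascript
// Hypothetical, minimal update applier: only top-level $set and $inc.
function applyUpdate(doc, update) {
  const result = { ...doc };
  for (const [field, value] of Object.entries(update.$set || {})) {
    result[field] = value;
  }
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    result[field] = (result[field] || 0) + amount;
  }
  return result;
}

// An update is idempotent for a document if applying it twice
// gives the same document as applying it once.
function isIdempotentFor(doc, update) {
  const once = applyUpdate(doc, update);
  const twice = applyUpdate(once, update);
  return JSON.stringify(once) === JSON.stringify(twice);
}

const doc = { _id: 22, status: "inactive", count: 0 };
const setIsIdempotent = isIdempotentFor(doc, { $set: { status: "active" } });
const incIsIdempotent = isIdempotentFor(doc, { $inc: { count: 1 } });
console.log(setIsIdempotent); // true
console.log(incIsIdempotent); // false
```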
Why can’t we retrieve an operation’s state?
In MongoDB 3.4, state is tied to connection objects
MongoDB 3.6 introduces logical sessions: Sessions allow us to maintain cluster-wide state about the user and their operations. Sessions are not tied to connections.
Retrying writes with a session [animated diagram; the only label is "update", shown in two frames]
We can trust the server to Do the Right Thing™: If the write already executed, return the result we missed. If the write never executed, do it now and return its result.
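The two rules on this slide can be sketched as a lookup keyed by session and write number. This is a simplified model under stated assumptions: `FakeSessionAwareServer` and its `completed` map are hypothetical, and the string key built from a session ID and a per-write number only stands in for whatever the server actually records.

```javascript
// Hypothetical in-memory model; names are illustrative, not MongoDB APIs.
class FakeSessionAwareServer {
  constructor() {
    this.count = 0;
    this.completed = new Map(); // "sessionId:writeNumber" -> stored result
  }

  incCount(sessionId, writeNumber, n) {
    const key = `${sessionId}:${writeNumber}`;
    if (this.completed.has(key)) {
      // The write already executed: return the result the client missed.
      return this.completed.get(key);
    }
    // The write never executed: do it now and remember its result.
    this.count += n;
    const result = { ok: 1, nModified: 1 };
    this.completed.set(key, result);
    return result;
  }
}

const server = new FakeSessionAwareServer();
const first = server.incCount("session-A", 1, 1); // imagine this reply is lost
const retry = server.incCount("session-A", 1, 1); // resent with the same identifiers
console.log(server.count);    // 1
console.log(retry === first); // true
```

Because the retry carries the same identifiers as the original attempt, the increment is applied exactly once regardless of which message was lost.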
Sessions are cluster-wide [animated diagram; the only label is "update", shown in three frames]
Taking advantage of retryable writes: mongodb://…?retryWrites=true
One down, two to go: Retryable writes; Zombie cursor cleanup; Cluster-wide killOp
Zombie Cursor Cleanup
You’re running a long query:
cursor = db.coll.find();
cursor.forEach(function() {
  // lengthy processing…
});
Cursors have a timeout: After 10 minutes, the server will close a cursor due to inactivity. Issuing a getMore resets the clock.
Disabling cursor timeouts:
cursor = db.coll.find( { }, { noCursorTimeout: true } );
cursor.forEach(function() {
  // lengthy processing…
});
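The interaction between the inactivity clock, `getMore`, and `noCursorTimeout` can be sketched with an explicit clock. `FakeCursor` is a hypothetical model written for this illustration; only the 10-minute limit and the `noCursorTimeout` option name come from the slides.

```javascript
const CURSOR_TIMEOUT_MS = 10 * 60 * 1000; // the 10-minute inactivity limit

// Hypothetical model of a server-side cursor; time is passed in explicitly.
class FakeCursor {
  constructor(now, noCursorTimeout = false) {
    this.lastUsed = now;
    this.noCursorTimeout = noCursorTimeout;
  }

  // Would an inactivity sweep close this cursor at time `now`?
  isExpired(now) {
    return !this.noCursorTimeout && now - this.lastUsed >= CURSOR_TIMEOUT_MS;
  }

  // A getMore resets the clock, provided the cursor still exists.
  getMore(now) {
    if (this.isExpired(now)) {
      throw new Error("cursor not found");
    }
    this.lastUsed = now;
  }
}

const minutes = (n) => n * 60 * 1000;

const busy = new FakeCursor(0);
busy.getMore(minutes(9));                 // activity at minute 9 resets the clock
console.log(busy.isExpired(minutes(18))); // false: only 9 idle minutes
console.log(busy.isExpired(minutes(19))); // true: 10 idle minutes since the getMore

const untimed = new FakeCursor(0, true);  // noCursorTimeout: true
console.log(untimed.isExpired(minutes(600))); // false, however long it sits idle
```

The last line is the risk: if the client disappears, nothing in this model ever closes the cursor.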
Executing our long query [animated diagram: a find followed by a series of getMore commands]
A zombie cursor is born > db.serverStatus() { "metrics": {
"cursor": { "open": { "noTimeout": 1, "total": 1
What happened last night? (from the server’s POV) [animated diagram: a find, then two getMore commands, then no further commands]
Avoiding zombie cursors with logical sessions: Sessions also have a timeout. We can associate queries with a session.
Querying with a session: session = client.startSession(); cursor = db.coll.find( { }, { session: session } );
Executing our long query [animated diagram: a find, then two getMore commands, then "session expires"]
Did we just punt on the timeout issue?
Session timeouts are non-negotiable: Idle sessions will expire. Any operation using the session resets the clock.
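A rough model of how session expiry cleans up cursors: each cursor is tied to a session, any operation refreshes that session, and a periodic sweep drops idle sessions together with their cursors. `FakeSessionServer`, `touch`, and `reap` are hypothetical names, and the 30-minute timeout used below is an assumed value for the demonstration, since the slides do not state one.

```javascript
// Hypothetical model: idle sessions expire and take their cursors with them.
class FakeSessionServer {
  constructor(sessionTimeoutMs) {
    this.sessionTimeoutMs = sessionTimeoutMs;
    this.sessions = new Map(); // sessionId -> time of last use
    this.cursors = new Map();  // cursorId -> owning sessionId
  }

  // Any operation that uses the session resets its clock.
  touch(sessionId, now) {
    this.sessions.set(sessionId, now);
  }

  find(sessionId, cursorId, now) {
    this.touch(sessionId, now);
    this.cursors.set(cursorId, sessionId);
  }

  // Periodic sweep: drop idle sessions and every cursor they own.
  reap(now) {
    for (const [sessionId, lastUsed] of this.sessions) {
      if (now - lastUsed >= this.sessionTimeoutMs) {
        this.sessions.delete(sessionId);
        for (const [cursorId, owner] of this.cursors) {
          if (owner === sessionId) this.cursors.delete(cursorId);
        }
      }
    }
  }
}

const minutes = (n) => n * 60 * 1000;
const server = new FakeSessionServer(minutes(30)); // assumed timeout for the demo

server.find("session-A", "cursor-1", 0);
server.reap(minutes(29));
console.log(server.cursors.size); // 1: the session is still within its timeout

server.reap(minutes(30));         // the client vanished and never used the session again
console.log(server.cursors.size); // 0: the would-be zombie cursor is gone
```

In this model a long-running but healthy client keeps its cursor alive simply by continuing to use the session, while an abandoned cursor cannot outlive it.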
Two down, one to go: Retryable writes; Zombie cursor cleanup; Cluster-wide killOp
Cluster-wide killOp
You’re running an operation that may never complete:
cursor = db.coll.find(
  { … } // table scans for days
);
You’ve made a terrible mistake
Step 1: Find the operation ID > db.currentOp() { "inprog"
: [ { "desc" : "conn2", "threadId" : "140181791471360", "connectionId" : 2, "client" : "127.0.0.1:49456", "appName" : "MongoDB Shell", "active" : true, "opid" : 132921,
Step 2: Kill the operation ID > db.killOp(132921) { "info":
"attempting to kill op", "ok": 1 }
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" : [ // … ] } > db.killOp(…)
How did this happen? [animated diagram: mongos, shard 1, shard 2, shard 3]
Cluster-wide killOp with logical sessions: Any operation may be associated with a session. Terminating a session will end all of its associated operations.
Terminating a session:
session = client.startSession();
cursor = db.coll.find(
  { … }, // table scans for days
  { session: session }
);
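The difference between killing by operation ID and terminating a session can be sketched with a toy cluster. Everything here (`FakeShard`, `FakeCluster`, `killSession`, `activeOps`) is a hypothetical model for illustration, not a MongoDB API. It assumes a scatter-gather query that starts one operation on every shard, each tagged with the session that issued it.

```javascript
// Hypothetical model of a sharded cluster; names are illustrative only.
class FakeShard {
  constructor(name) {
    this.name = name;
    this.nextOpId = 1;
    this.inprog = []; // entries of the form { opid, sessionId }
  }

  run(sessionId) {
    const op = { opid: this.nextOpId++, sessionId };
    this.inprog.push(op);
    return op.opid;
  }

  // Kill by operation ID: affects this shard only.
  killOp(opid) {
    this.inprog = this.inprog.filter((op) => op.opid !== opid);
  }

  // Kill everything on this shard that belongs to the given session.
  killSession(sessionId) {
    this.inprog = this.inprog.filter((op) => op.sessionId !== sessionId);
  }
}

class FakeCluster {
  constructor(shards) {
    this.shards = shards;
  }

  // A scatter-gather query starts one operation per shard.
  find(sessionId) {
    this.shards.forEach((shard) => shard.run(sessionId));
  }

  // Terminating a session reaches every shard in one step.
  killSession(sessionId) {
    this.shards.forEach((shard) => shard.killSession(sessionId));
  }

  activeOps() {
    return this.shards.reduce((n, shard) => n + shard.inprog.length, 0);
  }
}

const cluster = new FakeCluster(
  ["shard 1", "shard 2", "shard 3"].map((name) => new FakeShard(name))
);
cluster.find("session-A");
cluster.find("session-B");

cluster.shards[0].killOp(1);      // kills session-A's operation on shard 1 only
console.log(cluster.activeOps()); // 5: session-A is still running on two shards

cluster.killSession("session-A");
console.log(cluster.activeOps()); // 3: only session-B's operations remain
```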
Querying with sessions [animated diagram: mongos, shard 1, shard 2, shard 3]
That’s a wrap: Retryable writes; Zombie cursor cleanup; Cluster-wide killOp
One last point
Resilience is primarily the driver’s domain: Server discovery and monitoring; Elections and failover recovery; Load-balancing mongos connections; Routing queries by read preference
Addressing resilience on the server side: Tracking operation state; Cluster-wide sessions
Providing a relatively easy upgrade path: No need to rewrite applications; Opting in to retryable writes; New API for client session objects; Pass session option as needed
Inside the spec process: mongodb/specifications (/sessions, /retryable-writes, /causal-consistency, /retryable-reads)
In the meantime… How To Write Resilient MongoDB Applications
Thanks!