Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Let it crash - fault tolerance in Elixir/OTP
Search
Maciej Kaszubowski
September 28, 2017
Programming
0
480
Let it crash - fault tolerance in Elixir/OTP
Maciej Kaszubowski
September 28, 2017
Tweet
Share
More Decks by Maciej Kaszubowski
See All by Maciej Kaszubowski
Error-free Elixir
mkaszubowski
0
340
Modular Design in Elixir (ElixirConf EU 2019)
mkaszubowski
2
740
The Big Ball of Nouns
mkaszubowski
0
110
Modular Design in Elixir
mkaszubowski
1
390
Our three years with Elixir
mkaszubowski
0
250
Concurrency Basics for Elixir
mkaszubowski
0
120
Distributed Elixir
mkaszubowski
0
160
Software Architecture
mkaszubowski
0
140
CRDTs - The science behind Phoenix Presence
mkaszubowski
2
270
Other Decks in Programming
See All in Programming
HTMLの品質ってなんだっけ? “HTMLクライテリア”の設計と実践
unachang113
4
2.8k
デザイナーが Androidエンジニアに 挑戦してみた
874wokiite
0
320
アルテニア コンサル/ITエンジニア向け 採用ピッチ資料
altenir
0
100
AIと私たちの学習の変化を考える - Claude Codeの学習モードを例に
azukiazusa1
10
3.9k
CloudflareのChat Agent Starter Kitで簡単!AIチャットボット構築
syumai
2
480
Swift Updates - Learn Languages 2025
koher
2
470
Zendeskのチケットを Amazon Bedrockで 解析した
ryokosuge
3
300
AIでLINEスタンプを作ってみた
eycjur
1
230
[FEConf 2025] 모노레포 절망편, 14개 레포로 부활하기까지 걸린 1년
mmmaxkim
0
1.6k
testingを眺める
matumoto
1
140
開発チーム・開発組織の設計改善スキルの向上
masuda220
PRO
20
11k
アセットのコンパイルについて
ojun9
0
120
Featured
See All Featured
The Illustrated Children's Guide to Kubernetes
chrisshort
48
50k
The Language of Interfaces
destraynor
161
25k
Bootstrapping a Software Product
garrettdimon
PRO
307
110k
Fantastic passwords and where to find them - at NoRuKo
philnash
52
3.4k
GraphQLの誤解/rethinking-graphql
sonatard
72
11k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
8
920
Fireside Chat
paigeccino
39
3.6k
Making the Leap to Tech Lead
cromwellryan
135
9.5k
Intergalactic Javascript Robots from Outer Space
tanoku
272
27k
Why You Should Never Use an ORM
jnunemaker
PRO
59
9.5k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
51
5.6k
Transcript
LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Fault tolerance in Elixir/OTP Poznań Elixir
Metup #4
(YOU CAN ASK QUESTIONS)
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
FAULT TOLERANCE
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
LET IT CRASH!
Let it crash!
Let it crash! ‣ Accept the fact that things fail
‣ Focus on the happy path ‣ Make failures more predictable
Let it crash! ‣ Separate the logic and error handling
‣ When something is wrong, let the process crash and let another one handle it (e.g. by restarting)
https://ferd.ca/an-open-letter-to-the-erlang-beginner-or-onlooker.html
THE TOOLS
Tools ‣ Monitors ‣ Links ‣ Supervisors ‣ Heart ‣
Distribution
Monitors pid_a ref = Process.monitor(pid_b) pid_b
Monitors pid_a ref = Process.monitor(pid_b) {:DOWN, ref, :process, pid_b, reason}
pid_b
Links pid_a Process.link(pid_b) pid_b
Links pid_a pid_b Process.link(pid_b)
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a {:EXIT, from, reason}
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor Worker *New* process
Supervision strategies
opts = [ name: MyApp.Supervisor, ] Supervisor.start_link(children, opts)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one ] Supervisor.start_link(children, opts)
:one_for_one W S W W S W W S W
:all_for_one W S W W S W W S W
W S W
:rest_for_one W S W W W S W W W
S W W W S W W
:simple_one_for_one W S W W S W W S W
Heart ## vm.args ## Heartbeat management; auto-restarts VM if it
##dies or becomes unresponsive ## (Disabled by default use with caution!) -heart -env HEART_COMMAND ~/heart_command.sh
WHY RESTARTING WORKS
Why restarting works ‣ Independent processes ‣ Clean state ‣
Bohrbugs vs. Heisenbugs
Bohrbugs ‣ Repeatable ‣ Easy to debug ‣ Easy to
fix ‣ Rare in production ‣ Restarting doesn't help
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Supervisors Worker Worker Supervisor Worker *New* process
New process ‣ Clean state ‣ Predictable ‣ High chance
of fixing the bug
LIMITS
Limits ‣ :max_restarts (default: 3) ‣ :max_seconds (default: 5)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one, max_restarts: 1, max_seconds:
1 ] Supervisor.start_link(children, opts) Limits
‣ Process ‣ Supervisor ‣ Node ‣ Machine Restarting
MISTAKES
‣ Poor supervision tree structure ‣ Not validating user params
‣ Not handling expected errors ‣ {:error, reason} tuples everywhere Mistakes
‣ Trying to recreate the state ‣ Timeouts ‣ Not
reading libraries source code ‣ Incorrect limits Mistakes
Expected errors {:ok, user} = Auth.authenticate(email, password) {:ok, user} =
UserService.fetch_by_id(params["id"])
Restoring the state def init(_) do state = restore_state() {:ok,
state} end def terminate(_reason, state) do save_state(state) end http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
Poor supervision structure
Stable, long-lived, important, protected Short-lived, transient, can fail
Incorrect limits
DEMO!
BENEFITS
‣ Less code (= less bugs, easier to understand, easier
to change) ‣ Less logic duplication ‣ Faster bug fixes Benefits
Less code
def update_name(user, name) do end
def update_name(user, name) do update(user, %{name: name}) end
def update_name(user, name) do case update(user, %{name: name}) do end
end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end ‣ Do you know how to handle reason? ‣ Is {:error, reason} even possible? ‣ Fatal or acceptable error?
‣ What is likely to happen? ‣ What is an
acceptable error? ‣ What do I know how to handle?
def update_name(user, name) do {:ok, _} = update(user, %{name: name})
do end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, %{errors: [username: "cannot be blank"]}} {:error, :blank_username} end end Acceptable error
None
def update_description(transaction, user) do with \ %{receipt: receipt} transaction,
false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn %{"id" transaction_id}
= Poison.decode!(receipt) {:ok, %{body: body}} = Adapter.update(transaction_id, user) {:ok, _} = update_db_record(transaction_id, body) end end
Less duplicated logic
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, contact_id) do params = %{user_id: current_user_id, contact_id: contact_id}
{:ok, _} = %Contact{} Contact.Changeset(params) Repo.insert() end
Faster bug fixes
def handle_info(:do_work, state) do with {:ok, data} ServiceA.fetch_data(), {:ok,
other_data} ServiceB.fetch_data() do do_some_work(data, other_data) end Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
def handle_info(:do_work, state) do {:ok, data} = ServiceA.fetch_data() {:ok, other_data}
= ServiceB.fetch_data() :ok = do_some_work(data, other_data) Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
defmodule ServiceA do def fetch_data() do {:ok, [1, 2, 3,
4, 5]} end end defmodule ServiceA do def fetch_data() do [1, 2, 3, 4, 5] end end
iex(4)> with {:ok, data} ServiceA.fetch_data, do: :ok [1, 2,
3, 4, 5] iex(6)> {:ok, data} = ServiceA.fetch_data() ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
[error] GenServer Fail.Worker terminating ** (MatchError) no match of right
hand side value: [1, 2, 3, 4, 5] (fail) lib/fail/worker.ex:30: Fail.Worker.handle_info/2 (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4 (stdlib) gen_server.erl:681: :gen_server.handle_msg/5 (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3 Last message: :do_work State: nil
SUMMARY
‣ Things will fail ‣ Fault tolerance isn't free ‣
Know your tools ‣ Think what you can handle ‣ Don't try to handle every possible error ‣ Think about supervision structure
‣ https://ferd.ca/the-zen-of-erlang.html ‣ https://medium.com/@jlouis666/error-kernels-9ad991200abd ‣ http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-and- crashes.html ‣ https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the- right-way/
‣ http://blog.plataformatec.com.br/2016/05/beyond-functional- programming-with-elixir-and-erlang/ ‣ https://mazenharake.wordpress.com/2010/10/31/9-erlang-pitfalls- you-should-know-about/ ("Returning arbitrary {error, Reason}") ‣ http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
THANK YOU! mkaszubowski94 http://mkaszubowski.pl