Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Let it crash - fault tolerance in Elixir/OTP
Search
Maciej Kaszubowski
September 28, 2017
Programming
0
460
Let it crash - fault tolerance in Elixir/OTP
Maciej Kaszubowski
September 28, 2017
Tweet
Share
More Decks by Maciej Kaszubowski
See All by Maciej Kaszubowski
Error-free Elixir
mkaszubowski
0
320
Modular Design in Elixir (ElixirConf EU 2019)
mkaszubowski
2
720
The Big Ball of Nouns
mkaszubowski
0
95
Modular Design in Elixir
mkaszubowski
1
380
Our three years with Elixir
mkaszubowski
0
240
Concurrency Basics for Elixir
mkaszubowski
0
120
Distributed Elixir
mkaszubowski
0
140
Software Architecture
mkaszubowski
0
130
CRDTs - The science behind Phoenix Presence
mkaszubowski
2
260
Other Decks in Programming
See All in Programming
セキュリティマネジャー廃止とクラウドネイティブ型サンドボックス活用
kazumura
1
170
人には人それぞれのサービス層がある
shimabox
3
670
赤裸々に公開。 TSKaigiのオフシーズン
takezoux2
0
130
KotlinConf 2025 現地で感じたServer-Side Kotlin
n_takehata
1
200
Javaのルールをねじ曲げろ!禁断の操作とその代償から学ぶメタプログラミング入門 / A Guide to Metaprogramming: Lessons from Forbidden Techniques and Their Price
nrslib
3
1.9k
Datadog RUM 本番導入までの道
shinter61
1
250
既存デザインを変更せずにタップ領域を広げる方法
tahia910
1
130
Development of an App for Intuitive AI Learning - Blockly Summit 2025
teba_eleven
0
110
SODA - FACT BOOK
sodainc
1
840
生成AIで日々のエラー調査を進めたい
yuyaabo
0
520
Beyond Portability: Live Migration for Evolving WebAssembly Workloads
chikuwait
0
340
データベースコネクションプール(DBCP)の変遷と理解
fujikawa8
1
250
Featured
See All Featured
Writing Fast Ruby
sferik
628
61k
[RailsConf 2023] Rails as a piece of cake
palkan
55
5.6k
Reflections from 52 weeks, 52 projects
jeffersonlam
349
20k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
34
3k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
32
5.9k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
137
34k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
667
120k
Building a Modern Day E-commerce SEO Strategy
aleyda
41
7.3k
Large-scale JavaScript Application Architecture
addyosmani
512
110k
BBQ
matthewcrist
89
9.7k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Building a Scalable Design System with Sketch
lauravandoore
462
33k
Transcript
LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Fault tolerance in Elixir/OTP Poznań Elixir
Metup #4
(YOU CAN ASK QUESTIONS)
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
FAULT TOLERANCE
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
LET IT CRASH!
Let it crash!
Let it crash! ‣ Accept the fact that things fail
‣ Focus on the happy path ‣ Make failures more predictable
Let it crash! ‣ Separate the logic and error handling
‣ When something is wrong, let the process crash and let another one handle it (e.g. by restarting)
https://ferd.ca/an-open-letter-to-the-erlang-beginner-or-onlooker.html
THE TOOLS
Tools ‣ Monitors ‣ Links ‣ Supervisors ‣ Heart ‣
Distribution
Monitors pid_a ref = Process.monitor(pid_b) pid_b
Monitors pid_a ref = Process.monitor(pid_b) {:DOWN, ref, :process, pid_b, reason}
pid_b
Links pid_a Process.link(pid_b) pid_b
Links pid_a pid_b Process.link(pid_b)
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a {:EXIT, from, reason}
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor Worker *New* process
Supervision strategies
opts = [ name: MyApp.Supervisor, ] Supervisor.start_link(children, opts)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one ] Supervisor.start_link(children, opts)
:one_for_one W S W W S W W S W
:all_for_one W S W W S W W S W
W S W
:rest_for_one W S W W W S W W W
S W W W S W W
:simple_one_for_one W S W W S W W S W
Heart ## vm.args ## Heartbeat management; auto-restarts VM if it
##dies or becomes unresponsive ## (Disabled by default use with caution!) -heart -env HEART_COMMAND ~/heart_command.sh
WHY RESTARTING WORKS
Why restarting works ‣ Independent processes ‣ Clean state ‣
Bohrbugs vs. Heisenbugs
Bohrbugs ‣ Repeatable ‣ Easy to debug ‣ Easy to
fix ‣ Rare in production ‣ Restarting doesn't help
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Supervisors Worker Worker Supervisor Worker *New* process
New process ‣ Clean state ‣ Predictable ‣ High chance
of fixing the bug
LIMITS
Limits ‣ :max_restarts (default: 3) ‣ :max_seconds (default: 5)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one, max_restarts: 1, max_seconds:
1 ] Supervisor.start_link(children, opts) Limits
‣ Process ‣ Supervisor ‣ Node ‣ Machine Restarting
MISTAKES
‣ Poor supervision tree structure ‣ Not validating user params
‣ Not handling expected errors ‣ {:error, reason} tuples everywhere Mistakes
‣ Trying to recreate the state ‣ Timeouts ‣ Not
reading libraries source code ‣ Incorrect limits Mistakes
Expected errors {:ok, user} = Auth.authenticate(email, password) {:ok, user} =
UserService.fetch_by_id(params["id"])
Restoring the state def init(_) do state = restore_state() {:ok,
state} end def terminate(_reason, state) do save_state(state) end http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
Poor supervision structure
Stable, long-lived, important, protected Short-lived, transient, can fail
Incorrect limits
DEMO!
BENEFITS
‣ Less code (= less bugs, easier to understand, easier
to change) ‣ Less logic duplication ‣ Faster bug fixes Benefits
Less code
def update_name(user, name) do end
def update_name(user, name) do update(user, %{name: name}) end
def update_name(user, name) do case update(user, %{name: name}) do end
end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end ‣ Do you know how to handle reason? ‣ Is {:error, reason} even possible? ‣ Fatal or acceptable error?
‣ What is likely to happen? ‣ What is an
acceptable error? ‣ What do I know how to handle?
def update_name(user, name) do {:ok, _} = update(user, %{name: name})
do end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, %{errors: [username: "cannot be blank"]}} {:error, :blank_username} end end Acceptable error
None
def update_description(transaction, user) do with \ %{receipt: receipt} transaction,
false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn %{"id" transaction_id}
= Poison.decode!(receipt) {:ok, %{body: body}} = Adapter.update(transaction_id, user) {:ok, _} = update_db_record(transaction_id, body) end end
Less duplicated logic
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, contact_id) do params = %{user_id: current_user_id, contact_id: contact_id}
{:ok, _} = %Contact{} Contact.Changeset(params) Repo.insert() end
Faster bug fixes
def handle_info(:do_work, state) do with {:ok, data} ServiceA.fetch_data(), {:ok,
other_data} ServiceB.fetch_data() do do_some_work(data, other_data) end Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
def handle_info(:do_work, state) do {:ok, data} = ServiceA.fetch_data() {:ok, other_data}
= ServiceB.fetch_data() :ok = do_some_work(data, other_data) Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
defmodule ServiceA do def fetch_data() do {:ok, [1, 2, 3,
4, 5]} end end defmodule ServiceA do def fetch_data() do [1, 2, 3, 4, 5] end end
iex(4)> with {:ok, data} ServiceA.fetch_data, do: :ok [1, 2,
3, 4, 5] iex(6)> {:ok, data} = ServiceA.fetch_data() ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
[error] GenServer Fail.Worker terminating ** (MatchError) no match of right
hand side value: [1, 2, 3, 4, 5] (fail) lib/fail/worker.ex:30: Fail.Worker.handle_info/2 (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4 (stdlib) gen_server.erl:681: :gen_server.handle_msg/5 (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3 Last message: :do_work State: nil
SUMMARY
‣ Things will fail ‣ Fault tolerance isn't free ‣
Know your tools ‣ Think what you can handle ‣ Don't try to handle every possible error ‣ Think about supervision structure
‣ https://ferd.ca/the-zen-of-erlang.html ‣ https://medium.com/@jlouis666/error-kernels-9ad991200abd ‣ http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-and- crashes.html ‣ https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the- right-way/
‣ http://blog.plataformatec.com.br/2016/05/beyond-functional- programming-with-elixir-and-erlang/ ‣ https://mazenharake.wordpress.com/2010/10/31/9-erlang-pitfalls- you-should-know-about/ ("Returning arbitrary {error, Reason}") ‣ http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
THANK YOU! mkaszubowski94 http://mkaszubowski.pl