Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Let it crash - fault tolerance in Elixir/OTP
Search
Maciej Kaszubowski
September 28, 2017
Programming
0
480
Let it crash - fault tolerance in Elixir/OTP
Maciej Kaszubowski
September 28, 2017
Tweet
Share
More Decks by Maciej Kaszubowski
See All by Maciej Kaszubowski
Error-free Elixir
mkaszubowski
0
340
Modular Design in Elixir (ElixirConf EU 2019)
mkaszubowski
2
740
The Big Ball of Nouns
mkaszubowski
0
110
Modular Design in Elixir
mkaszubowski
1
390
Our three years with Elixir
mkaszubowski
0
250
Concurrency Basics for Elixir
mkaszubowski
0
120
Distributed Elixir
mkaszubowski
0
160
Software Architecture
mkaszubowski
0
140
CRDTs - The science behind Phoenix Presence
mkaszubowski
2
270
Other Decks in Programming
See All in Programming
Compose Multiplatform × AI で作る、次世代アプリ開発支援ツールの設計と実装
thagikura
0
160
速いWebフレームワークを作る
yusukebe
5
1.7k
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
380
プロパティベーステストによるUIテスト: LLMによるプロパティ定義生成でエッジケースを捉える
tetta_pdnt
0
1.7k
実用的なGOCACHEPROG実装をするために / golang.tokyo #40
mazrean
1
280
Processing Gem ベースの、2D レトロゲームエンジンの開発
tokujiros
2
130
モバイルアプリからWebへの横展開を加速した話_Claude_Code_実践術.pdf
kazuyasakamoto
0
330
Performance for Conversion! 分散トレーシングでボトルネックを 特定せよ
inetand
0
860
AI Coding Agentのセキュリティリスク:PRの自己承認とメルカリの対策
s3h
0
230
Ruby×iOSアプリ開発 ~共に歩んだエコシステムの物語~
temoki
0
320
旅行プランAIエージェント開発の裏側
ippo012
2
910
個人軟體時代
ethanhuang13
0
320
Featured
See All Featured
Bootstrapping a Software Product
garrettdimon
PRO
307
110k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
53
2.9k
Rails Girls Zürich Keynote
gr2m
95
14k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
50k
Raft: Consensus for Rubyists
vanstee
140
7.1k
What's in a price? How to price your products and services
michaelherold
246
12k
Optimizing for Happiness
mojombo
379
70k
Optimising Largest Contentful Paint
csswizardry
37
3.4k
Code Review Best Practice
trishagee
70
19k
GraphQLとの向き合い方2022年版
quramy
49
14k
Product Roadmaps are Hard
iamctodd
PRO
54
11k
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
26
3k
Transcript
LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Poznań Elixir Metup #4
(DON'T) LET IT CRASH! Fault tolerance in Elixir/OTP Poznań Elixir
Metup #4
(YOU CAN ASK QUESTIONS)
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
FAULT TOLERANCE
Elixir (Erlang) features ‣ Concurrent ‣ Functional ‣ Immutable state
‣ Message passing ‣ Distributed ‣ Hot upgrades
LET IT CRASH!
Let it crash!
Let it crash! ‣ Accept the fact that things fail
‣ Focus on the happy path ‣ Make failures more predictable
Let it crash! ‣ Separate the logic and error handling
‣ When something is wrong, let the process crash and let another one handle it (e.g. by restarting)
https://ferd.ca/an-open-letter-to-the-erlang-beginner-or-onlooker.html
THE TOOLS
Tools ‣ Monitors ‣ Links ‣ Supervisors ‣ Heart ‣
Distribution
Monitors pid_a ref = Process.monitor(pid_b) pid_b
Monitors pid_a ref = Process.monitor(pid_b) {:DOWN, ref, :process, pid_b, reason}
pid_b
Links pid_a Process.link(pid_b) pid_b
Links pid_a pid_b Process.link(pid_b)
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a {:EXIT, from, reason}
Links pid_b Process.link(pid_b) Process.flag(:trap_exit, true) pid_a
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor
Supervisors Worker Worker Supervisor Worker *New* process
Supervision strategies
opts = [ name: MyApp.Supervisor, ] Supervisor.start_link(children, opts)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one ] Supervisor.start_link(children, opts)
:one_for_one W S W W S W W S W
:all_for_one W S W W S W W S W
W S W
:rest_for_one W S W W W S W W W
S W W W S W W
:simple_one_for_one W S W W S W W S W
Heart ## vm.args ## Heartbeat management; auto-restarts VM if it
##dies or becomes unresponsive ## (Disabled by default use with caution!) -heart -env HEART_COMMAND ~/heart_command.sh
WHY RESTARTING WORKS
Why restarting works ‣ Independent processes ‣ Clean state ‣
Bohrbugs vs. Heisenbugs
Bohrbugs ‣ Repeatable ‣ Easy to debug ‣ Easy to
fix ‣ Rare in production ‣ Restarting doesn't help
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Heisenbugs ‣ Unpredictable ‣ Hard to debug ‣ Hard to
fix ‣ Frequent in production ‣ Restarting HELPS!
Supervisors Worker Worker Supervisor Worker *New* process
New process ‣ Clean state ‣ Predictable ‣ High chance
of fixing the bug
LIMITS
Limits ‣ :max_restarts (default: 3) ‣ :max_seconds (default: 5)
opts = [ name: MyApp.Supervisor, strategy: :one_for_one, max_restarts: 1, max_seconds:
1 ] Supervisor.start_link(children, opts) Limits
‣ Process ‣ Supervisor ‣ Node ‣ Machine Restarting
MISTAKES
‣ Poor supervision tree structure ‣ Not validating user params
‣ Not handling expected errors ‣ {:error, reason} tuples everywhere Mistakes
‣ Trying to recreate the state ‣ Timeouts ‣ Not
reading libraries source code ‣ Incorrect limits Mistakes
Expected errors {:ok, user} = Auth.authenticate(email, password) {:ok, user} =
UserService.fetch_by_id(params["id"])
Restoring the state def init(_) do state = restore_state() {:ok,
state} end def terminate(_reason, state) do save_state(state) end http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
Poor supervision structure
Stable, long-lived, important, protected Short-lived, transient, can fail
Incorrect limits
DEMO!
BENEFITS
‣ Less code (= less bugs, easier to understand, easier
to change) ‣ Less logic duplication ‣ Faster bug fixes Benefits
Less code
def update_name(user, name) do end
def update_name(user, name) do update(user, %{name: name}) end
def update_name(user, name) do case update(user, %{name: name}) do end
end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, reason} {:error, reason} end end ‣ Do you know how to handle reason? ‣ Is {:error, reason} even possible? ‣ Fatal or acceptable error?
‣ What is likely to happen? ‣ What is an
acceptable error? ‣ What do I know how to handle?
def update_name(user, name) do {:ok, _} = update(user, %{name: name})
do end
def update_name(user, name) do case update(user, %{name: name}) do {:ok,
user} {:ok, user} {:error, %{errors: [username: "cannot be blank"]}} {:error, :blank_username} end end Acceptable error
None
def update_description(transaction, user) do with \ %{receipt: receipt} transaction,
false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn with \ %{receipt:
receipt} transaction, false is_nil(receipt), {:ok, %{"id" id} Poison.decode(receipt), {:ok, %{status: 200, body: body}} Adapter.update(id, user) {:ok, _} update_db_record(id, body) do :ok end end) end
def update_description(transaction, user) do Task.Supervisor.start_child(MyApp.TaskSupervisor, fn %{"id" transaction_id}
= Poison.decode!(receipt) {:ok, %{body: body}} = Adapter.update(transaction_id, user) {:ok, _} = update_db_record(transaction_id, body) end end
Less duplicated logic
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, nil), do: {:error, :invalid_contact_id} def add_contact(current_user_id, contact_id) do
params = %{user_id: current_user_id, contact_id: contact_id} %Contact{} Contact.Changeset(params) Repo.insert() case do {:ok, contact} {:ok, contact} {:error, changeset} {:error, changeset} end end
def add_contact(current_user_id, contact_id) do params = %{user_id: current_user_id, contact_id: contact_id}
{:ok, _} = %Contact{} Contact.Changeset(params) Repo.insert() end
Faster bug fixes
def handle_info(:do_work, state) do with {:ok, data} ServiceA.fetch_data(), {:ok,
other_data} ServiceB.fetch_data() do do_some_work(data, other_data) end Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
def handle_info(:do_work, state) do {:ok, data} = ServiceA.fetch_data() {:ok, other_data}
= ServiceB.fetch_data() :ok = do_some_work(data, other_data) Process.send_after(self(), :do_work, @one_hour) {:noreply, state} end
defmodule ServiceA do def fetch_data() do {:ok, [1, 2, 3,
4, 5]} end end defmodule ServiceA do def fetch_data() do [1, 2, 3, 4, 5] end end
iex(4)> with {:ok, data} ServiceA.fetch_data, do: :ok [1, 2,
3, 4, 5] iex(6)> {:ok, data} = ServiceA.fetch_data() ** (MatchError) no match of right hand side value: [1, 2, 3, 4, 5]
[error] GenServer Fail.Worker terminating ** (MatchError) no match of right
hand side value: [1, 2, 3, 4, 5] (fail) lib/fail/worker.ex:30: Fail.Worker.handle_info/2 (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4 (stdlib) gen_server.erl:681: :gen_server.handle_msg/5 (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3 Last message: :do_work State: nil
SUMMARY
‣ Things will fail ‣ Fault tolerance isn't free ‣
Know your tools ‣ Think what you can handle ‣ Don't try to handle every possible error ‣ Think about supervision structure
‣ https://ferd.ca/the-zen-of-erlang.html ‣ https://medium.com/@jlouis666/error-kernels-9ad991200abd ‣ http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-and- crashes.html ‣ https://mazenharake.wordpress.com/2009/09/14/let-it-crash-the- right-way/
‣ http://blog.plataformatec.com.br/2016/05/beyond-functional- programming-with-elixir-and-erlang/ ‣ https://mazenharake.wordpress.com/2010/10/31/9-erlang-pitfalls- you-should-know-about/ ("Returning arbitrary {error, Reason}") ‣ http://mkaszubowski.pl/2017/09/02/On-Restoring-Process-State.html
THANK YOU! mkaszubowski94 http://mkaszubowski.pl