get added/removed all the time ๏ services can discover each other ๏ services talk to each other via RPC/API ๏ machines go down/become unreachable ๏ services crash/become unresponsive ๏ you will see all sorts of weirdness
๏ loose coupling ๏ separation of concerns ๏ independently scalable services ๏ easy to write new services ๏ in different languages ๏ bugs & failures are more contained
release and roll out new versions for a service? how do you reschedule/move services between machines? where do you store application state? how do you move data between machines?
you discover new/removed instances for services? what is the target uptime/SLO (service level objective) for each service? how do you monitor service health? how and when do you alert humans?
it to deploy services? how do you respond to incidents? ๏ how (fast) do you identify the problem? ๏ how (fast) do you mitigate/repair? ๏ how do you debug or troubleshoot?
better than diverse ๏ A practice is easy to change ๏ Prevents useless discussions ๏ Have a single practice for everything: retry policy, secrets management, deployment tool, build system, test framework, OS/distro, RPC protocol, log format, monitoring software, …
๏ best language is what the team speaks ๏ (elegant & maintainable) > fast ๏ you don’t need fast ๏ you don’t need scalable ๏ hardware is cheaper than developers
you have a pool of machines (cluster) ๏ a set of services/tasks you want to run (and keep running, or periodically) ๏ you need an orchestrator/scheduler:
an intern deploy easily too? how confidently can you deploy? ๏ do you have enough tests? can you deploy without downtime? how long does it take to deploy all your company’s microservices?
you update configuration? can you redeploy all your stack as it was on a particular date? are your builds signed? does your code work the same on all your machines (hardware/OS etc)?
a commit from the source tree; not a dev machine. ๏ have a homogeneous execution environment on your machines ๏ same version of: pkgs, kernel, distro ๏ use Docker for reproducibility
running microservice in your cluster in your logs. ๏ run your microservices on read-only filesystems to prevent contamination. ๏ have homogeneous configuration for all instances of a service (etcd/consul/zk…)
(and automate) ๏ invest in tools that give you confidence ๏ conduct deployment drills and you will discover previously unknown bugs, unscripted deployment steps, and pain points
it works ๏ A correctly working program is a very special case. Failure is the default. ๏ Have massive visibility into your systems. ๏ An intern should be able to query anything about your system very easily. ๏ Monitoring is cheap. Being blind to an outage is expensive.
a lot more dimensions such as “version”, “instanceid”
http_requests{code=200, handler=new_user, method=get, version=2.0, id=3aebf531} 5310
http_requests{code=500, handler=new_user, method=get, version=2.0, id=3aebf531} 4
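The idea above — one counter name, many label dimensions — can be sketched without any metrics library. This is a toy stand-in (the class name `LabeledCounter` is an assumption, not a real API); Prometheus-style client libraries expose the same model.

```python
from collections import Counter

class LabeledCounter:
    """Counts events per unique combination of label values.

    Each distinct label set (code, handler, version, ...) becomes its
    own time series, exactly like the http_requests samples above.
    """
    def __init__(self, name):
        self.name = name
        self.samples = Counter()

    def inc(self, amount=1, **labels):
        # Sort label pairs so the same labels always map to the same series.
        key = tuple(sorted(labels.items()))
        self.samples[key] += amount

http_requests = LabeledCounter("http_requests")
http_requests.inc(code=200, handler="new_user", method="get", version="2.0")
http_requests.inc(code=500, handler="new_user", method="get", version="2.0")
```

Adding a dimension later (e.g. `instanceid`) is just another keyword argument; no schema change is needed.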
store results from counters. ๏ OpenTSDB, Graphite ๏ query: find the total error count for a specific region in the last 5 minutes: http_requests{code=500, service=search, region=westus}[5m]
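What such a time-windowed, label-filtered query does can be sketched in a few lines. This is a toy stand-in for a TSDB, assuming samples are `(timestamp, labels_dict, value)` tuples — the function name `query` and that layout are assumptions for illustration only.

```python
import time

def query(samples, window_seconds, **labels):
    """Sum sample values whose labels match, within the trailing window.

    Mimics a query like: http_requests{code=500, service=search}[5m]
    over an in-memory list instead of a real store like OpenTSDB.
    """
    cutoff = time.time() - window_seconds
    return sum(
        value
        for ts, sample_labels, value in samples
        if ts >= cutoff
        and all(sample_labels.get(k) == v for k, v in labels.items())
    )
```

For the slide's example, the call would be `query(samples, 300, code=500, service="search", region="westus")`.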
you SSH into PROD machines, you are totally doing it wrong. ๏ SSH is un-auditable (you cannot track what an engineer does on a machine) ๏ humans contaminate servers and break homogeneity
๏ use structured logging ๏ use open source tools for log collection, storage and querying. ๏ store logs forever if you can, for further analysis or auditing. otherwise logrotate.
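A minimal sketch of structured logging with only the standard library: each log line is one JSON object, so collection and querying tools can filter on fields instead of grepping free text. The `fields` attribute name is an assumption of this sketch, not a standard `logging` convention.

```python
import json
import logging
import sys

class JSONFormatter(logging.Formatter):
    """Emit one self-describing JSON object per log line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Extra structured fields attached via `extra={"fields": {...}}`.
            **getattr(record, "fields", {}),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment failed", extra={"fields": {"order_id": "o-123", "retries": 2}})
```

Because every line is valid JSON, the same logs can be shipped to open-source stacks (e.g. collectors that index JSON) without a custom parser per service.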
the correlation ID around to retrieve logs about a request from all services. ๏ You can put correlation IDs in headers. ๏ Attach user parameters to request contexts, measure latency per parameter, and identify outliers.
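Propagating a correlation ID via headers can be sketched as two small helpers: reuse the caller's ID if present, mint one at the edge otherwise, and copy it onto every downstream call. The header name `X-Correlation-ID` and the function names are assumptions; any stable, agreed-upon name works.

```python
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; pick one and standardize

def inbound(headers):
    """Reuse the caller's correlation ID, or mint one at the system edge."""
    return headers.get(HEADER) or str(uuid.uuid4())

def outbound(headers, correlation_id):
    """Stamp the ID onto a downstream RPC so logs from all services join up."""
    out = dict(headers)
    out[HEADER] = correlation_id
    return out
```

Every service logs the ID with each message; querying the log store for one ID then reconstructs the whole request path across services.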
will bring down the entire service. ๏ Have knobs/flags to disable features in production through configuration. ๏ When a bad deployment happens (in a rolling upgrade fashion) have ways to flip traffic to the old deployment. ๏ Search: blue/green deployments
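A feature knob can be as small as one configuration check guarding the risky code path; flipping the flag in the config store disables the feature without a redeploy. The flag name, function names, and ranking stubs below are all hypothetical.

```python
def old_ranking(results):
    """Stable, battle-tested path."""
    return sorted(results)

def new_ranking(results):
    """New, riskier path that the flag guards."""
    return sorted(results, reverse=True)

def ranked_results(results, flags):
    # One knob decides which path runs; operators can flip it live
    # (e.g. via etcd/consul) when the new code misbehaves.
    if flags.get("new_ranking"):
        return new_ranking(results)
    return old_ranking(results)
```

The same pattern extends to traffic flipping: a single flag (or load-balancer weight) routing requests to the old deployment is the essence of blue/green.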
A dumb retry policy between service RPCs will prevent the system from healing. ๏ Tell clients when and how to retry ๏ See: circuit breaker pattern ๏ Can you contain the failure? ๏ Is returning older/cached data O.K.?
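A non-dumb retry policy can be sketched as capped exponential backoff with full jitter, so a fleet of clients does not hammer a recovering service in lockstep. Function name and parameters are assumptions for illustration; a real client would also honor server retry hints and a circuit breaker.

```python
import random
import time

def call_with_backoff(rpc, max_attempts=4, base=0.1, cap=2.0):
    """Call `rpc`; on failure, wait a random delay in [0, min(cap, base*2^n)].

    The jitter spreads retries from many clients over time instead of
    producing synchronized retry storms that prevent the system from healing.
    """
    for attempt in range(max_attempts):
        try:
            return rpc()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the failure to the caller
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

A circuit breaker adds the complementary piece: after repeated failures, stop calling entirely for a while (possibly serving older/cached data) so the downstream service gets room to recover.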