get added/removed all the time ๏ services can discover each other ๏ services talk to each other via RPC/API ๏ machines go down/become unreachable ๏ services crash/become unresponsive ๏ you will see all sorts of weirdness
๏ loose coupling ๏ separation of concerns ๏ independently scalable services ๏ easy to write new services ๏ in different languages ๏ bugs & failures are more contained
release and roll out new versions for a service? how do you reschedule/move services between machines? where do you store application state? how do you move data between machines?
you discover new/removed instances for services? what is the target uptime/SLO (service level objective) for each service? how do you monitor service health? how and when do you alert humans?
it to deploy services? how do you respond to incidents? ๏ how (fast) do you identify the problem? ๏ how (fast) do you mitigate/repair? ๏ how do you debug or troubleshoot?
better than diverse ๏ A practice is easy to change ๏ Prevents useless discussions ๏ Have a single practice for everything: retry policy, secrets management, deployment tool, build system, test framework, OS/distro, RPC protocol, log format, monitoring software, …
๏ best language is what the team speaks ๏ (elegant & maintainable) > fast ๏ you don’t need fast ๏ you don’t need scalable ๏ hardware is cheaper than developers
you have a pool of machines (cluster) ๏ a set of services/tasks you want to run (and keep running, or periodically) ๏ you need an orchestrator/scheduler:
an intern deploy easily too? how confidently can you deploy? ๏ do you have enough tests? can you deploy without downtime? how long does it take to deploy all your company’s microservices?
you update configuration? can you redeploy all your stack as it was on a particular date? are your builds signed? does your code work the same on all your machines (hardware/OS etc)?
a commit from the source tree; not a dev machine. ๏ have a homogeneous execution environment on your machines ๏ same version of: pkgs, kernel, distro ๏ use Docker for reproducibility
running microservice in your cluster in your logs. ๏ run your microservices on read-only filesystems to prevent contamination. ๏ have homogeneous configuration for all instances of a service (etcd/consul/zk…)
(and automate) ๏ invest in tools that give you confidence ๏ conduct deployment drills and you will discover previously unknown bugs, unscripted deployment steps, and pain points
it works ๏ A correctly working program is a very special case. Failure is the default. ๏ Have massive visibility into your systems. ๏ An intern should be able to query anything about your system very easily. ๏ Monitoring is cheap. Being blind to an outage is expensive.
a lot more dimensions such as “version”, “instanceid”
http_requests{code=200, handler=new_user, method=get, version=2.0, id=3aebf531} 5310
http_requests{code=500, handler=new_user, method=get, version=2.0, id=3aebf531} 4
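The idea above — one counter name, many label dimensions — can be sketched without any metrics library. This is a toy stand-in (the class name `LabeledCounter` is an assumption, not a real API); Prometheus-style client libraries expose the same model.

```python
from collections import Counter

class LabeledCounter:
    """Counts events per unique combination of label values.

    Each distinct label set (code, handler, version, ...) becomes its
    own time series, exactly like the http_requests samples above.
    """
    def __init__(self, name):
        self.name = name
        self.samples = Counter()

    def inc(self, amount=1, **labels):
        # Sort label pairs so the same labels always map to the same series.
        key = tuple(sorted(labels.items()))
        self.samples[key] += amount

http_requests = LabeledCounter("http_requests")
http_requests.inc(code=200, handler="new_user", method="get", version="2.0")
http_requests.inc(code=500, handler="new_user", method="get", version="2.0")
```

Adding a dimension later (e.g. `instanceid`) is just another keyword argument; no schema change is needed.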
store results from counters. ๏ OpenTSDB, Graphite ๏ query: find the total error count for a specific region in the last 5 minutes: http_requests{code=500, service=search, region=westus}[5m]
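What such a time-windowed, label-filtered query does can be sketched in a few lines. This is a toy stand-in for a TSDB, assuming samples are `(timestamp, labels_dict, value)` tuples — the function name `query` and that layout are assumptions for illustration only.

```python
import time

def query(samples, window_seconds, **labels):
    """Sum sample values whose labels match, within the trailing window.

    Mimics a query like: http_requests{code=500, service=search}[5m]
    over an in-memory list instead of a real store like OpenTSDB.
    """
    cutoff = time.time() - window_seconds
    return sum(
        value
        for ts, sample_labels, value in samples
        if ts >= cutoff
        and all(sample_labels.get(k) == v for k, v in labels.items())
    )
```

For the slide's example, the call would be `query(samples, 300, code=500, service="search", region="westus")`.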
you SSH into PROD machines, you are totally doing it wrong. ๏ SSH is un-auditable (you cannot track what an engineer does on a machine) ๏ humans contaminate servers and break homogeneity
๏ use structured logging ๏ use open source tools for log collection, storage and querying. ๏ store logs forever if you can, for further analysis or auditing. otherwise logrotate.
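A minimal sketch of structured logging with only the standard library: each log line is one JSON object, so collection and querying tools can filter on fields instead of grepping free text. The `fields` attribute name is an assumption of this sketch, not a standard `logging` convention.

```python
import json
import logging
import sys

class JSONFormatter(logging.Formatter):
    """Emit one self-describing JSON object per log line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Extra structured fields attached via `extra={"fields": {...}}`.
            **getattr(record, "fields", {}),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment failed", extra={"fields": {"order_id": "o-123", "retries": 2}})
```

Because every line is valid JSON, the same logs can be shipped to open-source stacks (e.g. collectors that index JSON) without a custom parser per service.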
the correlation ID around to retrieve logs about a request from all services. ๏ You can put correlation IDs in headers. ๏ Attach user parameters to request contexts, measure latency per parameter, and identify outliers.
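Propagating a correlation ID via headers can be sketched as two small helpers: reuse the caller's ID if present, mint one at the edge otherwise, and copy it onto every downstream call. The header name `X-Correlation-ID` and the function names are assumptions; any stable, agreed-upon name works.

```python
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; pick one and standardize

def inbound(headers):
    """Reuse the caller's correlation ID, or mint one at the system edge."""
    return headers.get(HEADER) or str(uuid.uuid4())

def outbound(headers, correlation_id):
    """Stamp the ID onto a downstream RPC so logs from all services join up."""
    out = dict(headers)
    out[HEADER] = correlation_id
    return out
```

Every service logs the ID with each message; querying the log store for one ID then reconstructs the whole request path across services.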
will bring down the entire service. ๏ Have knobs/flags to disable features in production through configuration. ๏ When a bad deployment happens (in a rolling upgrade fashion) have ways to flip traffic to the old deployment. ๏ Search: blue/green deployments
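A feature knob can be as small as one configuration check guarding the risky code path; flipping the flag in the config store disables the feature without a redeploy. The flag name, function names, and ranking stubs below are all hypothetical.

```python
def old_ranking(results):
    """Stable, battle-tested path."""
    return sorted(results)

def new_ranking(results):
    """New, riskier path that the flag guards."""
    return sorted(results, reverse=True)

def ranked_results(results, flags):
    # One knob decides which path runs; operators can flip it live
    # (e.g. via etcd/consul) when the new code misbehaves.
    if flags.get("new_ranking"):
        return new_ranking(results)
    return old_ranking(results)
```

The same pattern extends to traffic flipping: a single flag (or load-balancer weight) routing requests to the old deployment is the essence of blue/green.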
A dumb retry policy between service RPCs will prevent the system from healing. ๏ Tell clients when and how to retry ๏ See: circuit breaker pattern ๏ Can you contain the failure? ๏ Is returning older/cached data O.K.?
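A non-dumb retry policy can be sketched as capped exponential backoff with full jitter, so a fleet of clients does not hammer a recovering service in lockstep. Function name and parameters are assumptions for illustration; a real client would also honor server retry hints and a circuit breaker.

```python
import random
import time

def call_with_backoff(rpc, max_attempts=4, base=0.1, cap=2.0):
    """Call `rpc`; on failure, wait a random delay in [0, min(cap, base*2^n)].

    The jitter spreads retries from many clients over time instead of
    producing synchronized retry storms that prevent the system from healing.
    """
    for attempt in range(max_attempts):
        try:
            return rpc()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the failure to the caller
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

A circuit breaker adds the complementary piece: after repeated failures, stop calling entirely for a while (possibly serving older/cached data) so the downstream service gets room to recover.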