
Nitesh Kant
September 25, 2015

The art of service discovery at scale at Strangeloop 2015

Whether it is a simple DNS lookup or a complex dedicated solution, service discovery is the backbone of any microservices architecture, and an immature solution can soon turn into an Achilles' heel.

In this talk, Nitesh Kant introduces the concept of service discovery and the use cases it solves in a complex service-based architecture. He then introduces Netflix's Eureka (https://github.com/Netflix/eureka), a highly available, multi-datacenter-aware service discovery solution built from scratch, covering its architecture and how it is unique in this space in favoring Availability over Consistency in the wake of network partitions.

Presented at Strangeloop 2015: http://www.thestrangeloop.com/2015/the-art-of-service-discovery-at-scale.html.

Video: https://www.youtube.com/watch?v=27ynM2tbNXM

Transcript

  1. The art of service discovery at scale. Nitesh Kant, Software Engineer, Netflix Edge Engineering. @NiteshKant
  2. Nitesh Kant: Who Am I?

    ❖ Engineer, Edge Engineering, Netflix.
    ❖ Core contributor, RxNetty (https://github.com/ReactiveX/RxNetty)
    ❖ Contributor, Zuul (https://github.com/Netflix/zuul)
    ❖ Ran Netflix’s Service Discovery for a year.
    ❖ Conceptualized and designed Eureka v2 (https://github.com/Netflix/eureka).
    @NiteshKant
  3. X

  4. Service Name | Nodes

    User Service | 10.10.1.1, 10.10.1.2, 10.10.1.3, 10.10.1.4
    Metadata Service | 10.10.2.1
    Recommendations Service | 10.10.3.1, 10.10.3.2, 10.10.3.3
  5. Service Name | Nodes

    User Service | 10.10.1.1, 10.10.1.2
    Metadata Service | 10.10.2.1, 10.10.2.2
    Recommendations Service | 10.10.3.1, 10.10.3.2
  6. Service Name | Nodes

    User Service | 10.10.1.1, 10.10.1.2
    Metadata Service | 10.10.2.1, 10.10.2.2
    Recommendations | 10.10.3.1, 10.10.3.2
    GET/PUT/DELETE (shown three times on the slide)
  7. Delivering Netflix. Edge Service; Ratings Service; Video Metadata Service; Bookmarks Service; Recommendations service. Disclaimer: This is an example and not an exact representation of the processing.
  8. Delivering Netflix. Edge Service; Recommendations service. Disclaimer: This is an example and not an exact representation of the processing.
  9. Delivering Netflix. Edge Service; Recommendations service; Service Discovery. "Which instances of recommendations service are available now?" Disclaimer: This is an example and not an exact representation of the processing.
  10. Service Name | Nodes

    User Service | 10.10.1.1, 10.10.1.2
    Metadata Service | 10.10.2.1, 10.10.2.2
    Recommendations Service | 10.10.3.1, 10.10.3.2
  11. Recs Service Instance and Service discovery. @Startup: "I am a recommendations service instance." @Shutdown: "I am not a recommendations service instance."
  12. Recs Service Instance and Service discovery. @Startup: "I am a recommendations service instance." @Shutdown: "I am not a recommendations service instance."
  13. Recs Service Instance and Service discovery. @Startup: "I am a recommendations service instance." @Shutdown: "I am not a recommendations service instance." In between: "I am alive."
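The exchange on slides 11 to 13 (register at startup, unregister at shutdown, heartbeat in between) could be sketched as below. This is a minimal in-memory illustration, not Eureka's API: the `Registry` class, its method names and the addresses are all hypothetical.

```python
import time


class Registry:
    """In-memory stand-in for a service discovery server (hypothetical)."""

    def __init__(self):
        # service name -> {instance address: time of last heartbeat}
        self._services = {}

    def register(self, service, address, now=None):
        """@Startup: 'I am an instance of <service>.'"""
        now = time.monotonic() if now is None else now
        self._services.setdefault(service, {})[address] = now

    def unregister(self, service, address):
        """@Shutdown: 'I am not an instance of <service>.'"""
        self._services.get(service, {}).pop(address, None)

    def heartbeat(self, service, address, now=None):
        """Periodic 'I am alive.' Returns False if the instance is unknown."""
        instances = self._services.get(service, {})
        if address not in instances:
            return False
        instances[address] = time.monotonic() if now is None else now
        return True

    def lookup(self, service):
        """Which instances of <service> are available now?"""
        return sorted(self._services.get(service, {}))


registry = Registry()
registry.register("recommendations", "10.10.3.1")
registry.register("recommendations", "10.10.3.2")
registry.unregister("recommendations", "10.10.3.2")
print(registry.lookup("recommendations"))
```

A heartbeat from an unknown instance returns `False` here so that the caller can decide to register again; that choice is an assumption of the sketch.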
  14. Network Partitions. Being available in a distributed system is very subjective. What is available to one node can be unavailable to another.
  15. Network Partitions. Being available in a distributed system is very subjective. What is available to one node can be unavailable to another. A node’s availability decision is best when it is local.
  16. Node status override. Edge Service; Ratings Service (Instance 1); Ratings Service (Instance 2); Service Discovery. "Take Instance 2 out of service."
  17. Node status override. Edge Service; Ratings Service (Instance 1); Ratings Service (Instance 2); Service Discovery. "Take Instance 2 out of service." (X on the diagram)
  18. Ephemeral. Recs Service Instance and Service discovery. @Startup: "I am a recommendations service." @Shutdown: "I am not a recommendations service." In dynamic environments, this duration is on the order of hours or days.
  19. Re-generatable data. Service Discovery data is the instance information. It is always available to the instance.
  20. Stateful Client-Server interaction. Recs Service Instance and Service discovery. @Startup: "I am a recommendations service instance." @Shutdown: "I am not a recommendations service instance." In between: "I am alive."
  21. Recs Service Instance and Service discovery. @Startup: "I am a recommendations service instance." @Shutdown: "I am not a recommendations service instance." In between: "I am alive." A connection-oriented, ordered and reliable protocol.
  22. Recs Service Instance and Service Discovery Node 1: Register, Heartbeat, Acks. Recs Service Instance and Service Discovery Node 2: Register, Heartbeat, Acks.
  23. Recs Service Instance and Service Discovery Node 1: Register, Heartbeat, Acks. Recs Service Instance and Service Discovery Node 2: Register, Heartbeat, Acks.
  24. Conflicts. Recs Service Instance and Service Discovery Node 1: IP: 10.10.2.1, Port: 7001, Status: Starting. Recs Service Instance and Service Discovery Node 2: IP: 10.10.2.1, Port: 7001, Status: UP.
  25. Conflicts. Service Discovery Node 3 holds two copies. From Service Discovery Node 1: IP: 10.10.2.1, Port: 7001, Status: Starting. From Service Discovery Node 2: IP: 10.10.2.1, Port: 7001, Status: UP.
  26. Recs Service Instance and Service Discovery Node 1. IP: 10.10.2.1, Port: 7001, Status: Starting. What should be the state of Recs Service Instance data now?
  27. Recs Service Instance and Service Discovery Instance 1. IP: 10.10.2.1, Port: 7001, Status: Starting. What should be the state of Recs Service Instance data now? Show tolerance towards broken connections; evicting an instance too early causes churn.
  28. Recs Service Instance and Service Discovery Instance 1. IP: 10.10.2.1, Port: 7001, Status: Starting. What should be the state of Recs Service Instance data now? Show tolerance towards broken connections; evicting an instance too early causes churn. Typically, wait a while for a reconnect before eviction.
  29. Recs Service Instance and Service Discovery Instance 1. IP: 10.10.2.1, Port: 7001, Status: Starting. What should be the state of Recs Service Instance data now? Show tolerance towards broken connections; evicting an instance too early causes churn. Typically, wait a while for a reconnect before eviction. Evict in the absence of reconnects after a while.
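The "wait a while, then evict" rule on slides 27 to 29 could look roughly like the sketch below. It assumes the waiting period is expressed as a heartbeat interval times a number of tolerated misses; the `Lease` type, the function name and the numbers in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    last_seen: float  # time of the last heartbeat or reconnect, in seconds


def evict_expired(leases, now, heartbeat_interval, tolerated_misses):
    """Remove instances that stayed silent longer than the tolerated window.

    leases maps instance id -> Lease. Returns the list of evicted ids.
    """
    window = heartbeat_interval * tolerated_misses
    expired = [iid for iid, lease in leases.items()
               if now - lease.last_seen > window]
    for iid in expired:
        del leases[iid]
    return expired


# Illustrative numbers only: 30 s heartbeats, 3 tolerated misses (90 s window).
leases = {"recs-1": Lease(last_seen=0.0), "recs-2": Lease(last_seen=50.0)}
evicted = evict_expired(leases, now=100.0, heartbeat_interval=30.0,
                        tolerated_misses=3)
```

An instance that reconnects inside the window simply refreshes `last_seen` and is never evicted, which is what avoids the churn the slides warn about.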
  30. Conflicts. Service Discovery Node 3 holds two copies. From Service Discovery Node 1: IP: 10.10.2.1, Port: 7001, Status: Starting. From Service Discovery Node 2: IP: 10.10.2.1, Port: 7001, Status: UP. (X on the diagram)
  31. Tolerate temporary conflicts. Service Discovery Node 1, Node 2 and Node 3 each send Updates. Copies held: Node 1 Data, Node 2 Data, Node 3 Data, all for IP: 10.10.2.1, Port: 7001, with Status: UP, Status: DOWN and Status: STARTING.
  32. Read from a version (till it is gone). Service Discovery Node 1, Node 2 and Node 3 each send Updates. Copies held: Node 1 Data, Node 2 Data, Node 3 Data, all for IP: 10.10.2.1, Port: 7001, with Status: UP, Status: DOWN and Status: STARTING. Read end.
  33. Read from a version (till it is gone). Service Discovery Node 1, Node 2 and Node 3 each send Updates. Copies held: Node 1 Data, Node 2 Data, Node 3 Data, all for IP: 10.10.2.1, Port: 7001, with Status: UP, Status: DOWN and Status: STARTING. Read end. (X on the diagram)
  34. Read from a version (till it is gone). Service Discovery Node 2 and Node 3 each send Updates. Copies held: Node 2 Data, Node 3 Data, both for IP: 10.10.2.1, Port: 7001, with Status: UP and Status: DOWN. Read end. (X on the diagram)
  35. Steady State. Service Discovery Node 2 sends Updates. Copy held: Node 2 Data, IP: 10.10.2.1, Port: 7001, Status: DOWN. Read end.
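One way to picture "read from a version (till it is gone)" on slides 31 to 35 is a holder that keeps one copy per source node and always answers reads from the same copy until that copy is removed. This is a sketch under assumptions: the `MultiSourceRecord` class is hypothetical, reads are served from the earliest-added copy, and the mapping of statuses to nodes is only illustrative because the transcript does not state it.

```python
class MultiSourceRecord:
    """Copies of one instance's data, one per source node that reported it."""

    def __init__(self):
        # source node -> status; a dict keeps insertion order, and updating
        # an existing key does not change its position.
        self._copies = {}

    def update(self, source, status):
        self._copies[source] = status

    def remove(self, source):
        """Drop a source's copy, for example after it was evicted as stale."""
        self._copies.pop(source, None)

    def read(self):
        """Return the status from the earliest-added copy still present."""
        for status in self._copies.values():
            return status
        return None


record = MultiSourceRecord()
record.update("node-1", "STARTING")
record.update("node-2", "DOWN")
record.update("node-3", "UP")

first = record.read()    # served from node-1's copy while it exists
record.remove("node-1")
second = record.read()   # node-1's copy is gone, so node-2's copy is read
record.remove("node-3")
steady = record.read()   # only node-2's copy remains (steady state)
```

Conflicting copies are tolerated rather than merged: readers see one consistent version at a time, and the conflict disappears as stale copies are removed.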
  36. Time to converge (worst case). Service Discovery Node 1, Node 2 and Node 3 each send Updates. Copies held: Node 1 Data, Node 2 Data, Node 3 Data, all for IP: 10.10.2.1, Port: 7001, with Status: UP, Status: DOWN and Status: STARTING. Read end. Latest; Oldest.
  37. Time to converge (worst case) = Time to replicate from the owner node + Time to evict stale copies (constant).
  38. Time to replicate from the owner node: somewhat bounded. Heartbeat interval * Number of tolerated missing heartbeats.
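The arithmetic on slides 37 and 38 can be worked through as follows. The transcript does not make explicit which term the heartbeat product belongs to; this sketch assumes it is the time to evict stale copies. The function name and every number are illustrative, not measurements from the talk.

```python
def worst_case_convergence(replication_delay_s, heartbeat_interval_s,
                           tolerated_missed_heartbeats):
    """Worst-case seconds until all nodes agree on one instance's data.

    Assumes: eviction window = heartbeat interval * tolerated missed heartbeats.
    """
    eviction_window_s = heartbeat_interval_s * tolerated_missed_heartbeats
    return replication_delay_s + eviction_window_s


# Illustrative inputs: 2 s to replicate, 30 s heartbeats, 3 tolerated misses.
total = worst_case_convergence(2, 30, 3)
print(total)  # 2 + 30 * 3 = 92
```

Under these assumptions the eviction term dominates, so shortening the heartbeat interval or tolerating fewer misses reduces convergence time at the cost of more premature evictions.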
  39. Service discovery data is an ordered stream. IP: 10.10.2.1, Port: 7001: Status: Starting, Status: UP, Status: DOWN.
  40. Service discovery data is a merged ordered stream. Each instance contributes Status: Starting, Status: UP, Status: DOWN:

    IP: 10.10.2.1 Port: 7001
    IP: 10.10.2. Port: 7001
    IP: 10.10.2.4 Port: 7001
    IP: 10.10.2.3 Port: 7001
    IP: 10.10.2.1 Port: 7001
    IP: 10.10.2.1 Port: 7001
  41. Data as a stream (Lookup). Edge Service Instance to Service Discovery Node 1: "Give me all recs service instances." Response: ID: 1, IP: 10.10.2.1, Port: 7001, Status: UP; ID: 2, IP: 10.10.2.2, Port: 7001, Status: Starting.
  42. Data as a stream (Lookup). Edge Service Instance to Service Discovery Node 1: "Give me all recs service instances." Response: ID: 1, IP: 10.10.2.1, Port: 7001, Status: UP; ID: 2, IP: 10.10.2.2, Port: 7001, Status: Starting. Then: ID: 1, Status: DOWN; ID: 2, Status: UP.
  43. Data as a stream (Lookup). Edge Service Instance to Service Discovery Node 1: "Give me all recs service instances." Response: ID: 1, IP: 10.10.2.1, Port: 7001, Status: UP; ID: 2, IP: 10.10.2.2, Port: 7001, Status: Starting. Then, as Data Diffs: ID: 1, Status: DOWN; ID: 2, Status: UP.
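The lookup on slides 41 to 43 can be read as one stream: full records first, then diffs that carry only the changed fields. A minimal sketch of a client folding that stream into a local view follows; the dictionary layout and the `apply_update` name are assumptions made for illustration.

```python
def apply_update(view, update):
    """Merge one stream element into the client's local view.

    view maps instance id -> dict of fields.
    update has an "id" key plus whichever fields are new or changed.
    """
    fields = {key: value for key, value in update.items() if key != "id"}
    view.setdefault(update["id"], {}).update(fields)
    return view


stream = [
    # Initial answer to "give me all recs service instances".
    {"id": 1, "ip": "10.10.2.1", "port": 7001, "status": "UP"},
    {"id": 2, "ip": "10.10.2.2", "port": 7001, "status": "STARTING"},
    # Later elements are diffs: only the changed field is sent.
    {"id": 1, "status": "DOWN"},
    {"id": 2, "status": "UP"},
]

view = {}
for element in stream:
    apply_update(view, element)
```

Because the stream is ordered, applying the elements in arrival order is enough to end up with the latest status while keeping the IP and port from the initial records.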
  44. Data as a stream (Replication). Service Discovery Node 1 and Service Discovery Node 2: "Give me all “non-replicated” instances." ID: 1, IP: 10.10.2.1, Port: 7001, Status: UP; ID: 2, IP: 10.10.2.2, Port: 7001, Status: Starting.
  45. Data as a stream (Replication). Service Discovery Node 1, Node 2, Node 3 and Node 4: "Give me all “non-replicated” instances."

    ID: 1 IP: 10.10.2.1 Port: 7001 Status: UP
    ID: 2 IP: 10.10.2.2 Port: 7001 Status: Starting
    ID: 3 IP: 10.10.2.3 Port: 7001 Status: UP
    ID: 4 IP: 10.10.2.4 Port: 7001 Status: Starting
    ID: 5 IP: 10.10.2.5 Port: 7001 Status: UP
    ID: 6 IP: 10.10.2.6 Port: 7001 Status: UP
    ID: 4 IP: 10.10.2.4 Port: 7001 Status: UP
  46. Service Discovery Node 1, Node 2 and Node 3 (four X marks on the diagram). Service Discovery nodes can talk to each other.
  47. Service Discovery Node 1, Node 2 and Node 3 (four X marks on the diagram). Service Discovery nodes can talk to each other. No outside instance can talk to a node.
  48. Service Discovery Node 1, Node 2 and Node 3 (five X marks on the diagram).
  49. Service Discovery Node 1, Node 2 and Node 3 (seven X marks on the diagram).
  50. One fine day … you lost most of your service instances … because … one node of service discovery was partitioned.
  51. Service Discovery Node 1, Node 2 and Node 3 (five X marks on the diagram).
  52. Service Discovery Node 1, Node 2 and Node 3 (four X marks on the diagram). Preserve the data.
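"Preserve the data" on slide 52 could be approximated by refusing to evict when too many instances appear to expire at once, since that pattern points to the discovery node itself being partitioned. The sketch below is an assumption about how such a guard might look; the function name and the 15% threshold are hypothetical and are not taken from the talk.

```python
def select_evictions(last_seen, now, window, max_evict_fraction=0.15):
    """Return instance ids to evict, or [] when too many expire at once.

    last_seen maps instance id -> time of the last heartbeat, in seconds.
    """
    expired = [iid for iid, seen in last_seen.items() if now - seen > window]
    if last_seen and len(expired) / len(last_seen) > max_evict_fraction:
        # A mass expiry more likely means this node is cut off than that
        # most instances died together, so keep the registry as it is.
        return []
    return expired


# Three of four instances look expired: treated as a partition, nothing evicted.
partitioned = select_evictions(
    {"a": 0.0, "b": 0.0, "c": 0.0, "d": 100.0}, now=100.0, window=90.0)

# One of ten instances looks expired: a normal eviction.
healthy_view = {f"i{n}": 100.0 for n in range(9)}
healthy_view["stale"] = 0.0
normal = select_evictions(healthy_view, now=100.0, window=90.0)
```

Keeping possibly stale entries favors availability: clients may see a dead instance for a while, but they do not lose the whole fleet because one discovery node was isolated.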
  53. Don’t make Service Discovery your Achilles' heel. Service discovery controls visibility of nodes but does not guarantee availability.