
ONOS Distributed Core and Performance - AT&T talk (May 7th)

Madan Jampani, ON.Lab

ONOS Project

May 07, 2015

Transcript

  1. Control Plane: Design Principles
     • Distributed ◦ Divide and conquer the problem space
     • Symmetric ◦ Instances are identical with respect to form and function
     • Fault-tolerant ◦ Handle failures seamlessly through built-in redundancy
     • Decentralized ◦ System is self-organizing and lacks any centralized control
     • Incrementally Scalable ◦ Capacity can be introduced gradually
     • Operational Simplicity ◦ Easy to deploy, no special hardware, no synchronized clocks, etc.
  2. Key Challenges
     • Preserve simple SDN abstractions without compromising performance at scale
     • Match problems to solutions that can meet their consistency/availability/durability needs
     • Provide strong consistency at scale
     • Expose distributed state management and coordination primitives as key application building blocks
  3. System Model
     [Slide diagram: applications running on top of a multi-instance ONOS cluster]
     • Each controller manages a portion of the network
     • Controllers communicate with each other via RPC
     • All core services are accessible on any instance
     • Applications are distribution transparent (see the sketch below)
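
As an illustration of the distribution transparency mentioned above, the sketch below queries the core DeviceService and MastershipService from an application: the calls look the same on every instance, even though each instance directly masters only a subset of the devices. The class name ClusterView is hypothetical.

    // Minimal sketch (illustrative, not from the talk): from an application's
    // point of view, core services behave identically on every instance.
    import org.onosproject.mastership.MastershipService;
    import org.onosproject.net.Device;
    import org.onosproject.net.device.DeviceService;

    public class ClusterView {

        /** Prints every device in the global view along with its current master. */
        public static void dump(DeviceService devices, MastershipService mastership) {
            for (Device device : devices.getDevices()) {
                // getDevices() returns the full network view, not just the devices
                // mastered by the local instance; the core handles the distribution.
                System.out.printf("%s mastered by %s%n",
                        device.id(), mastership.getMasterFor(device.id()));
            }
        }
    }
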
  4. Topology (Global Network View)
     • Partitioned by Controller ◦ no single controller has direct visibility over the entire network
     • Reasonable size ◦ can fit into main memory on a given controller
     • Read-intensive ◦ low latency access to the GNV is critical
     • Consistent with Environment ◦ incorporate network updates as soon as possible
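
A brief sketch of what read-intensive access to the GNV looks like from an application, using the ONOS TopologyService read API; the class name GnvReader is hypothetical.

    // Minimal sketch: reading the Global Network View on any instance. Reads go
    // against the locally held topology snapshot, so they are low-latency.
    import org.onosproject.net.DeviceId;
    import org.onosproject.net.Path;
    import org.onosproject.net.topology.Topology;
    import org.onosproject.net.topology.TopologyService;

    import java.util.Set;

    public class GnvReader {
        private final TopologyService topologyService;

        public GnvReader(TopologyService topologyService) {
            this.topologyService = topologyService;
        }

        /** Computes paths between two devices on the current topology snapshot. */
        public Set<Path> pathsBetween(DeviceId src, DeviceId dst) {
            // currentTopology() returns the latest snapshot of the GNV held in
            // main memory on this controller instance.
            Topology topo = topologyService.currentTopology();
            return topologyService.getPaths(topo, src, dst);
        }
    }
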
  5. Flow State / Intent State
     Flow State:
     • Per-switch state
     • Exhibits strong read locality
     • Authoritative source of what the data plane should contain
     • Backup data for quick recovery
     Intent State:
     • High-level application edicts
     • Intents are immutable and durable, while Intent states are eventually consistent
     • Topology events can trigger intent rerouting en masse
  6. Flow State / Intent State
     Flow State:
     • Primary/backup replication; the switch master is the primary
     • The backup location is the node that will most likely succeed the current master (see the sketch below)
     Intent State:
     • Fully replicated using an Eventually Consistent Map
     • Partitioned logical queues enable synchronization-free execution
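
The backup-placement rule for flow state can be sketched roughly as follows. This is an illustration built on the MastershipService candidate list, not the actual ONOS flow store code; the class name is hypothetical.

    // Rough sketch (assumption): pick the backup location for a device's flows
    // as the node most likely to succeed the current master, i.e. the first
    // standby in the mastership succession list.
    import org.onosproject.cluster.NodeId;
    import org.onosproject.mastership.MastershipService;
    import org.onosproject.net.DeviceId;

    import java.util.List;
    import java.util.Optional;

    public class FlowBackupPlacement {
        private final MastershipService mastershipService;

        public FlowBackupPlacement(MastershipService mastershipService) {
            this.mastershipService = mastershipService;
        }

        /** Returns the node that should hold the backup copy of the device's flows. */
        public Optional<NodeId> backupFor(DeviceId deviceId) {
            // backups() lists standby nodes in succession order; the first entry
            // is the node that would take over if the current master fails.
            List<NodeId> standbys = mastershipService.getNodesFor(deviceId).backups();
            return standbys.isEmpty() ? Optional.empty() : Optional.of(standbys.get(0));
        }
    }
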
  7. State Management Primitives
     • EventuallyConsistentMap<K, V>
     • ConsistentMap<K, V>
     • LeadershipService
     • AtomicCounter *
     • DistributedSet *
     * Will be available in the Cardinal release
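
A minimal sketch of how an application obtains the two map primitives through the StorageService builders. The map names are hypothetical, and exact builder signatures may differ between releases.

    // Sketch: the two map flavours an application typically chooses between.
    import org.onlab.util.KryoNamespace;
    import org.onosproject.store.serializers.KryoNamespaces;
    import org.onosproject.store.service.ConsistentMap;
    import org.onosproject.store.service.EventuallyConsistentMap;
    import org.onosproject.store.service.Serializer;
    import org.onosproject.store.service.StorageService;
    import org.onosproject.store.service.WallClockTimestamp;

    public class StateStores {

        /** Optimistically replicated map: low-latency updates that converge over time. */
        public static EventuallyConsistentMap<String, String> ecMap(StorageService storage) {
            return storage.<String, String>eventuallyConsistentMapBuilder()
                    .withName("my-ec-map")                         // hypothetical name
                    .withSerializer(KryoNamespace.newBuilder()
                            .register(KryoNamespaces.API))
                    .withTimestampProvider((k, v) -> new WallClockTimestamp())
                    .build();
        }

        /** Strongly consistent map: linearizable updates backed by consensus. */
        public static ConsistentMap<String, String> strongMap(StorageService storage) {
            return storage.<String, String>consistentMapBuilder()
                    .withName("my-strong-map")                     // hypothetical name
                    .withSerializer(Serializer.using(KryoNamespaces.API))
                    .build();
        }
    }
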
  8. Performance Metrics
     • Device & link sensing latency ◦ measure how fast the controller can react to environment changes, such as a switch or port going down, to rebuild the network graph and notify apps
     • Flow rule operations throughput ◦ measure how many flow rule operations can be issued against the controller and characterize the relationship of throughput with cluster size
     • Intent operations throughput ◦ measure how many intent operations can be issued against the controller cluster and characterize the relationship of throughput with cluster size
     • Intent operations latency ◦ measure how fast the controller can react to environment changes and reprovision intents on the data plane, and characterize scalability
  9. Topology Latency
     • Verify:
       ◦ observe the effect of distributed state management on latency
       ◦ react faster to negative events than to positive ones
     • Results consist of multiple parts:
       ◦ Switch up latency
       ◦ Switch down latency
       ◦ Link up/down latency
     • Experimental setup (see the sketch below):
       ◦ Two OVS switches connected to each other
       ◦ Events are generated from the switch
       ◦ Elapsed time is measured from the switch until ONOS triggers a corresponding topology event
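
A rough sketch of the measurement idea: a listener timestamps the ONOS topology event and compares it with the switch-side event time, which is assumed to be captured externally (e.g. from the OVS side). The class name is hypothetical and this is not the published test harness.

    // Sketch: report how long a data-plane event takes to surface as an ONOS
    // topology event.
    import org.onosproject.net.topology.TopologyEvent;
    import org.onosproject.net.topology.TopologyListener;
    import org.onosproject.net.topology.TopologyService;

    public class TopologyLatencyProbe {

        /** Registers a listener that reports elapsed time from the switch-side event. */
        public static void start(TopologyService topologyService, long switchEventMillis) {
            TopologyListener listener = new TopologyListener() {
                @Override
                public void event(TopologyEvent event) {
                    // event.time() is the instant ONOS published the topology change.
                    long latencyMs = event.time() - switchEventMillis;
                    System.out.printf("topology event %s after %d ms%n",
                            event.type(), latencyMs);
                }
            };
            topologyService.addListener(listener);
        }
    }
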
  10. Switch Up Latency
     • Most of the time is spent waiting for the switch to respond to a features request (~53 ms)
     • ONOS spends under 25 ms, with most of its time electing a master for the device ◦ which is a strongly consistent operation
  11. Switch Down Latency
     • Significantly faster because there is no negotiation with the switch
     • A terminating TCP connection unequivocally indicates that the switch is gone
  12. Link Up/Down Latency
     • The increase from single to multi-instance is being investigated
     • Since we use LLDP to discover links, it takes longer to discover a link coming up than going down
     • Port down events trigger immediate teardown of the link
  13. Flow Throughput
     • ONOS may have to provision flows for many devices
     • Objective is to understand how flow installation scales:
       ◦ with increased east/west communication within the cluster
       ◦ with the number of devices connected to each instance
     • Experimental setup (see the sketch below):
       ◦ Constant number of flows
       ◦ Constant number of devices attached to the cluster
       ◦ Mastership evenly distributed
       ◦ Variable number of flow installers
       ◦ Variable number of separate device masters traversed
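
A sketch in the spirit of the flow-installer side of this experiment (not the actual test tool): a batch of simple rules is pushed through the FlowRuleService, which relays them east/west to the device's master instance when the local instance is not the master. The class name and rule contents are illustrative.

    // Sketch: install 'count' simple output rules on one device.
    import org.onosproject.core.ApplicationId;
    import org.onosproject.net.DeviceId;
    import org.onosproject.net.PortNumber;
    import org.onosproject.net.flow.DefaultFlowRule;
    import org.onosproject.net.flow.DefaultTrafficSelector;
    import org.onosproject.net.flow.DefaultTrafficTreatment;
    import org.onosproject.net.flow.FlowRule;
    import org.onosproject.net.flow.FlowRuleService;

    public class FlowInstaller {

        public static void install(FlowRuleService flowRuleService, ApplicationId appId,
                                   DeviceId deviceId, int count) {
            FlowRule[] rules = new FlowRule[count];
            for (int i = 0; i < count; i++) {
                rules[i] = DefaultFlowRule.builder()
                        .forDevice(deviceId)
                        .withSelector(DefaultTrafficSelector.builder()
                                .matchInPort(PortNumber.portNumber(1)).build())
                        .withTreatment(DefaultTrafficTreatment.builder()
                                .setOutput(PortNumber.portNumber(2)).build())
                        .withPriority(40000 + (i % 1000))   // vary priority (illustrative only)
                        .fromApp(appId)
                        .makeTemporary(60)                  // expire after 60 seconds
                        .build();
            }
            // If this instance is not the device master, the request is relayed to
            // the master; that east/west hop is exactly what the experiment varies.
            flowRuleService.applyFlowRules(rules);
        }
    }
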
  14. Flow Throughput Results
     • A single instance can install over 500K flows per second
     • ONOS can handle 3M local and 2M non-local flow installations
     • With 1-3 ONOS instances, the flow setup rate remains constant no matter how many neighbours are involved
     • With more than 3 instances injecting load, flow performance drops off due to the extra coordination required
  15. Intent Framework Performance
     • Intents are high-level, network-level policy definitions ◦ e.g. provide connectivity between two hosts, or route all traffic that matches this prefix to this edge port
     • All objects are distributed for high availability
       ◦ Synchronously written to at least one other node
       ◦ Work is divided among all instances in the cluster
     • Intents must be compiled into device-specific rules ◦ Paths are computed and selected, reservations made, etc.
     • Device-specific rules are installed ◦ Leveraging other asynchronous subsystems (e.g. Flow Rule Service)
     • Intents react to network events ("reroute") ◦ e.g. device or link failure, host movement, etc.
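
A minimal sketch of the programming model described above: a host-to-host connectivity policy expressed as an intent and handed to the Intent Framework, which compiles it into device-specific flow rules. The class name is hypothetical.

    // Sketch: submit a high-level connectivity policy as a HostToHostIntent.
    import org.onosproject.core.ApplicationId;
    import org.onosproject.net.HostId;
    import org.onosproject.net.intent.HostToHostIntent;
    import org.onosproject.net.intent.IntentService;

    public class ConnectHosts {

        /** Requests connectivity between two hosts; compilation happens asynchronously. */
        public static HostToHostIntent connect(IntentService intentService, ApplicationId appId,
                                               HostId one, HostId two) {
            HostToHostIntent intent = HostToHostIntent.builder()
                    .appId(appId)
                    .one(one)
                    .two(two)
                    .build();
            // submit() returns once the intent is durably stored (replicated to at
            // least one other node); installation proceeds in the background.
            intentService.submit(intent);
            return intent;
        }
    }
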
  16. Intent Framework Latency
     • API calls are asynchronous and return after storing the intent
     • After an intent has been submitted, the framework starts compiling and installing
     • An event is generated after the framework confirms that the policy has been written to the devices
     • The experiment shows how quickly an application's policy can be reflected in the network ("install" and "withdraw"), as well as how long it takes to react to a network event ("reroute")
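
A rough sketch of how such a measurement could be wired up with an IntentListener (an illustrative harness, not the published test): the latency is the time from submit() until the framework fires the INSTALLED event for that intent.

    // Sketch: measure install latency for a single intent.
    import org.onosproject.net.intent.Intent;
    import org.onosproject.net.intent.IntentEvent;
    import org.onosproject.net.intent.IntentListener;
    import org.onosproject.net.intent.IntentService;

    import java.util.concurrent.CompletableFuture;

    public class IntentLatencyProbe {

        /** Submits the intent and completes the future with the install latency in ms. */
        public static CompletableFuture<Long> timeInstall(IntentService intents, Intent intent) {
            CompletableFuture<Long> latency = new CompletableFuture<>();
            long start = System.currentTimeMillis();

            IntentListener listener = new IntentListener() {
                @Override
                public void event(IntentEvent event) {
                    // INSTALLED fires only after the rules are confirmed on the devices.
                    if (event.type() == IntentEvent.Type.INSTALLED
                            && event.subject().key().equals(intent.key())) {
                        latency.complete(event.time() - start);
                    }
                }
            };
            intents.addListener(listener);
            intents.submit(intent);

            // Detach the listener once the measurement completes.
            return latency.whenComplete((v, t) -> intents.removeListener(listener));
        }
    }
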
  17. Intent Latency Results
     • Less than 100 ms to install or withdraw a batch of intents
     • Less than 50 ms to process and react to network events ◦ slightly faster because intent objects are already replicated
  18. Intent Framework Throughput
     • Dynamic networks undergo changes of policies (e.g. forwarding decisions) on an ongoing basis
     • The framework needs to be able to cope with a stream of requests and catch up if it ever falls behind
     • Capacity needs to scale with growth of the cluster