
10GbE時代のネットワークI/O高速化


Takuya ASADA

June 07, 2013



Transcript

  1. Too many interrupts
     (Diagram: the receive path across Process(User) / Process(Kernel) / HW Intr Handler / SW Intr Handler — hardware interrupt → packet reception → input queue → software interrupt scheduling → protocol processing → socket queue → socket receive processing → process wakeup → system call → copy to user space → user buffer → user program)
  2. Traditional packet receive processing
     (Diagram: the same receive path as above)
     hardware interrupt
      ↓
     packet queued on the input queue
      ↓
     software interrupt scheduled
  3. NAPI (hybrid scheme)
     (Diagram: the receive path, with the software interrupt handler polling and repeating until no packets remain)
     hardware interrupt
      ↓
     disable interrupts & start polling
      ↓
     re-enable interrupts once no packets remain
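The hybrid scheme above can be sketched as a minimal simulation in Python; the `Nic` class and its queue are hypothetical stand-ins for real hardware, not kernel API:

```python
from collections import deque

class Nic:
    """Hypothetical NIC model: a receive queue plus an interrupt mask."""
    def __init__(self, packets):
        self.rx_queue = deque(packets)
        self.irq_enabled = True

def napi_interrupt(nic, handled):
    # Hardware interrupt fires: disable further interrupts and start polling.
    nic.irq_enabled = False
    while nic.rx_queue:                 # repeat until no packets remain
        handled.append(nic.rx_queue.popleft())
    nic.irq_enabled = True              # queue drained: re-enable interrupts

handled = []
nic = Nic(["pkt1", "pkt2", "pkt3"])
napi_interrupt(nic, handled)
print(handled)   # every queued packet handled inside one interrupt
```

Packets that arrive while polling is in progress are picked up by the same loop, which is what keeps the interrupt rate down under load.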
  4. Effect of Interrupt Coalescing
     • Compared Interrupt Coalescing disabled vs. enabled (automatic interrupt-rate tuning) on an Intel 82599 (ixgbe)
     • MultiQueue, GRO, LRO, etc. disabled
     • Measured with iperf in TCP mode

                interrupts    throughput  packets       CPU% (sy+si)
     disabled   46687 int/s   7.82 Gbps   660386 pkt/s  97.6%
     enabled    7994 int/s    8.24 Gbps   711132 pkt/s  79.6%
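Interrupt coalescing is typically controlled through `ethtool -C`; a hedged example (the device name and the delay value are illustrative, and flag support varies by driver):

```shell
# Enable adaptive interrupt moderation (the automatic rate tuning above)
ethtool -C ix0 adaptive-rx on

# Or pin a fixed delay: wait up to 100 us before raising an RX interrupt
ethtool -C ix0 adaptive-rx off rx-usecs 100

# Inspect the current coalescing settings
ethtool -c ix0
```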
  5. 2. Protocol processing is heavy
     (Diagram: the NAPI receive path shown earlier)
  6. TOE (TCP Offload Engine)
     • Stop doing protocol processing in the OS and do it on the NIC instead
     • Drawbacks:
       • Security: if a security hole appears in the TOE, the OS side cannot do anything about it
       • Complexity: replacing the OS network stack with TOE requires very wide-ranging changes
       • TOE implementations differ between vendors, making a common interface hard to define
     • Linux: no plans to support it
  7. Effect of Checksum Offloading
     • Compared on an Intel 82599 (ixgbe)
     • Measured with iperf in TCP mode
     • MultiQueue disabled
     • ethtool -K ix0 rx off

                throughput  CPU% (sy+si)
     disabled   8.27 Gbps   86
     enabled    8.27 Gbps   85.2
  8. Effect of GRO
     • Compared on an Intel 82599 (ixgbe)
     • MultiQueue disabled
     • Measured with iperf in TCP mode
     • ethtool -K ix0 gro off

                packets       network stack calls  throughput  CPU% (sy+si)
     disabled   632139 pkt/s  632139 call/s        7.30 Gbps   97.6%
     enabled    712387 pkt/s  47957 call/s         8.25 Gbps   79.6%
  9. TSO (TCP Segmentation Offload)
     • The inverse of LRO
     • Transmit packets without fragmenting them; the NIC splits them to MTU size
     • The OS can skip the packet-segmentation work
     • Linux supports software GSO and hardware TSO / UFO
 10. Effect of TSO
     • Compared on an Intel 82599 (ixgbe)
     • MultiQueue disabled
     • Measured with iperf in TCP mode
     • ethtool -K ix0 gso off tso off

                packets       throughput  CPU% (sy+si)
     disabled   247794 pkt/s  2.87 Gbps   53.5%
     enabled    713127 pkt/s  8.16 Gbps   26.8%
 11. 3. We want to process packets on multiple CPUs
     (Diagram: the NAPI receive path duplicated side by side for cpu0 and cpu1)
 12. What is the software interrupt part?
     (Diagram: the NAPI receive path, with the span from polling through protocol processing highlighted)
     From polling through protocol processing
     → the bulk of network I/O work
 13. TCP Reordering
     (Diagram: out-of-order segments are parked in a reorder queue before protocol processing delivers them to the user buffer)
     • If packets arrive out of order, they must be put back in order (a reordering pass)
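The reordering work can be sketched as a minimal Python model of a reorder queue that holds out-of-order segments until the gap is filled (sequence numbers are simplified to consecutive integers; real TCP tracks byte offsets):

```python
def deliver_in_order(segments, expected=0):
    """Buffer out-of-order segments; release them once the gap is filled."""
    reorder_queue = {}          # seq -> payload, parked until deliverable
    delivered = []
    for seq, payload in segments:
        reorder_queue[seq] = payload
        # Drain every segment that is now contiguous with what we delivered.
        while expected in reorder_queue:
            delivered.append(reorder_queue.pop(expected))
            expected += 1
    return delivered

# Segments 1 and 2 arrive before segment 0 and must wait in the queue.
print(deliver_in_order([(1, "b"), (2, "c"), (0, "a")]))  # ['a', 'b', 'c']
```

This buffering is why spreading one flow's packets across CPUs is costly: segments processed on different CPUs can arrive at the socket out of order and all pay this extra pass.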
 14. RSS (Receive Side Scaling)
     • A NIC with a separate receive queue per CPU (called a MultiQueue NIC)
     • Each receive queue has its own independent interrupt
     • Packets of the same flow go to the same queue; packets of different flows are spread across different queues as much as possible
       → the destination queue is chosen by computing a hash over the packet header
 15. Packet dispatch with RSS
     (Diagram: on packet arrival the NIC computes a hash, looks it up in the hash-to-queue table, and dispatches to RX Queue #0–#3; each queue interrupts its own CPU (cpu0–cpu3) for receive processing)
 16. Queue selection procedure
     indirection_table[64] = initial_value
     input[12] = {src_addr, dst_addr, src_port, dst_port}
     key = toeplitz_hash(input, 12)
     index = key & 0x3f
     queue = indirection_table[index]
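The procedure above can be made concrete with a sketch of the Toeplitz hash and the indirection-table lookup in Python; the 40-byte key, the addresses, and the ports are illustrative values, not a vendor default:

```python
def toeplitz_hash(data: bytes, key: bytes) -> int:
    """32-bit Toeplitz hash: for each set input bit, XOR in the current
    32-bit window of the key, then slide the window one bit to the right."""
    key_bits = int.from_bytes(key, "big")   # key as one big integer
    key_len = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                offset = key_len - 32 - (i * 8 + bit)
                result ^= (key_bits >> offset) & 0xFFFFFFFF
    return result

# 12-byte input: src_addr, dst_addr, src_port, dst_port (values illustrative)
packet_input = (bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 2]) +
                (10000).to_bytes(2, "big") + (10001).to_bytes(2, "big"))
rss_key = bytes(range(40))                       # 40-byte hash key, illustrative
indirection_table = [i % 4 for i in range(64)]   # 64 entries spread over 4 queues

key32 = toeplitz_hash(packet_input, rss_key)
queue = indirection_table[key32 & 0x3F]          # low 6 bits index the table
print(queue)
```

Because the hash covers the 4-tuple, every packet of one TCP flow lands on the same queue, while different flows scatter across the table.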
 17. (Diagram: the same dispatch done in software — cpu0 takes the hardware interrupt, computes the hash, consults the hash-to-queue table, and dispatches packets via inter-CPU interrupts to per-CPU backlog queues (backlog #1–#3) on cpu1–cpu3, where protocol processing, socket receive processing, and the copy to user space then run)
 18. RPS netperf result
     netperf benchmark result on lwn.net:
     e1000e on 8 core Intel
       Without RPS: 90K tps at 33% CPU
       With RPS:    239K tps at 60% CPU
     forcedeth on 16 core AMD
       Without RPS: 103K tps at 15% CPU
       With RPS:    285K tps at 49% CPU
 19. How to use RFS
     # echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus
     # echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
     # echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
 20. RFS netperf result
     netperf benchmark result on lwn.net:
     e1000e on 8 core Intel
       No RFS or RPS:            104K tps at 30% CPU
       No RFS (best RPS config): 290K tps at 63% CPU
       RFS:                      303K tps at 61% CPU

     RPC test       tps   CPU%  50/90/99% usec latency  StdDev
     No RFS or RPS  103K  48%   757/900/3185            4472.35
     RPS only       174K  73%   415/993/2468            491.66
     RFS            223K  73%   379/651/1382            315.61
 21. Manual filter setup with Flow Steering
     # ethtool --config-nfc ix00 flow-type tcp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 \
         src-port 10000 dst-port 10001 action 6
     Added rule with ID 2045
 22. How to use XPS
     # echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
     # echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus
     # echo 4 > /sys/class/net/eth0/queues/tx-2/xps_cpus
     # echo 8 > /sys/class/net/eth0/queues/tx-3/xps_cpus
 23. Intel Data Direct I/O Technology
     • Packet data DMA'ed by the NIC always takes a cache miss the first time the CPU touches it
       ↓
     • So DMA straight into the CPU's LLC (L3 cache)!
     • Supported by recent Xeons with Intel 10GbE
     • No OS support needed (the hardware provides it transparently)
 24. Copying is heavy
     (Diagram: the traditional receive path, highlighting the copy to user space)
 25. Intel QuickData Technology
     • (also known as Intel I/O AT)
     • DMA transfer from the NIC buffer directly to the application buffer
     • Reduces CPU load
     • Implemented in the chipset
     • CONFIG_NET_DMA=y in Linux
 26. Basic mechanism
     • Using a dedicated NIC driver and a dedicated library, mmap the NIC receive buffers
     • Poll for packets
     • Run application-specific processing on each packet
     (Diagram: the app mmaps the NIC rings RX1–RX3 through the kernel driver, polls for packets, and does its work)
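The mechanism above can be sketched as a simulation — a minimal Python model where a shared ring stands in for the mmap'ed NIC receive ring (the `RxRing` class is a hypothetical stand-in for the vendor driver/library, not a real API):

```python
class RxRing:
    """Hypothetical stand-in for an mmap'ed NIC receive ring."""
    def __init__(self, slots):
        self.slots = slots      # fixed-size ring of packet buffers
        self.head = 0           # next slot the NIC will fill (owned by "HW")
        self.tail = 0           # next slot the application will consume

    def available(self):
        return self.tail != self.head

def poll_ring(ring, process):
    """Busy-poll the ring: no interrupt, no syscall, no copy into the kernel."""
    handled = 0
    while ring.available():
        process(ring.slots[ring.tail])          # app-specific packet handling
        ring.tail = (ring.tail + 1) % len(ring.slots)
        handled += 1
    return handled

ring = RxRing(["pkt-a", "pkt-b", "pkt-c", None])
ring.head = 3                                   # "NIC" has filled three slots
seen = []
print(poll_ring(ring, seen.append))             # → 3
print(seen)                                     # → ['pkt-a', 'pkt-b', 'pkt-c']
```

In a real deployment the application would spin on `poll_ring` (or sleep briefly when the ring is empty); the point is that the packet buffers are touched in place, in user space.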
 27. Intel DPDK
     • Drop interrupts in favor of polling to cut overhead
     • Use HugePages for receive buffers to reduce TLB misses
     • L3 forwarding performance with 64-byte packets (from Intel materials):
       • Linux network stack: Xeon E5645 x 2 → 12.2 Mpps
       • DPDK: Xeon E5645 x 1 → 35.2 Mpps
       • DPDK: next-generation Intel processor x 1 → 80 Mpps
     • OpenvSwitch support
     • Supported NICs: Intel