Speeding Up Network I/O in the 10GbE Era (10GbE時代のネットワークI/O高速化)

Takuya ASADA

June 07, 2013

Transcript

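  1. 1. Too many interrupts

    [Diagram: receive path across Process(User), Process(Kernel), HW Intr Handler and SW Intr Handler. Labels: hardware interrupt, packet receive, input queue, software interrupt scheduling, protocol processing, socket receive processing, socket queue, process wakeup, system call, copy to user space, user buffer, user program.]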
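  2. Traditional packet receive processing

    [Diagram: receive path across Process(User), Process(Kernel), HW Intr Handler and SW Intr Handler. Labels: hardware interrupt, packet receive, input queue, software interrupt scheduling, protocol processing, socket receive processing, socket queue, process wakeup, system call, copy to user space, user buffer, user program.]

    Hardware interrupt
     ↓
    Enqueue the packet on the receive queue
     ↓
    Schedule a software interrupt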
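  3. NAPI (hybrid approach)

    [Diagram: receive path across Process(User), Process(Kernel), HW Intr Handler and SW Intr Handler. Labels: hardware interrupt, disable interrupts, software interrupt scheduling, packet receive (repeat until no packets remain), protocol processing, socket receive processing, socket queue, process wakeup, system call, copy to user space, user buffer, user program.]

    Hardware interrupt
     ↓
    Disable interrupts and start polling
     ↓
    Re-enable interrupts once no packets remain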
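The hybrid scheme can be sketched as a toy simulation. Everything in it is an assumption made for illustration: the function names, the arrival-time model and the fixed per-packet processing cost are hypothetical and are not the kernel's NAPI API. It only counts how many hardware interrupts fire for the same arrival pattern with one interrupt per packet versus disable-then-poll.

```python
def count_interrupts_per_packet(arrivals):
    """Traditional model: every arriving packet raises one hardware interrupt."""
    return len(arrivals)


def count_interrupts_napi(arrivals, cost):
    """Hybrid model: the first packet raises an interrupt, then interrupts stay
    disabled while we poll until no packet is pending.

    arrivals: sorted arrival times; cost: time needed to process one packet.
    """
    interrupts = 0
    i = 0
    while i < len(arrivals):
        # Interrupts are enabled: the next arrival raises a hardware interrupt.
        interrupts += 1
        now = arrivals[i]
        # Interrupts are disabled: keep polling while a packet has already arrived.
        while i < len(arrivals) and arrivals[i] <= now:
            now += cost
            i += 1
        # The queue is empty: interrupts are enabled again.
    return interrupts


burst = [0, 1, 2, 3, 100, 101]   # two bursts of back-to-back packets
sparse = [0, 10, 20]             # packets far apart

print(count_interrupts_per_packet(burst), count_interrupts_napi(burst, cost=2))
print(count_interrupts_per_packet(sparse), count_interrupts_napi(sparse, cost=2))
```

In this model a burst costs one interrupt instead of one per packet, while widely spaced packets still get one interrupt each, which is the behavior the slide describes.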
  4. Effect of Interrupt Coalescing
    • Compared on Intel 82599 (ixgbe) with Interrupt Coalescing disabled and enabled (automatic interrupt-rate tuning)
    • MultiQueue, GRO, LRO, etc. disabled
    • Measured with iperf in TCP mode

                interrupts    throughput   packets        CPU% (sy+si)
    Disabled    46687 int/s   7.82 Gbps    660386 pkt/s   97.6%
    Enabled      7994 int/s   8.24 Gbps    711132 pkt/s   79.6%
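One way to read these measurements is as packets handled per interrupt. The short calculation below uses only the figures on this slide; the helper name `packets_per_interrupt` is introduced here for illustration.

```python
def packets_per_interrupt(pkt_per_s, int_per_s):
    """Average number of packets handled for each hardware interrupt."""
    return pkt_per_s / int_per_s


# Figures taken from the slide (Intel 82599, iperf TCP mode).
disabled = packets_per_interrupt(660386, 46687)
enabled = packets_per_interrupt(711132, 7994)

print(f"coalescing disabled: {disabled:.1f} packets/interrupt")
print(f"coalescing enabled:  {enabled:.1f} packets/interrupt")
```

With coalescing enabled each interrupt covers roughly six times as many packets, which lines up with the drop in CPU usage reported on the slide.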
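  5. 2. Protocol processing is heavy

    [Diagram: NAPI receive path across Process(User), Process(Kernel), HW Intr Handler and SW Intr Handler. Labels: hardware interrupt, disable interrupts, software interrupt scheduling, packet receive (repeat until no packets remain), protocol processing, socket receive processing, socket queue, process wakeup, system call, copy to user space, user buffer, user program.]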
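  6. TOE (TCP Offload Engine)
    • Stop doing protocol processing in the OS and do it on the NIC instead
    • Drawbacks
      • Security: if a security hole appears in the TOE, it cannot be dealt with from the OS side
      • Complexity: replacing the OS network stack with a TOE requires fairly wide-ranging changes. TOE implementations differ between vendors, so defining a common interface is difficult
    • Linux: no plans to support it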
  7. Effect of Checksum Offloading
    • Compared on Intel 82599 (ixgbe)
    • Measured with iperf in TCP mode
    • MultiQueue disabled
    • ethtool -K ix0 rx off

                throughput   CPU% (sy+si)
    Disabled    8.27 Gbps    86
    Enabled     8.27 Gbps    85.2
  8. Effect of GRO
    • Compared on Intel 82599 (ixgbe)
    • MultiQueue disabled
    • Measured with iperf in TCP mode
    • ethtool -K ix0 gro off

                packets        network stack called count   throughput   CPU% (sy+si)
    Disabled    632139 pkt/s   632139 call/s                7.30 Gbps    97.6%
    Enabled     712387 pkt/s    47957 call/s                8.25 Gbps    79.6%
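The drop in network stack calls comes from merging received segments before they are handed to the stack. The sketch below is a deliberately simplified model of that idea, not the kernel's GRO code: the function name `gro_coalesce`, the `(flow, seq, length)` tuples and the 65535-byte cap are assumptions, and only adjacent in-order segments of the same flow are merged.

```python
def gro_coalesce(segments, max_merged=65535):
    """Merge runs of consecutive, in-order segments that belong to one flow.

    segments: list of (flow, seq, length) tuples in arrival order.
    Returns the list of merged segments that would each cost one stack call.
    """
    merged = []
    for flow, seq, length in segments:
        if merged:
            m_flow, m_seq, m_len = merged[-1]
            contiguous = m_flow == flow and m_seq + m_len == seq
            if contiguous and m_len + length <= max_merged:
                # Extend the previous segment instead of starting a new one.
                merged[-1] = (m_flow, m_seq, m_len + length)
                continue
        merged.append((flow, seq, length))
    return merged


arrived = [
    ("A", 0, 1448), ("A", 1448, 1448), ("A", 2896, 1448),
    ("B", 0, 1448),
    ("A", 4344, 1448),
]
to_stack = gro_coalesce(arrived)
print(len(arrived), "packets ->", len(to_stack), "stack calls")
```

In the measurement on this slide, 712387 pkt/s against 47957 call/s works out to roughly 15 packets per stack call with GRO enabled.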
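  9. TSO (TCP Segmentation Offload)
    • The reverse of LRO
    • The OS sends the packet without fragmenting it, and the NIC splits it into MTU-sized packets
    • The OS can skip the packet segmentation work
    • Linux supports GSO in software and TSO/UFO in hardware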
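The splitting step that TSO moves onto the NIC (and GSO defers in software) can be sketched as follows. This is an illustration only: the function name `segment_payload` is hypothetical, header construction and checksums are left out, and the MSS of 1460 bytes assumes a 1500-byte MTU with 20-byte IPv4 and 20-byte TCP headers and no options.

```python
def segment_payload(payload, mss, first_seq=0):
    """Split one large send buffer into (sequence number, chunk) pairs of at
    most mss bytes each, as a NIC with TSO would do before transmission."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [
        (first_seq + offset, payload[offset:offset + mss])
        for offset in range(0, len(payload), mss)
    ]


big_send = bytes(4000)             # one large buffer handed down by the OS
segments = segment_payload(big_send, mss=1460)
for seq, chunk in segments:
    print(seq, len(chunk))
```

The OS hands over a single 4000-byte buffer and the per-segment work happens in one place, which is the saving the slide attributes to TSO.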
  10. Effect of TSO
    • Compared on Intel 82599 (ixgbe)
    • MultiQueue disabled
    • Measured with iperf in TCP mode
    • ethtool -K ix0 gso off tso off

                packets        throughput   CPU% (sy+si)
    Disabled    247794 pkt/s   2.87 Gbps    53.5%
    Enabled     713127 pkt/s   8.16 Gbps    26.8%
  11. 3. We want to process packets on multiple CPUs

    [Diagram: the NAPI receive path drawn twice, once for cpu0 and once for cpu1. Each copy spans Process(User), Process(Kernel), HW Intr Handler and SW Intr Handler, with the labels hardware interrupt, disable interrupts, software interrupt scheduling, packet receive (repeat until no packets remain), protocol processing, socket receive processing, socket queue, process wakeup, system call, copy to user space, user buffer, user program.]
  12. What is a software interrupt?

    [Diagram: NAPI receive path across Process(User), Process(Kernel), HW Intr Handler and SW Intr Handler, with the SW Intr Handler part highlighted.]

    From polling through protocol processing
    → the bulk of network I/O
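  13. TCP Reordering

    [Diagram: numbered packets passing through protocol processing and a reorder queue into the user buffer.]

    • If packets arrive out of order, they have to be put back in order (reordered)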
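The reorder queue in the diagram can be pictured with a minimal sketch. The class name `ReorderQueue` and its byte-offset sequence numbers are assumptions for illustration; a real TCP stack also handles overlapping segments, window limits and sequence number wraparound, none of which appear here.

```python
class ReorderQueue:
    """Hold out-of-order segments until the missing ones arrive."""

    def __init__(self, next_seq=0):
        self.next_seq = next_seq   # next byte offset the user buffer expects
        self.pending = {}          # seq -> payload that arrived too early

    def push(self, seq, payload):
        """Accept one segment and return the payloads that are now in order."""
        if seq < self.next_seq or seq in self.pending:
            return []              # duplicate: already delivered or already queued
        self.pending[seq] = payload
        deliverable = []
        while self.next_seq in self.pending:
            data = self.pending.pop(self.next_seq)
            deliverable.append(data)
            self.next_seq += len(data)
        return deliverable


q = ReorderQueue()
first = q.push(0, b"aa")    # in order: delivered immediately
early = q.push(4, b"cc")    # arrived early: parked in the queue
filled = q.push(2, b"bb")   # fills the gap: both segments are released
print(first, early, filled)
```

Every segment that arrives early costs an extra queue insertion and a later drain, which is why keeping packets of one flow in order matters when several CPUs process packets.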
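  14. RSS (Receive Side Scaling)
    • A NIC that has a separate receive queue per CPU (called a MultiQueue NIC)
    • Each receive queue has its own independent interrupt
    • Packets belonging to the same flow go to the same queue; packets belonging to different flows are spread across different queues as far as possible
      → The destination queue is chosen by computing a hash of the packet header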
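  15. Packet distribution with RSS

    [Diagram: packets arrive at the NIC, which computes a hash, looks up a hash → queue table and dispatches each packet to one of RX Queue #0 to RX Queue #3. Each queue interrupts its own CPU (cpu0 to cpu3), which runs the receive processing.]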
  16. Queue selection procedure

    indirection_table[64] = initial_value
    input[12] = {src_addr, dst_addr, src_port, dst_port}
    key = toeplitz_hash(input, 12)
    index = key & 0x3f
    queue = indirection_table[index]
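The pseudocode above can be turned into a small runnable sketch. Several things here are assumptions and not taken from the slides: the 40-byte `RSS_KEY` is an arbitrary value picked for this example (a real NIC is programmed with a key by its driver), the indirection table is simply filled round-robin over four queues, and the slide's variable `key` is renamed `hash_value` to keep it apart from the hash key.

```python
import socket
import struct

# Hypothetical hash key, chosen only for this sketch.
RSS_KEY = bytes(range(1, 41))

# Assumed initial value: 64 entries spread round-robin over 4 RX queues.
INDIRECTION_TABLE = [i % 4 for i in range(64)]


def toeplitz_hash(key, data):
    """For every set bit i of data (most significant bit first), XOR in the
    32-bit window of the key that starts at key bit i."""
    key_bits = len(key) * 8
    data_bits = len(data) * 8
    if key_bits < data_bits + 31:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(data_bits):
        if data[i // 8] & (0x80 >> (i % 8)):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def select_queue(src_addr, dst_addr, src_port, dst_port):
    # input[12] = {src_addr, dst_addr, src_port, dst_port}, network byte order
    data = (socket.inet_aton(src_addr) + socket.inet_aton(dst_addr)
            + struct.pack("!HH", src_port, dst_port))
    hash_value = toeplitz_hash(RSS_KEY, data)
    index = hash_value & 0x3F          # low 6 bits index the 64-entry table
    return INDIRECTION_TABLE[index]


queue = select_queue("10.0.0.1", "10.0.0.2", 10000, 10001)
print("RX queue:", queue)
```

Because the hash depends only on the four header fields, every packet of a flow lands on the same queue, and rewriting entries of the indirection table moves flows between queues without touching the hash.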
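  17. [Diagram: four CPUs, cpu0 to cpu3. cpu0 takes the hardware interrupt, disables interrupts and, in the software interrupt, receives packets, computes a hash, looks up a hash → queue table and dispatches packets to backlog #1, backlog #2 and backlog #3 on the other CPUs using inter-CPU interrupts. Each CPU then runs protocol processing, socket receive processing, socket queue, process wakeup, system call and copy to user space into the user buffer of the user program.]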
  18. RPS netperf result
    netperf benchmark result on lwn.net:

    e1000e on 8 core Intel
      Without RPS: 90K tps at 33% CPU
      With RPS: 239K tps at 60% CPU

    forcedeth on 16 core AMD
      Without RPS: 103K tps at 15% CPU
      With RPS: 285K tps at 49% CPU
  19. How to use RFS

    # echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus
    # echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
    # echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
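The value written to `rps_cpus` is a hexadecimal CPU bitmask, so `"f"` selects CPUs 0 to 3. A small helper for building such masks might look like the sketch below; the name `cpus_to_mask` is hypothetical, and the sketch only covers masks that fit in a single group (for larger CPU counts the kernel's bitmap format uses comma-separated 32-bit groups, which this does not produce).

```python
def cpus_to_mask(cpus):
    """Return the hex bitmask string with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPUs 0 to 31")
        mask |= 1 << cpu
    return format(mask, "x")


print(cpus_to_mask([0, 1, 2, 3]))   # the value used on the slide
print(cpus_to_mask([2, 3]))
```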
  20. RFS netperf result
    netperf benchmark result on lwn.net:

    e1000e on 8 core Intel
      No RFS or RPS: 104K tps at 30% CPU
      No RFS (best RPS config): 290K tps at 63% CPU
      RFS: 303K tps at 61% CPU

    RPC test        tps    CPU%   50/90/99% usec latency   StdDev
    No RFS or RPS   103K   48%    757/900/3185             4472.35
    RPS only        174K   73%    415/993/2468             491.66
    RFS             223K   73%    379/651/1382             315.61
  21. Manual filter configuration with Flow Steering

    # ethtool --config-nfc ix00 flow-type tcp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 10000 dst-port 10001 action 6
    Added rule with ID 2045
  22. How to use XPS

    # echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
    # echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus
    # echo 4 > /sys/class/net/eth0/queues/tx-2/xps_cpus
    # echo 8 > /sys/class/net/eth0/queues/tx-3/xps_cpus
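The four commands pin tx-0 to CPU 0, tx-1 to CPU 1 and so on, because each value is a hexadecimal mask with a single bit set. The sketch below generates the same path and value pairs for any queue count; the function name `xps_one_cpu_per_queue` is hypothetical, and the code only builds the strings without writing to sysfs.

```python
def xps_one_cpu_per_queue(dev, n_queues):
    """Return (sysfs path, hex mask) pairs mapping TX queue q to CPU q."""
    return [
        (f"/sys/class/net/{dev}/queues/tx-{q}/xps_cpus", format(1 << q, "x"))
        for q in range(n_queues)
    ]


settings = xps_one_cpu_per_queue("eth0", 4)
for path, mask in settings:
    print(f"echo {mask} > {path}")
```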
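  23. Intel Data Direct I/O Technology
    • Packet data that the NIC has written by DMA always causes a cache miss the first time the CPU accesses it
         ↓
    • So DMA it straight into the CPU's LLC (L3 cache)!
    • Supported on newer Xeons with Intel 10GbE
    • No OS support required (the hardware provides the feature transparently)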
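  24. Copying is heavy

    [Diagram: receive path across Process(User), Process(Kernel), HW Intr Handler and SW Intr Handler. Labels: hardware interrupt, packet receive, input queue, software interrupt scheduling, protocol processing, socket receive processing, socket queue, process wakeup, system call, copy to user space, user buffer, user program.]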
  25. Intel QuickData Technology
    • (also called Intel I/O AT)
    • DMA transfer from the NIC buffer to the application buffer
    • Reduces CPU load
    • Implemented in the chipset
    • CONFIG_NET_DMA=y in Linux
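  26. Basic mechanism
    • Use a dedicated NIC driver and a dedicated library to MMAP the NIC's receive buffers
    • Poll for packets
    • Run application-specific processing on the packets

    [Diagram: the NIC's RX1 to RX3 rings are exposed by the kernel driver and MMAPed into the app as RX1 to RX3. The app polls them for packets and does some work.]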
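The map-then-poll pattern can be illustrated with a toy ring buffer. This is a sketch under loud assumptions: the "NIC" is simulated by a `produce` function in the same process, the ring lives in anonymous memory created with `mmap.mmap(-1, ...)` and not in a buffer exported by a real driver, and the slot layout, names and sizes are all invented for the example.

```python
import mmap
import struct

SLOTS = 8                         # number of packet slots in the ring
SLOT_SIZE = 64                    # 2-byte length prefix + up to 62 bytes of data
HEADER = struct.Struct("<II")     # head (producer index), tail (consumer index)


def make_ring():
    """Allocate the shared region: header followed by the packet slots."""
    return mmap.mmap(-1, HEADER.size + SLOTS * SLOT_SIZE)


def produce(ring, payload):
    """Simulated NIC side: write one packet, or return False if the ring is full."""
    head, tail = HEADER.unpack_from(ring, 0)
    if head - tail == SLOTS:
        return False
    if len(payload) > SLOT_SIZE - 2:
        raise ValueError("payload too large for one slot")
    off = HEADER.size + (head % SLOTS) * SLOT_SIZE
    struct.pack_into("<H", ring, off, len(payload))
    ring[off + 2:off + 2 + len(payload)] = payload
    HEADER.pack_into(ring, 0, head + 1, tail)
    return True


def poll(ring, budget):
    """Application side: fetch up to budget packets without any interrupt."""
    head, tail = HEADER.unpack_from(ring, 0)
    packets = []
    while tail != head and len(packets) < budget:
        off = HEADER.size + (tail % SLOTS) * SLOT_SIZE
        (length,) = struct.unpack_from("<H", ring, off)
        packets.append(bytes(ring[off + 2:off + 2 + length]))
        tail += 1
    HEADER.pack_into(ring, 0, head, tail)
    return packets


ring = make_ring()
produce(ring, b"pkt1")
produce(ring, b"pkt2")
received = poll(ring, budget=16)
for pkt in received:
    print("do some work on", pkt)    # application-specific processing
```

A real application would call `poll` in a tight loop on a dedicated core, which trades CPU time for the absence of interrupts, system calls and copies.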
  27. Intel DPDK
    • Drops interrupts in favor of polling to reduce overhead
    • Uses HugePages for receive buffers to reduce TLB misses
    • L3 forwarding performance with 64 byte packets (from Intel material)
      • Linux network stack: Xeon E5645 x 2 → 12.2Mpps
      • DPDK: Xeon E5645 x 1 → 35.2Mpps
      • DPDK: Next generation Intel Processor x 1 → 80Mpps
    • OpenvSwitch support
    • Supported NICs: Intel