Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Implementation of Cumo, a CUDA-aware version of...

Implementation of Cumo, a CUDA-aware version of Ruby/Numo

A report about Cumo project at Ruby Association Grant 2017.

Naotoshi Seo

July 07, 2018
Tweet

More Decks by Naotoshi Seo

Other Decks in Programming

Transcript

  1. Implementation of Ruby/Cumo, a CUDA-aware version of Ruby/Numo Naotoshi Seo

    July 07, 2018 Grant 2017 Report https://github.com/sonots/cumo
  2. Self Introduction • Naotoshi Seo @sonots • DeNA Co., Ltd.

    • CRuby committer • Recently working on development of DNN framework at Preferred Networks, Inc (出向) 2
  3. Outline 3 • Project Introduction • Cumo Features • Notices

    (or Difficulties) I met • Feature Proposals to Ruby • Project Achievement
  4. Why GPU? • GPU is bad at branching • GPU

    simplifies branch prediction and out-of-order mechanism instead. • GPU is suitable for matrix computation 7 • GPU is fast, and recently essential for Deep Learning • GPU is good at parallel computation • Order of magnitude is like 24 cores with CPU • 3,000 ~ 4,000 cores with GPU 1SPKFDU*OUSPEVDUJPO
  5. 4DJFOUJpD$PNQVUJOHJO1ZUIPO 9 NumPy CuPy PyCUDA Chainer TensorFlow MXNet Cython pybind11

    DNN Tensor CUDA binding Useful tools for writing bindings C++ C++ Python 1PTJUJPOPG$VNP
  6. 4DJFOUJpD$PNQVUJOHJO3VCZ 10 Numo/NArray Cumo RbCUDA Red-chainer TensorFlow.rb MXNet.rb Rubex /"

    DNN Tensor CUDA binding Useful tools for writing bindings (or NMatrix) PyCall Binding to Python C++ C++ Ruby 1PTJUJPOPG$VNP
  7. Outline 11 • Project Proposal • Cumo Features • Notices

    (or Difficulties) I met • Feature Proposals to Ruby • Project Achievement
  8. 12

  9. 13

  10. Cumo Features • Highly compatible with Ruby/Numo • Element-wise operations

    • Reduction operations • Dot operation using cuBLAS • CUDA memory pool • JIT compilation of user-defined functions 14
  11. Highly Compatible with Numo • Ruby/Numo users can easily switch

    into Cumo to leverage power of GPU 15 pOEOBNF SCca YBSHTTFEJa FT/VNP$VNPHa FTOVNPDVNPH JGHQV
 SFRVJSFDVNPOBSSBZ
 9VNP$VNP
 FMTF
 SFRVJSFOVNPOBSSBZ
 9VNP/VNP
 FOE
 B9VNP%'MPBU[FSPT   
 C9VNP%'MPBUPOFT   
 DB C $VNP'FBUVSFT
  12. Element-wise opeations • Element-wise is like matrix additions • All

    elements are independent • Easy to perform in parallel 16 1 2 3 4 5 6 2 3 4 5 6 7 + A B 3 5 7 9 11 13 = C       Thread
 IDs $VNP'FBUVSFT
  13. Reduction opeations • Like sum • sum([1,2,3,4]) #=> 10 •

    Elements are not independent • Not so easy to perform in parallel 17 $VNP'FBUVSFT
  14. Dot product (GEMM) using cuBLAS • Dot is more complicated

    than reduction • NVIDIA's cuBLAS library supports it as GEMM (GEneral matrix-matrix mulitplication) and fast • However, cuBLAS supports only f-contiguous (column major) although we write CRuby extensions in C (c-contiguous, raw-major) 20 1 2 3 4 5 6 7 8 9 1 4 7 2 5 8 3 6 9 C-contiguous F (Fortran) $VNP'FBUVSFT
  15. Using cuBLAS with c- contiguous data 21 A = [1,

    2, 3, 4, 5, 6] B = [1, 2, 3, 4, 5, 6] C = [9, 12, 15, 19, 26, 33, 29, 40, 51] A = [1, 2, 3, 4, 5, 6] B = [1, 2, 3, 4, 5, 6] C = [9, 12, 15, 19, 26, 33, 29, 40, 51] $DPOUJVHPVT 3PXNBKPS 'DPOUJVHPVT $PMVNONBKPS No data copy, changing only attributes (shape) https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html %PUQSPEVDU
  16. Why We Need Memory Pool 23 $6%".FNPSZ1PPM • cudaMalloc /

    cudaFree makes slow • memory allocation / free themselves are slow • cudaMalloc synchronizes CPU and GPU CPU GPU Free Kernel1 cudaFree synchronize Idle Kernel2 something Idle cudaMalloc synchronize Malloc Launch Launch
  17. Memory Pool 24 $6%".FNPSZ1PPM • Cache to memory pool •

    Avoid cudaMalloc / Free as much as possible
  18. CUDA Memory Pool 25 512 1024 1536 2048 2560 ….

    Pop 512 2048 512 1024 1536 2048 2560 …. Push (1) (2) Split next prev use 1. Round up memory size by 512 2. cudaMalloc if no block is available 3. Push to arena intead of cudaFree 4. Pop from arena if a free block is available instead of cudaMalloc Implemented Best-fit with Coalescing (BFC), which is the one used in malloc(3) $6%".FNPSZ1PPM
  19. 26 • Cumo supports users to write their own CUDA

    kernel on Ruby • JIT compile using NVRTC (NVIDIA Runtime Compilation), and caches it on file system. JIT compiling user-defined functions (planned) kernel = Cumo::ElementwiseKernel.new(
 'float32 x, float32 y, float32 z',
 'float32 w', # output type
 'w = (x * y) + z;', # CUDA code
 'my_kernel')
 w = kernel.call(x, y, z) $VNP'FBUVSFT
  20. Element-wise kernel 4J[F /VNP $VNP ?   ? 

     ?   ?   ?   a = Xumo::Float32.ones(size) b = Xumo::Float32.ones(size) a + b 40 times faster for size of 10^8 28 Smaller is better UIJT Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100 (AWS p3 xlarge) 1FSGPSNBODF$PNQBSJTPOXJUI/VNP
  21. Dot product 29 4J[F /VNP $VNP ?   ?

      ?   ?   ?   a = Xumo::Float32.ones(100, size/100) b = Xumo::Float32.ones(size/100, 100) a.dot(b) 2800 times faster for size of 10^8 UIJT ※ Numo without Numo/Linalg Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100 (AWS p3 xlarge) Smaller is better 1FSGPSNBODF$PNQBSJTPOXJUI/VNP
  22. red-chainer mnist example 30 • 20 times faster w/o memory

    pool • 75 times faster w/ memory pool Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100 (AWS p3 xlarge) 1FSGPSNBODF$PNQBSJTPOXJUI/VNP
  23. Outline 31 • Project Proposal • Cumo Features • Notices

    (or Difficulties) I met • Feature Proposals to Ruby • Project Achievement
  24. Notices (or Difficulties) I met • GPU unfriendness with GC

    • Difficulties compiling CUDA kernels • Lack of mkmf features • Incompatibility with Numo is required in reduction kernels for performance • Broadcast operations were slow 32
  25. GPU unfriendness with GC • One criteria to perform GC

    in Ruby is main memory usage (malloc_limit) • GPU memory usage is not taken into account • In the case of CuPy, because Python uses reference counting, we could release GPU memory immediately after the array object is not referenced anymore. 33 def add
 a = Cumo::DFloat.ones(3, 5)
 b = Cumo::DFloat.ones(3, 5)
 a + b end c = add a and b are not immediately freed (16VOGSJFOEOFTTXJUI($
  26. GPU unfriendness with GC (2) • (Partial) Solution • Added

    NArray#free to release memory to GPU on user-desired timing • Future work? • Something like NSAutoreleasePool to release all (or restricted) objects created inside a scope. 34 def add
 a = Cumo::DFloat.ones(3, 5)
 b = Cumo::DFloat.ones(3, 5)
 c = a + b a.free; b.free c end c = add a and b are immediately freed NSAutoreleasePool *pool = \ [[NSAutoreleasePool alloc] init]; NSObject *obj = \ [[[NSObject alloc] init] autorelease]; .... [pool release]; (16VOGSJFOEOFTTXJUI($
  27. 35 • Need to use nvcc (NVIDIA CUDA Compiler) instead

    of gcc to compile CUDA kernels. • However, mkmf supports to specify only CC and CXX compilers (no .cu file) • Solution: Made a wrapper ruby script • For files with .cu extensions, use nvcc • For files with .c extensions, use gcc Lack of mkmf features %J⒏DVMUJFTDPNQJMJOH$6%"LFSOFMT
  28. 36 • Numo returns a Ruby numeric object for reduction

    kernels (for cases of 0-dimensional NArray). • In Cumo, needs to copy GPU memory to host memory to create a Ruby nemeric object. • It results in synchronization with CPU. • Solution: Introduced partial incompatibility with Numo to return 0-dimensional NArray. Incompatibility with Numo is required in reduction for performance Numo::Int64.ones(2, 3).sum #=> 6
 Cumo::Int64.ones(2, 3).sum #=> Cumo::Int64#shape=[] 6 Returns a 0-dimensional NArray instead of a Ruby numeric object to avoid CPU and GPU synchronization. *ODPNQBUJCJMJUZXUJI/VNPJTSFRVJSFEJO3FEVDUJPO
  29. 1 2 3 37 • Broadcast Broadcast operations were slow

    #SPBEDBTUPQFSBUJPOTXFSFTMPX 1 2 3 4 5 6 7 8 9 10 11 12 + 1 2 3 = 2 4 6 5 7 9 8 10 12 11 13 15 1 2 3 1 2 3 4 x 3 1 x 3 4 x 3
  30. 38 How Numo Treats Broadcast Example) 1000 x 3 array

    + 1 x 3 array user loop: loop for 3 narray loop: loop for 1000 int nd = 1; int shape[] = {1000}; for (int i=0; i<nd;++i) { for (int j=0; j<shape[i]; ++j) { (*(nf->func))(&(lp->user)); } } int size = 3; for (int i=0; i<size;++i) { p3[i] = p1[i] + p[2]; } #SPBEDBTUPQFSBUJPOTXFSFTMPX
  31. 39 Launches Many CUDA Kernels user loop: loop for 3

    narray loop: loop for 1000 int nd = 1; int shape[] = {1000}; for (int i=0; i<nd;++i) { for (int j=0; j<shape[i]; ++j) { (*(nf->func))(&(lp->user)); } } __global__ void my_kernel( int* p3, int* p2, int* p1) { int i = blockIdx.x * blockDim.x + threadIdx.x; p3[i] = p1[i] + p2[i]; } Launches CUDA Kernels 1000 times. • In first implementation of Cumo, modified user loop implementation to CUDA kernels #SPBEDBTUPQFSBUJPOTXFSFTMPX
  32. 40 How Slow launching CUDA kernels Type Time(%) Time Calls

    Avg Min Max Name GPU activities: 99.89% 19.439ms 1000 19.439us 18.880us 21.312us cumo_sfloat_add API calls: 27.23% 330.78ms 13 25.445ms 35.083us 68.418ms cudaDeviceSynchronize 26.34% 319.98ms 1 319.98ms 319.98ms 319.98ms cuCtxCreate 25.32% 307.66ms 1477 208.30us 13.408us 275.62ms cudaMallocManaged 2.58% 18.703ms 1002 18.665us 16.184us 216.70us cudaLaunch nvprof • 18 micro second • Time to take cudaLaunch is almost equivalent with adding two arrays of 500,000 elements. • Also, there is a limit of CUDA queue size, e.g., 1,024. #SPBEDBTUPQFSBUJPOTXFSFTMPX
  33. 41 • Finally, stopped using ndloop Solution: Stop using ndloop

    /EMPPQJTTMPXGPS$VNP #SPBEDBTUPQFSBUJPOTXFSFTMPX a = Cumo::DFloat.ones(1000, 768)
 b = Cumo::DFloat.ones(1000, 1)
 a + b 56 times faster
  34. Outline 42 • Project Proposal • Cumo Features • Notices

    (or Difficulties) I met • Feature Proposals to Ruby • Project Achievement
  35. Inplace Math Operations • a += b is an abridged

    notation of a = a + b • Imagine a is a large matrix requiring 1GB. • a += b needs to allocate a new 1GB matrix. • Want to redefine for Cumo::NArray objects. • Current compromise: • Python allows to redefine +=. 44 https://bugs.ruby-lang.org/issues/14701 a.inplace + b 'FBUVSF1SPQPTBMTUP3VCZ
  36. Temporary Variable • In python, we can find a variable

    is a temporary or not by seeing reference counts • In NumPy, • is faster than • because (x + 1) is a temporary variable and new memory is not required to compute (x + 1) + 1 45 https://bugs.ruby-lang.org/issues/14710 y = x + 1 + 1 y = x + 1 y + 1 'FBUVSF1SPQPTBMTUP3VCZ
  37. Outline 46 • Project Proposal • Cumo Features • Notices

    (or Difficulties) I met • Feature Proposals to Ruby • Project Achievement
  38. Future Works • Support cuDNN for high performance convolutional networks

    • Support Float16 • Conversion between Numo::NArray and Cumo::NArray • CI server ... 48
  39. Supported Functions List 49 4VQQPSUFE'VODUJPOT-JTU - << atan2 eq floor

    log10 min_index rms stddev -@ >> atanh erf ge (>=) log1p minimum round store [] | cbrt erfc gemm log2 mulsum seq sum []= ~ ceil exp gt (>) logseq ne sign tan * acos coerce_cast exp10 hypot lt (<) nearly_eq signbit tanh ** acosh conj exp2 im max poly sin trunc / allocate copysign expm1 inspect max_index prod sinc var & asin cos extract ldexp maximum ptp sinh % asinh cosh eye le (<=) mean reciprocal sqrt ^ atan divmod fill log min rint square * 88 methods Int8, Int16, Int32, Int64, Uint8, Uint16, Uint32, Uint64,
 SFloat (float), DFloat (double), SComplex, DComplex mixed
  40. Not Yet 50 4VQQPSUFE'VODUJPOT-JTU abs isnan set_real arg isneginf sort

    bincount isposinf sort_index clip median cumprod minmax cumsum modf frexp rand imag rand_norm isfinite real isinf set_imag [] count_false []= count_true & eq ^ extract | fill ~ mask all? none? any? store coerce_cast where copy where2 * 20 methods (most of all) IntXX, FloatXX, ComplexXX mixed Bit * 23 methods
  41. Acknowledgements • Ruby Association for Grant • Money - GPU

    machines cost much • Time keeper • Motivation • @mrkn for his mentoring on the grant • @masa16 for answering my questions about Numo • @hatappi and @naitoh for their work of red-chainer • red-data-tools org and Speee, Inc for hosting meetup. • Preferred Networks, Inc and developers (including me) of Chainer/CuPy for reference implementation • And, my wife for giving time to develop 51