Red Chainer and Cumo: Practical Deep Learning in Ruby at RubyKaigi 2019

Naotoshi Seo

April 20, 2019

Transcript

  1. Outline
    • Current status of DNN and Scientific Computing in Ruby
    • Introduction to Red Chainer
    • Red Chainer's problem, approach to the solution
    • second part => Cumo
  2. self.introduction
    • Yusaku Hatanaka (@hatappi)
    • Red Data Tools member.
    • I'm creating a DNN framework for Ruby!
    • Merpay, Inc. from Jan 2019
  3. Outline
    • Current status of DNN and Scientific Computing in Ruby
    • Introduction to Red Chainer
    • Red Chainer's problem, approach to the solution
    • second part => Cumo
  4. Outline
    • Current status of DNN and Scientific Computing in Ruby
    • Introduction to Red Chainer
    • Red Chainer's problem, approach to the solution
    • second part => Cumo
  5. About Red Chainer
    • Deep Learning (DNN) framework for Ruby.
    • Developed within Red Data Tools (https://red-data-tools.github.io/)
    • Red Data Tools is a project that provides data processing tools for Ruby.
    • A port of Chainer (Python) to Ruby.
    • I want you to have fun doing Deep Learning in Ruby.
  6. MNIST
    [Figure: network diagram] 28 x 28 input image → 784 values → Fully Connected (1000 units) → ReLU → Fully Connected (1000 units) → ReLU → Fully Connected (10 units) → softmax cross entropy. Example output values shown in the figure: -8.644561, -10.105622, 2.354139.
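The last stage of the diagram, softmax cross entropy, can be sketched in plain Ruby. This is only an illustration of the formula, not Red Chainer's implementation; the helper names `softmax` and `softmax_cross_entropy` and the three-class logits are assumptions made for the example.

```ruby
# Softmax cross entropy for a single sample, in plain Ruby (no Numo needed).
# `logits` are the raw scores of the last fully connected layer;
# `label` is the index of the correct class.
def softmax(logits)
  max = logits.max                              # subtract the max for numerical stability
  exps = logits.map { |v| Math.exp(v - max) }
  total = exps.sum
  exps.map { |e| e / total }
end

def softmax_cross_entropy(logits, label)
  -Math.log(softmax(logits)[label])
end

# Hypothetical 3-class scores, chosen only for illustration.
logits = [1.0, 2.0, 3.0]
puts softmax(logits).inspect
puts softmax_cross_entropy(logits, 2)
```

The loss is small when the correct class already has the largest score, and grows as the network puts probability on the wrong classes.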
  7. Red Chainer’s history
    • 2017/08 First commit
    • 2017/10 First release
    • 2018/05 Convolutional Neural Network
    • 2019/03 Support for Chainer v3
  8. Outline
    • Current status of DNN and Scientific Computing in Ruby
    • Introduction to Red Chainer
    • Red Chainer's problem, approach to the solution
    • second part => Cumo
  9. Collaboration with other DNN frameworks
    • For example, models and learned parameters already exist for Chainer.
    • But you cannot use them directly in Red Chainer.
  10. What’s ONNX
    • ONNX is the Open Neural Network Exchange format.
    • A community project created by Facebook and Microsoft.
    • ONNX's goal is to make it possible for developers to use the right combinations of tools for their project.
    • Contents are expressed in Protocol Buffers.
  11. Protocol Buffers
    • Released to the open source community by Google in 2008
    • A language-neutral, platform-neutral, extensible mechanism for serializing structured data.
  12. ONNX Intermediate Representation
    • ONNX contains a list of parameters that make up the graph and a list of each compute node.
    • Learned parameters are stored in binary
    • Can be converted to Numo::NArray with Numo::NArray.from_binary
    • detail: https://github.com/onnx/onnx/blob/master/docs/IR.md
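As a rough sketch of what decoding binary parameters involves, the plain-Ruby snippet below unpacks a byte string into nested arrays. It assumes the bytes are little-endian float32 values in row-major order, which is how ONNX stores raw tensor data; `decode_float32` is a hypothetical helper written for this example and is not part of Numo or onnx-red-chainer. In real code, Numo::NArray.from_binary does this job.

```ruby
# Decode little-endian float32 bytes into nested Ruby arrays of shape `dims`.
# Plain Ruby only; a stand-in for what a numerical library would do natively.
def decode_float32(raw_data, dims)
  flat = raw_data.unpack('e*')                  # 'e' = little-endian single-precision float
  expected = dims.inject(1, :*)
  raise ArgumentError, "expected #{expected} values, got #{flat.size}" unless flat.size == expected

  # Rebuild the row-major layout by slicing from the innermost dimension outwards.
  dims.drop(1).reverse.inject(flat) { |acc, d| acc.each_slice(d).to_a }
end

# Hypothetical 2 x 3 weight matrix serialized as raw bytes.
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0].pack('e*')
p decode_float32(raw, [2, 3])
```

The shape has to come from elsewhere (in ONNX, the tensor's dims field), because the raw bytes carry no layout information of their own.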
  13. ONNX visualization
    • You can also visualize models from ONNX files!
    • Netron is a viewer for neural network, deep learning and machine learning models.
  14. Using ONNX with Ruby
    • menoh-ruby
    • Menoh (C++) is a DNN inference library.
    • You can run inference in Ruby using ONNX!
  15. What’s Automatic generation of Ruby code
    • github.com/hatappi/onnx-red-chainer
    • Outputs Ruby code for the model, together with its learned parameters, for Red Chainer from an ONNX file
    • Use the model and learned parameters with Red Chainer for inference
    • You may also change the model yourself and train it anew!
  16. Why Automatically Generate Ruby Code?
    • Red Chainer is for having fun with deep learning in Ruby.
    • I want you to write DNN models in Red Chainer, starting from an existing model.
    • Of course, you can port a model manually.
    • Generating it automatically makes it easy to get started.
  17. Summary (Red Chainer)
    • Red Chainer is a framework for having fun with deep learning in Ruby.
    • Currently supports Chainer v3; development toward v4 will continue.
    • Models and learned parameters from other frameworks can be used with Red Chainer by using onnx-red-chainer
  18. Self Introduction
    • Naotoshi Seo @sonots
    • The author of Cumo, a CUDA-aware numerical library for Ruby.
    • CRuby and Chainer committer
    • ZOZO Technologies, Inc. from Jan 2019.
    • Started the MLOps team this April.
  19. Outline
    • Project Introduction of Cumo (Review of Last Year Presentation)
    • What's new to Cumo
      • Red Chainer Integration
      • Support Fast Convolutional Neural Networks with cuDNN
    • Introduction of ChainerX
  20. What is Cumo? (Project Introduction)
    • (NVIDIA) GPU version of Ruby/Numo
    • Pronounced like koo-mo
    • Ruby Association Grant 2017: https://www.ruby.or.jp/en/news/20171206
    • https://github.com/sonots/cumo-logo
  21. Why GPU?
    • GPU is fast, and recently essential for Deep Learning
    • GPU is good at parallel computation
      • Order of magnitude: about 24 cores on a CPU
      • 3,000 ~ 4,000 cores on a GPU
    • GPU is bad at branching
      • GPU simplifies branch prediction and out-of-order mechanisms instead.
    • GPU is suitable for matrix computation
  22. CUDA Memory Pool
    1. Round up the memory size to a multiple of 512
    2. cudaMalloc if no block is available
    3. Push to the arena instead of cudaFree
    4. Pop from the arena instead of cudaMalloc if a free block is available
    • Implemented Best-fit with Coalescing (BFC), which is the one used in malloc(3)
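The four pooling steps above can be modelled with a toy class in plain Ruby. This is a simplified sketch, not Cumo's implementation: device memory is simulated with integer addresses, `ToyMemoryPool` and `fake_cuda_malloc` are hypothetical names, and free blocks are matched by exact rounded size only, so the splitting and merging that Best-fit with Coalescing performs is left out.

```ruby
# Toy model of a memory pool; "device memory" is simulated with integers.
class ToyMemoryPool
  ALLOCATION_UNIT = 512

  attr_reader :malloc_count

  def initialize
    @arena = Hash.new { |hash, size| hash[size] = [] }  # rounded size => free addresses
    @malloc_count = 0
    @next_address = 0
  end

  # Step 1: round the requested size up to a multiple of 512.
  def round_up(size)
    ((size + ALLOCATION_UNIT - 1) / ALLOCATION_UNIT) * ALLOCATION_UNIT
  end

  # Steps 2 and 4: reuse a free block if one exists, otherwise "cudaMalloc".
  def alloc(size)
    rounded = round_up(size)
    address = @arena[rounded].pop || fake_cuda_malloc(rounded)
    [address, rounded]
  end

  # Step 3: push the block back to the arena instead of calling cudaFree.
  def free(block)
    address, rounded = block
    @arena[rounded].push(address)
  end

  private

  # Stand-in for cudaMalloc: hands out increasing addresses and counts calls.
  def fake_cuda_malloc(size)
    @malloc_count += 1
    address = @next_address
    @next_address += size
    address
  end
end

pool = ToyMemoryPool.new
first = pool.alloc(1000)    # rounded to 1024, needs a fresh allocation
pool.free(first)
second = pool.alloc(600)    # also rounds to 1024, so the freed block is reused
puts pool.malloc_count
```

Rounding is what makes reuse likely: requests of 600 and 1000 bytes land in the same 1024-byte bucket, so the second request never reaches the (expensive) allocator.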
  23. Element-wise Operation (Review of Last Year)
    • 40 times faster for size of 10^8
    • Benchmark code:
      a = xm::Float32.ones(size)
      b = xm::Float32.ones(size)
      a + b
    • Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
    [Table: columns Size, Numo ms, Cumo ms; values not recoverable from the transcript]
  24. Dot product
    • 831 times faster than Numo w/ BLAS for size of 10^8
    • Benchmark code:
      a = xm::Float32.ones(100, size/100)
      b = xm::Float32.ones(size/100, 100)
      a.dot(b)
    • Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
    [Table: columns Size, Numo ms, Numo BLAS, Cumo ms; values not recoverable from the transcript]
  25. Red-chainer mnist example
    • 380 sec/epoch → 5 sec/epoch
    • 75 Times Faster !!
    • Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
  26. Last Year's Way (Red Chainer Integration)
    • Cumo is highly compatible with Numo, so sed is enough to make a Numo application work with Cumo.
      sed -i -e 's/Numo/Cumo/g' -e 's/numo/cumo/g' *.rb
    • Before:
      require 'numo/narray'
      a = Numo::SFloat.zeros(2, 3)
      b = Numo::SFloat.ones(2, 3)
      a + b
    • After:
      require 'cumo/narray'
      a = Cumo::SFloat.zeros(2, 3)
      b = Cumo::SFloat.ones(2, 3)
      a + b
    • It works, but it makes no sense to have users of red-chainer convert red-chainer itself.
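The sed-based conversion amounts to a plain text substitution, which can be sketched in Ruby with String#gsub. The helper name `numo_to_cumo` is hypothetical; the snippet only rewrites a string in memory and does not touch any files.

```ruby
# Rewrite Numo references in a piece of Ruby source text to Cumo references.
# Pure text substitution: it has no understanding of the code it rewrites.
def numo_to_cumo(source)
  source.gsub('Numo', 'Cumo').gsub('numo', 'cumo')
end

before = <<~RUBY
  require 'numo/narray'
  a = Numo::SFloat.zeros(2, 3)
RUBY

puts numo_to_cumo(before)
```

Because the substitution is blind, it also rewrites comments and strings that mention Numo, which is one more reason a programmable switch is preferable to rewriting the library's source.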
  27. New Programmable Way
      require 'chainer'
      gpu = Chainer::CUDA.available? ? 0 : -1
      xm = Chainer::Device.create(gpu).xm #=> Cumo
      a = xm::SFloat.zeros(2, 3)
      b = xm::SFloat.ones(2, 3)
      a + b

      Chainer.get_array_module(a) #=> Cumo
  28. Function CPU/GPU Branching
      class Convolution2DFunction < Chainer::Function
        def forward_cpu(inputs)
          x, w, b = inputs
          kh, kw = w.shape[2], w.shape[3]
          @col = Chainer::Utils::Conv.im2col(x, ...)
          y = Chainer::Utils::Math.tensordot(@col, ...)
          y += b if b
          [y.transpose(0, 3, 1, 2)]
        end

        def forward_gpu(inputs)
          x, w, b = inputs
          [x.conv(w, b, ...)]
        end
      end
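How a base class might choose between forward_cpu and forward_gpu can be sketched without any GPU. Everything here is a stand-in made up for the example: `FakeNumo` and `FakeCumo` play the roles of the CPU and GPU array modules, and `ToyFunction` is not Red Chainer's Chainer::Function; the real dispatch logic may differ.

```ruby
# Stand-ins for the CPU (Numo) and GPU (Cumo) array modules; hypothetical.
module FakeNumo
  Vector = Struct.new(:values)
end

module FakeCumo
  Vector = Struct.new(:values)
end

class ToyFunction
  # Call forward_gpu only when every input lives in the GPU stand-in module.
  def forward(inputs)
    if inputs.all? { |x| x.is_a?(FakeCumo::Vector) }
      forward_gpu(inputs)
    else
      forward_cpu(inputs)
    end
  end
end

class ToyAdd < ToyFunction
  # Returns a tag naming the path taken, plus the element-wise sum.
  def forward_cpu(inputs)
    a, b = inputs
    [:cpu, FakeNumo::Vector.new(a.values.zip(b.values).map { |x, y| x + y })]
  end

  def forward_gpu(inputs)
    a, b = inputs
    [:gpu, FakeCumo::Vector.new(a.values.zip(b.values).map { |x, y| x + y })]
  end
end

path, result = ToyAdd.new.forward([FakeCumo::Vector.new([1, 2]), FakeCumo::Vector.new([3, 4])])
puts path
puts result.values.inspect
```

The point of the split is that subclasses such as the convolution function can keep a portable CPU path and still hand the whole operation to a tuned GPU kernel when one is available.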
  29. Convolutional Neural Networks (CNN) (Fast CNN with cuDNN)
    • In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a CNN-based method called AlexNet won first place, and DNNs became famous.
    • Red Chainer must support fast convolution; otherwise, it could fairly be called useless.
    • https://qiita.com/yu4u/items/7e93c454c9410c4b5427
  30. What is cuDNN
    • The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
    • Supports highly tuned Convolution, Batch Normalization, Pooling, etc.
    • cuDNN accelerates widely used deep learning frameworks, including Caffe, Caffe2, Chainer, Keras, MATLAB, MxNet, TensorFlow, and PyTorch.
    • And now, Red Chainer.
    • https://developer.nvidia.com/cudnn
  31. Cumo supports
    • conv(x, w, b, stride, pad)
    • conv_transpose
    • conv_grad_w
    • batch_norm(x, w, b, stride, pad)
    • batch_norm_backward
    • (max|avg)_pool
    • (max|avg)_pool_backward
  32. More Algorithms (Convolution)
    • Direct
    • Im2col
    • FFT
    • Winograd
    • Which is best? It depends on input tensor size and available memory
  33. Auto tune
    • cuDNN supports auto-tuning of algorithms
      • cudnnFindConvolutionForwardAlgorithm
      • cudnnFindConvolutionBackward(Data|Filter)Algorithm
    • They try all algorithms on the input data and find the fastest one.
    • Cumo's convolution calls them on the first call and caches the results.
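The "try everything once, then cache the winner" idea can be sketched in plain Ruby. This is not how Cumo or cuDNN implement it: `ToyAutoTuner`, its method names, and the injectable `timer:` argument are hypothetical, and the candidates here are ordinary Ruby lambdas rather than convolution kernels. The cache key is the input shape, on the assumption that the best algorithm depends on the shape rather than on the values.

```ruby
# Toy auto-tuner: time every candidate once per input shape, cache the winner.
class ToyAutoTuner
  attr_reader :tuning_runs

  def initialize(candidates, timer: nil)
    @candidates = candidates                    # { name => callable }
    @cache = {}                                 # shape => name of the fastest candidate
    @tuning_runs = 0
    @timer = timer || method(:wall_clock_seconds)
  end

  # Run the cached winner for this shape, tuning first if the shape is new.
  def call(shape, input)
    name = (@cache[shape] ||= find_fastest(input))
    @candidates.fetch(name).call(input)
  end

  def chosen_for(shape)
    @cache[shape]
  end

  private

  # Time each candidate on the real input and return the fastest one's name.
  def find_fastest(input)
    @tuning_runs += 1
    @candidates.min_by { |name, fn| @timer.call(name) { fn.call(input) } }.first
  end

  def wall_clock_seconds(_name)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  end
end

# Two interchangeable ways to sum an array stand in for convolution algorithms.
candidates = {
  inject: ->(xs) { xs.inject(0) { |acc, v| acc + v } },
  sum:    ->(xs) { xs.sum }
}
tuner = ToyAutoTuner.new(candidates)
puts tuner.call([4], [1, 2, 3, 4])
puts tuner.call([4], [5, 6, 7, 8])
puts tuner.tuning_runs
```

The first call for a shape pays for running every candidate; later calls with the same shape go straight to the cached choice, which is why the cost is amortized over training iterations.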
  34. Convolution
    • 3906 times faster than Numo for size of 2^10
    • Benchmark code:
      x = xm::Float32.ones(32, 3, size, size)
      w = xm::Float32.ones(2, 3, 3, 3)
      b = xm::Float32.ones(2)
      y = F.convolution_2d(x, w, b, stride: 2, pad: 1)
    • Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
    [Table: columns Size, Numo ms, Cumo ms, Ratio; values not recoverable from the transcript]
  35. Red-chainer cifar example (resnet-18)
    • 0.12 iters/sec → 3.8 iters/sec
    • 23 days → 17 hours to finish
    • 32 Times Faster !
    • Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
    • TODO: red-chainer currently has some logic that computes in Ruby. Once that is removed, we should be able to achieve better performance.
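The "32 times" figure can be cross-checked from the numbers on the slide with a few lines of arithmetic. The constant names are made up for this example; the inputs are exactly the values quoted above.

```ruby
# Cross-check the reported speedup from the two pairs of numbers on the slide.
CPU_ITERS_PER_SEC = 0.12
GPU_ITERS_PER_SEC = 3.8
CPU_HOURS = 23 * 24      # 23 days expressed in hours
GPU_HOURS = 17

THROUGHPUT_RATIO = GPU_ITERS_PER_SEC / CPU_ITERS_PER_SEC
WALL_CLOCK_RATIO = CPU_HOURS / GPU_HOURS.to_f

puts THROUGHPUT_RATIO.round(1)
puts WALL_CLOCK_RATIO.round(1)
```

Both ratios come out slightly on either side of 32, so the headline number is consistent with the throughput and the total training time.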
  36. ChainerX = Numpy-Like ndarray + autograd in C++
    • Speed: written in C++ w/ a thin Python binding
      • = far less host-side overhead
    • Environment Support: with pluggable device backends
      • = open to quickly add new device support
    • Quick Deployment: with pure C++ API
      • = available for Python-free apps
  37. ChainerX Python API
      import chainerx as chx
      x = chx.ones((2, 3), dtype=chx.float32, device='cuda:0')
      y = (x + 1).require_grad()
      z = chx.exp(y).sum()
      z.backward()
    • Numpy-like API
    • Provides NN functions such as conv, batch_norm
    • Supports multiple devices
    • Arrays become differentiable with require_grad()
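For the computation on this slide, z = sum(exp(y)) with y = x + 1, the gradient of z with respect to each element of y is exp(y). The plain-Ruby sketch below checks that formula against central finite differences; it does not use ChainerX, and the helper names `objective`, `analytic_gradient`, and `numeric_gradient` are made up for the example.

```ruby
# z = sum(exp(y)) over all elements of a nested array `y`.
def objective(y)
  y.flatten.sum { |v| Math.exp(v) }
end

# Analytic gradient: dz/dy[i][j] = exp(y[i][j]).
def analytic_gradient(y)
  y.map { |row| row.map { |v| Math.exp(v) } }
end

# Central finite differences, perturbing one element at a time.
def numeric_gradient(y, eps = 1e-6)
  y.each_index.map do |i|
    y[i].each_index.map do |j|
      plus = y.map(&:dup)
      minus = y.map(&:dup)
      plus[i][j] += eps
      minus[i][j] -= eps
      (objective(plus) - objective(minus)) / (2 * eps)
    end
  end
end

x = Array.new(2) { Array.new(3, 1.0) }       # same shape as ones((2, 3))
y = x.map { |row| row.map { |v| v + 1 } }
puts analytic_gradient(y).inspect
puts numeric_gradient(y).inspect
```

Automatic differentiation produces the analytic values directly by walking the recorded computation backwards, without the extra function evaluations that finite differences need.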
  38. ChainerX Ruby API
      require 'chainerx'
      x = ChainerX.ones([2, 3], dtype: ChainerX::Float32, device: 'cuda:0')
      y = (x + 1).require_grad
      z = ChainerX.exp(y).sum
      z.backward
    • Numpy-like API to Ruby
    • Reuse core codes
  39. Summary (Cumo)
    • Project Introduction of Cumo (Review of Last Year Presentation)
    • Red Chainer Integration
    • Support Fast Convolutional Neural Networks with cuDNN
      • 32 times faster!
    • Introduction of ChainerX
      • Ruby binding implementation is welcome!
  40. Acknowledgements
    • Ruby Association
      • 2017 Grant and GPU server
    • My company, ZOZO Technologies, for travel support.
    • @hatappi and @naitoh for their work on red-chainer, Numo, and Cumo
    • red-data-tools org and Speee, Inc for hosting meetups.
    • Preferred Networks, Inc and developers of Chainer/CuPy/ChainerX (including me) as a reference implementation
    • And my wife, for giving me time to develop