
Distributed TensorFlow
For HCJ 2016 LT session

Kazunori Sato
February 07, 2016

Transcript

  1. +Kazunori Sato (@kazunori_279), Kaz Sato, Staff Developer Advocate and
     Tech Lead for Data & Analytics, Cloud Platform, Google Inc.
  2. Jupiter network: 40 G ports, 10 G x 100 K = 1 Pbps total, Clos topology,
     Software Defined Network
  3. Borg: no VMs, pure containers. Manages 10 K machines per cell. DC-scale
     proactive job scheduling (CPU, mem, disk IO, TCP ports). Paxos-based
     metadata store.
  4. What is TensorFlow? Google's open source library for machine intelligence
     • tensorflow.org launched in Nov 2015 • The second generation (after
     DistBelief) • Used in many production ML projects at Google
  5. What is TensorFlow? • Tensor: N-dimensional array ◦ Vector: 1 dimension
     ◦ Matrix: 2 dimensions • Flow: data flow computation framework (like
     MapReduce) • TensorFlow: a data flow based numerical computation framework
     ◦ Best suited for Machine Learning and Deep Learning ◦ Or any other HPC
     (High Performance Computing) applications
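
     To make the tensor and data flow ideas concrete, here is a minimal sketch
     (not from the deck) using the same TensorFlow 0.x-era API as the code on
     the later slides:

        import tensorflow as tf

        # a 1 x 2 and a 2 x 1 matrix: both are tensors (N-dimensional arrays)
        a = tf.constant([[1.0, 2.0]])
        b = tf.constant([[3.0], [4.0]])

        # matmul only adds a node to the data flow graph; nothing runs yet
        c = tf.matmul(a, b)

        # the graph executes only when a session runs it
        with tf.Session() as sess:
            print(sess.run(c))   # [[ 11.]]
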
  6. Yet another dataflow system, with tensors. (Diagram: a graph of ops such
     as MatMul, Add, Relu and Xent over weights, biases, examples and labels.)
     Edges are N-dimensional arrays: Tensors.
  7. Yet another dataflow system, with state. (Diagram: Mul and Add ops combine
     gradients with a learning rate, and a -= op writes the result back into
     'biases'.) 'Biases' is a variable. -= updates biases. Some ops compute
     gradients.
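
     The update in this diagram can also be spelled out by hand. The following
     is an illustrative sketch, not the deck's code (the deck uses an Optimizer
     on the next slides): a variable, an op that computes a gradient, and a -=
     update.

        import tensorflow as tf

        biases = tf.Variable(tf.zeros([10]))             # 'biases' is stateful
        x = tf.placeholder(tf.float32, [None, 10])
        loss = tf.reduce_sum(tf.square(x - biases))      # some scalar loss

        grad = tf.gradients(loss, [biases])[0]           # an op that computes gradients
        learning_rate = 0.01
        update = tf.assign_sub(biases, learning_rate * grad)   # biases -= lr * grad

        init = tf.initialize_all_variables()
        with tf.Session() as sess:
            sess.run(init)
            sess.run(update, feed_dict={x: [[1.0] * 10]})
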
  8. Simple Example
     # define the network
     import tensorflow as tf
     x = tf.placeholder(tf.float32, [None, 784])
     W = tf.Variable(tf.zeros([784, 10]))
     b = tf.Variable(tf.zeros([10]))
     y = tf.nn.softmax(tf.matmul(x, W) + b)
     # define a training step
     y_ = tf.placeholder(tf.float32, [None, 10])
     xent = -tf.reduce_sum(y_ * tf.log(y))
     step = tf.train.GradientDescentOptimizer(0.01).minimize(xent)
  9. Simple Example
     # load MNIST with the tutorial helper (added; the deck assumes `mnist` exists)
     from tensorflow.examples.tutorials.mnist import input_data
     mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
     # initialize session
     init = tf.initialize_all_variables()
     sess = tf.Session()
     sess.run(init)
     # training
     for i in range(1000):
         batch_xs, batch_ys = mnist.train.next_batch(100)
         sess.run(step, feed_dict={x: batch_xs, y_: batch_ys})
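
     A natural follow-up (not shown on the slide) is to check the trained model
     against the MNIST test set, reusing the graph and session built above:

        # accuracy: fraction of test images whose predicted digit matches the label
        correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
        print(sess.run(accuracy,
                       feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
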
  10. Portable • Training on: ◦ Data Center ◦ CPUs, GPUs, etc. • Running on:
      ◦ Mobile phones ◦ IoT devices
  11. Denso IT Lab: • Tokyo Institute of Technology's TSUBAME2 supercomputer
      with 96 GPUs • Perf gain: dozens of times (from: DENSO, GTC 2014, "Deep
      Neural Networks Level-Up Automotive Safety";
      http://www.titech.ac.jp/news/2013/022156.html) Preferred Networks +
      Sakura: • Distributed GPU cluster with InfiniBand for Chainer
      • In summer 2016
  12. Google Brain: Embarrassingly parallel for many years • "Large Scale
      Distributed Deep Networks", NIPS 2012 ◦ 10 M images on YouTube, 1.15 B
      parameters ◦ 16 K CPU cores for 1 week • Distributed TensorFlow: runs on
      hundreds of GPUs ◦ Inception / ImageNet: 40x with 50 GPUs ◦ RankBrain:
      300x with 500 nodes
  13. Distributed TensorFlow • CPU/GPU scheduling • Communications ◦ Local,
      RPC, RDMA ◦ 32/16/8 bit quantization • Cost-based optimization • Fault
      tolerance
  14. Distributed TensorFlow • Fully managed ◦ No major changes required
      ◦ Automatic optimization • With device constraints ◦ hints for
      optimization, e.g. /job:localhost/device:cpu:0
      /job:worker/task:17/device:gpu:3 /job:parameters/task:4/device:cpu:0
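
      As an illustration of these device strings, here is a minimal sketch. The
      tf.train.ClusterSpec / tf.train.Server API shown below shipped with the
      open-source distributed runtime after this talk, and the host names are
      made up, so treat it as an assumption rather than the deck's own code.

         import tensorflow as tf

         # describe the cluster's jobs and tasks (hypothetical host names)
         cluster = tf.train.ClusterSpec({
             "parameters": ["ps0.example.com:2222"],
             "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
         })
         server = tf.train.Server(cluster, job_name="worker", task_index=0)

         # device constraints: pin variables to the parameter job,
         # computation to a worker GPU
         with tf.device("/job:parameters/task:0/device:cpu:0"):
             weights = tf.Variable(tf.zeros([784, 10]))

         with tf.device("/job:worker/task:0/device:gpu:0"):
             x = tf.placeholder(tf.float32, [None, 784])
             logits = tf.matmul(x, weights)

         sess = tf.Session(server.target)   # session backed by this worker's runtime
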
  15. Model Parallelism vs Data Parallelism: Model Parallelism (split
      parameters, share training data); Data Parallelism (split training data,
      share parameters)
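
      A rough sketch of the model-parallel side (not from the deck): the
      parameters are split across two devices, while both devices see the same
      input batch.

         import tensorflow as tf

         x = tf.placeholder(tf.float32, [None, 784])   # shared training data

         with tf.device("/gpu:0"):                     # first half of the parameters
             W1 = tf.Variable(tf.zeros([784, 256]))
             h = tf.nn.relu(tf.matmul(x, W1))

         with tf.device("/gpu:1"):                     # second half of the parameters
             W2 = tf.Variable(tf.zeros([256, 10]))
             y = tf.nn.softmax(tf.matmul(h, W2))
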
  16. Data Parallelism • Google uses Data Parallelism mostly ◦ Dense: 10 - 40x
      with 50 replicas ◦ Sparse: 1 K+ replicas • Synchronous vs Asynchronous
      ◦ Sync: better gradient effectiveness ◦ Async: better fault tolerance
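
      And a rough sketch of the data-parallel side, with the same caveat that
      tf.train.replica_device_setter and tf.train.SyncReplicasOptimizer are
      assumptions about the later open-source API, not code from the deck: each
      worker replica builds the same graph over its own shard of the data,
      while the shared parameters live on a parameter-server job.

         import tensorflow as tf

         cluster = tf.train.ClusterSpec({
             "ps": ["ps0.example.com:2222"],
             "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
         })

         # place the shared variables on the "ps" job and the ops on this worker
         with tf.device(tf.train.replica_device_setter(
                 worker_device="/job:worker/task:0", cluster=cluster)):
             W = tf.Variable(tf.zeros([784, 10]))
             b = tf.Variable(tf.zeros([10]))
             x = tf.placeholder(tf.float32, [None, 784])
             y = tf.nn.softmax(tf.matmul(x, W) + b)
             # asynchronous training: each replica applies its gradients as they
             # arrive; for synchronous training, wrap the optimizer in
             # tf.train.SyncReplicasOptimizer
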
  17. Summary • TensorFlow ◦ Portable: works from data center machines to
      phones ◦ Distributed and proven: scales to hundreds of GPUs in production
      ▪ will be available soon!
  18. Resources • tensorflow.org • "TensorFlow: Large-Scale Machine Learning
      on Heterogeneous Distributed Systems", Jeff Dean et al., tensorflow.org,
      2015 • "Large Scale Distributed Systems for Training Neural Networks",
      Jeff Dean and Oriol Vinyals, NIPS 2015 • "Large Scale Distributed Deep
      Networks", Jeff Dean et al., NIPS 2012