of error gradient
◦ Very important hyperparameter
Dropout
◦ Equivalent to averaging many neural networks
◦ Randomly zero out units for each training example
◦ Drop 20% of input units, 50% of hidden units
Momentum
◦ Analogous to physics: updates build up "velocity"
◦ Want to settle in the lowest "valley" of the error surface
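The two tricks above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's actual code: `dropout` uses the slide's rates (20% input, 50% hidden) with the common "inverted" rescaling so expected activations are unchanged, and `momentum_step` is the standard velocity update; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_prob):
    """Inverted dropout: zero each unit with probability drop_prob,
    rescale survivors so the expected activation is unchanged."""
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

# Slide's rates: drop 20% of input units (50% for hidden layers).
x = rng.standard_normal((4, 784))   # a small batch of inputs
h = dropout(x, 0.2)                 # ~20% of entries zeroed

def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """Gradient descent with momentum: the update gains 'velocity',
    helping the weights roll into the lowest valley of the error."""
    v = mu * v - lr * grad          # accumulate velocity
    return w + v, v
```

At test time dropout is turned off entirely; with inverted rescaling no further correction is needed.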
Input the coded values (150 dims) to a classifier
◦ Score on raw features: 0.8961
◦ Score on encoded features: 0.912
until overfit
• Once overfitting, add dropout
• 784-1000-1000-1000-1000-10 architecture
• Example achieves ~1.8% error on MNIST
• State of the art is < 0.8% on MNIST digits!
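The 784-1000-1000-1000-1000-10 architecture can be written out directly to see its size. A minimal NumPy sketch, assuming ReLU hidden layers and linear output logits (the initialization scale and variable names are illustrative):

```python
import numpy as np

# Layer widths from the slide: 784 pixels in, four hidden layers
# of 1000 units each, 10 output classes (MNIST digits).
sizes = [784, 1000, 1000, 1000, 1000, 10]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((a, b)) * 0.01
           for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

# Total trainable parameters for this architecture.
n_params = sum(w.size + b.size for w, b in zip(weights, biases))

def forward(x):
    """Forward pass: ReLU hidden layers, linear output logits."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ w + b)
    return x @ weights[-1] + biases[-1]

logits = forward(rng.standard_normal((2, 784)))  # shape (2, 10)
```

At roughly 3.8 million parameters, a net this size overfits MNIST's 60,000 training images easily, which is exactly when the slide says to add dropout.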
nets
• Randomly zero out units (20% input, 50% hidden)
Activations
• Rectified linear (ReLU) with dropout, for classification
• Sigmoid or tanh for the autoencoder (no dropout!)
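The two activation choices above are easy to compare side by side. A small NumPy sketch (function names are mine); sigmoid squashes to (0, 1), which suits an autoencoder reconstructing pixel intensities, while ReLU is unbounded and pairs well with dropout in the classifier:

```python
import numpy as np

def relu(z):
    """Rectified linear: hidden-layer activation for the
    dropout classifier."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squashes to (0, 1): suits the autoencoder, whose targets
    are pixel intensities in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
r, s, t = relu(z), sigmoid(z), np.tanh(z)
```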