Algorithms 3. Prac8cal Considera8ons 4. Challenges See (Bengio, Courville & Vincent 2013) “Unsupervised Feature Learning and Deep Learning: A Review and New Perspec8ves” and http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-aaai2013.html for a pdf of the slides and a detailed list of references.
Needs learning (involves priors + op#miza#on/search) • Needs generaliza0on (guessing where probability mass concentrates) • Needs ways to fight the curse of dimensionality (exponen8ally many configura8ons of the variables to consider) • Needs disentangling the underlying explanatory factors (making sense of the data) 3
features vs learning them • Good representa8on: captures posterior belief about explanatory causes, disentangles these underlying factors of varia8on • Representa8on learning: guesses the features / factors / causes = good representa8on of observed data. Representation Learning 4 raw input data represented by chosen features MACHINE LEARNING represented by learned features
increasing complexity/abstrac0on 5 x h3 h2 h1 … • poten8ally exponen8al gain in expressive power • brains are deep • humans organize knowledge in a composi8onal way • Beber MCMC mixing in space of deeper representa8ons (Bengio et al, ICML 2013) • They work! SOTA on industrial-‐scale AI tasks (object recogni0on, speech recogni0on, language modeling, music modeling)
Here predic8ng a supervised target Hidden layers These learn more abstract representa8ons as you head up Input layer This has raw sensory inputs (roughly) 7
Here predic8ng or condi8oning on a supervised target Hidden layers These learn more abstract representa8ons as you head up Input layer Inputs can be reconstructed, filled-‐in or sampled 8 2-‐way connec0ons
Dimensionality To generalize locally, need representa8ve examples for all relevant varia8ons! Classical solu8on: hope for a smooth enough target func8on, or make it smooth by handcraZing good features / kernel
Gaussian kernel machines need at least k examples to learn a func8on that has 2k zero-‐crossings along some line • Theorem: For a Gaussian kernel machine to learn some maximally varying func8ons over d inputs requires O(2d) examples (Bengio, Dellalleau & Le Roux 2007)
use very carefully hand-‐designed features and representa8ons Many prac88oners are very experienced – and good – at such feature design (or kernel design) “Machine learning” oZen reduces to linear models (including CRFs) and nearest-‐neighbor-‐like features/models (including n-‐ grams, kernel SVMs, etc.) Hand-‐craNing features is 0me-‐consuming, briOle, incomplete 17
es8ma8on & predic8on, decision trees, etc. • Parameters for each dis8nguishable region • # of dis0nguishable regions is linear in # of parameters #2 The need for distributed representations Clustering 18 à No non-‐trivial generaliza8on to regions without examples
Deep Learning, etc. • Each parameter influences many regions, not just local neighbors • # of dis0nguishable regions grows almost exponen0ally with # of parameters • GENERALIZE NON-‐LOCALLY TO NEVER-‐SEEN REGIONS #2 The need for distributed representations Mul8-‐ Clustering 19 C1 C2 C3 input Non-‐mutually exclusive features/ abributes create a combinatorially large set of dis8nguiable configura8ons
Clustering 20 Learning a set of features that are not mutually exclusive can be exponen8ally more sta8s8cally efficient than having nearest-‐neighbor-‐like or clustering-‐like models
(lots of) labeled training data But almost all data is unlabeled The brain needs to learn about 1014 synap8c strengths … in about 109 seconds Labels cannot possibly provide enough informa8on Most informa8on acquired in an unsupervised fashion 21
• They transfer knowledge from previous learning: • Representa8ons • Explanatory factors • Previous learning from: unlabeled data + labels for other tasks • Prior: shared underlying explanatory factors, in par0cular between P(x) and P(Y|x)
empirical evidence in favor of mul8ple levels of representa8on Exponen0al gain for some families of func0ons Biologically inspired learning Brain has a deep architecture Cortex seems to have a generic learning algorithm Humans first learn simpler concepts and then compose them into more complex ones 24
Polynomial expressed with shared components: advantage of depth may grow exponen8ally Theorems in (Bengio & Delalleau, ALT 2011; Delalleau & Bengio NIPS 2011)
Human languages, ideas, and ar8facts are composed from simpler components • Recursion: the same operator (same parameters) is applied repeatedly on different states/components of the computa8on • Result aZer unfolding = deep computa8on / representa8on xt-‐1 xt xt+1 zt-‐1 zt zt+1 27 (Bobou 2011, Socher et al 2011)
(tens of thousands!) is crucial to approach AI • Deep architectures learn good intermediate representa8ons that can be shared across tasks (Collobert & Weston ICML 2008, Bengio et al AISTATS 2011) • Good representa8ons that disentangle underlying factors of varia8on make sense for many tasks because each task concerns a subset of the factors 28 raw input x task 1 output y1 task 3 output y3 task 2 output y2 Task A Task B Task C Prior: shared underlying explanatory factors between tasks E.g. dic8onary, with intermediate concepts re-‐used across many defini8ons
Tradi8onal ML: data = matrix • Rela8onal learning: mul8ple sources, different tuples of variables • Share representa8ons of same types across data sources • Shared learned representa8ons help propagate informa8on among data sources: e.g., WordNet, XWN, Wikipedia, FreeBase, ImageNet… (Bordes et al AISTATS 2012, ML J. 2013) • FACTS = DATA • Deduc0on = Generaliza0on 29 person url event url words history person url event P(person,url,event) url words history P(url,words,history)
auto-‐encoders trained on images • some higher-‐level features more invariant to geometric factors of varia8on • (Glorot et al. 2011): sparse rec8fied denoising auto-‐ encoders trained on bags of words for sen8ment analysis • different features specialize on different aspects (domain, sen8ment) 32 WHY?
learned representa8on (prefer 0s in the representa8on) • Informa8on disentangling (compare to dense compression) • More likely to be linearly separable (high-‐dimensional space) • Locally low-‐dimensional representa8on = local chart • Hi-‐dim. sparse = efficient variable size representa8on = data structure Few bits of informa8on Many bits of informa8on 33 Prior: only few concepts and aOributes relevant per example
Rectifier Neural Networks (Glorot,Bordes and Bengio AISTATS 2011), following up on (Nair & Hinton 2010) soZplus RBMs Leaky integrate-and-fire model Rectifier Neuroscience motivations Machine learning motivations Sparse representations f(x)=max(0,x) Outstanding results by Krizhevsky et al 2012 killing the state-‐of-‐the-‐art on ImageNet 1000: 1st choice Top-‐5 2nd best 27% err Previous SOTA 45% err 26% err Krizhevsky et al 37% err 15% err
ML models Just as human languages exploit composi8onality to give representa8ons and meanings to complex ideas Exploi8ng composi8onality gives an exponen8al gain in representa8onal power Distributed representa8ons / embeddings: feature learning Deep architecture: mul8ple levels of feature learning Prior: composi8onality is useful to describe the world around us efficiently 36
fast GPU-‐enabled predictors, the main advantage of representa8on learning is sta8s8cal: poten8al to learn from less labeled examples because of sharing of sta8s8cal strength: • Unsupervised pre-‐training and semi-‐supervised training • Mul8-‐task learning • Mul8-‐data sharing, learning about symbolic objects and their rela8ons 37
layers 3 layers ICML’2011 workshop on Unsup. & Transfer Learning NIPS’2011 Transfer Learning Challenge Paper: ICML’2012 Unsupervised and Transfer Learning Challenge + Transfer Learning Challenge: Deep Learning 1st Place
the algorithmic techniques … Before 2006 training deep architectures was unsuccessful (except for convolu8onal neural nets when used by people who speak French) What has changed? • New methods for unsupervised pre-‐training have been developed (variants of Restricted Boltzmann Machines = RBMs, regularized auto-‐encoders, sparse coding, etc.) • New methods to successfully train deep supervised nets even without unsupervised pre-‐training • Successful real-‐world applica8ons, winning challenges and bea8ng SOTAs in various areas, large-‐scale industrial apps 39
• Ability to train deep architectures by using layer-‐wise unsupervised learning, whereas previous purely supervised abempts had failed • Unsupervised feature learners: • RBMs • Auto-‐encoder variants • Sparse coding variants New York 40
in their android speech recognizer (both server-‐ side and on some phones with enough memory) • MicrosoZ uses DL in their speech recognizer • Error reduc8ons on the order of 30%, a major progress 41
task Hours of training data Deep net+HMM GMM+HMM same data GMM+HMM more data Switchboard 309 16.1 23.6 17.1 (2k hours) English Broadcast news 50 17.5 18.8 Bing voice search 24 30.4 36.2 Google voice input 5870 12.3 16.0 (lots more) Youtube 1400 47.6 52.3 42 (numbers taken from Geoff Hinton’s June 22, 2012 Google talk)
NIPS 2012 • Google incorporates DL in Google+ photo search, “A step across the seman8c gap” (Google Research blog, June 12, 2013) • Baidu now offers with similar services 43 1st choice Top-‐5 2nd best 27% err Previous SOTA 45% err 26% err Krizhevsky et al 37% err 15% err baby car
service (audio video indexing), based on Hinton/Toronto’s DBNs (Mohamed et al 2012) • Google uses DL in its Google Goggles service, using Ng/Stanford DL systems, and in its Google+ photo search service, using deep convolu8onal nets • NYT talks about these: http://www.nytimes.com/2012/06/26/technology/in-a- big-network-of-computers-evidence-of-machine-learning.html?_r=1 • Substan8ally bea8ng SOTA in language modeling (perplexity from 140 to 102 on Broadcast News) for speech recogni8on (WSJ WER from 16.9% to 14.4%) (Mikolov et al 2011) and transla8on (+1.8 BLEU) (Schwenk 2012) • SENNA: Unsup. pre-‐training + mul8-‐task DL reaches SOTA on POS, NER, SRL, chunking, parsing, with >10x beber speed & memory (Collobert et al 2011) • Recursive nets surpass SOTA in paraphrasing (Socher et al 2011) • Denoising AEs substan8ally beat SOTA in sen8ment analysis (Glorot et al 2011) • Contrac8ve AEs SOTA in knowledge-‐free MNIST (.8% err) (Rifai et al NIPS 2011) • Le Cun/NYU’s stacked PSDs most accurate & fastest in pedestrian detec8on and DL in top 2 winning entries of German road sign recogni8on compe88on 44
JMLR 2003 “A Neural ProbabilisKc Language Model” • Each word represented by a distributed con8nuous-‐ valued code vector = embedding • Generalizes to sequences of words that are seman8cally similar to training sequences 46
Seman8c rela8ons appear as linear rela8onships in the space of learned representa8ons • King – Queen ≈ Man – Woman • Paris – France + Italy ≈ Rome 48 Paris France Italy Rome
1 2 3 2n 1 2 3 … n = universal approximator 2 layers of Logic gates Formal neurons RBF units Theorems on advantage of depth: (Hastad et al 86 & 91, Bengio et al 2007, Bengio & Delalleau 2011, Braverman 2011) Some functions compactly represented with k layers may require exponential size with 2 layers RBMs & auto-encoders = universal approximator
same time If we feed a vector of inputs through a bunch of logis8c regression func8ons, then we get a vector of outputs But we don’t have to decide ahead of 8me what variables these logis8c regressions are trying to predict! 58
same time … which we can feed into another logis8c regression func8on and it is the training criterion that will decide what those intermediate binary target variables should be, so as to make a good job of predic8ng the targets for the next layer, etc. 59
= successors of 1. Fprop: visit nodes in topo-‐sort order -‐ Compute value of node given predecessors 2. Bprop: -‐ ini8alize output gradient = 1 -‐ visit nodes in reverse order: Compute gradient wrt each node using gradient wrt successors Single scalar output 67
parameterized func8on over different 8me steps or nodes of a DAG • Output state at one 8me-‐step / node is used as input for another 8me-‐step / node A small crowd quietly enters the historic church historic the quietly enters S VP Det. Adj. NP VP A small crowd NP NP church N. Semantic Representations xt−1 xt xt+1 zt−1 zt zt+1 68
• (e.g., shortest path in HMM, best output configura8on in CRF) • E.g. Max over configura8ons or sum weighted by posterior • The loss to be op8mized depends on these choices • The inference opera8ons are flow graph nodes • If con8nuous, can perform stochas8c gradient descent • Max(a,b) is con8nuous. 69
inferred from the symbolic expression of the fprop. • Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output. • Easy and fast prototyping 70
even without unsupervised pre-‐ training, thanks to beber ini8aliza8on and non-‐lineari8es (rec8fiers, maxout) and they can generalize well with large labeled sets and dropout. • Unsupervised pre-‐training s8ll useful for rare classes, transfer, smaller labeled sets, or as an extra regularizer. 71
of feature detectors (Hinton et al 2012, arXiv) • Dropouts trick: during training mul8ply neuron output by random bit (p=0.5), during test by 0.5 • Used in deep supervised networks • Similar to denoising auto-‐encoder, but corrup8ng every layer • Works beber with some non-‐lineari8es (rec8fiers, maxout) (Goodfellow et al. ICML 2013) • Equivalent to averaging over exponen8ally many architectures • Used by Krizhevsky et al to break through ImageNet SOTA • Also improves SOTA on CIFAR-‐10 (18à16% err) • Knowledge-‐free MNIST with DBMs (.95à.79% err) • TIMIT phoneme classifica8on (22.7à19.7% err) 72
connec8vity across 8me/space • Sharing weights across 8me/space (transla8on equivariance) • Pooling (transla8on invariance, cross-‐channel pooling for learned invariances) 74 xt-‐1 xt xt+1 zt-‐1 zt zt+1 Recurrent nets (RNNs) can summarize informa8on from the past Bidirec8onal RNNs also summarize informa8on from the future
Factors reconstruc8on error vector Linear manifold reconstruc8on(x) x input x, 0-‐mean features=code=h(x)=W x reconstruc8on(x)=WT h(x) = WT W x W = principal eigen-‐basis of Cov(X) Probabilis8c interpreta8ons: 1. Gaussian with full covariance WT W+λI 2. Latent marginally iid Gaussian factors h with x = WT h + noise 76 … code= latent features h … input reconstruction
P(h2 )… • Different priors: • PCA: P(hi ) is Gaussian • ICA: P(hi ) is non-‐parametric • Sparse coding: P(hi ) is concentrated near 0 • Likelihood is typically Gaussian x | h with mean given by WT h • Inference procedures (predic8ng h, given x) differ • Sparse h: x is explained by the weighted addi8on of selected filters hi = .9 x + .8 x + .7 x 77 h1 h2 h3 x1 x2 h4 h5 x W1 W3 W5 h1 h3 h5 W1 W5 W3 factors prior likelihood
à Deep Belief Network (DBN) • PCA is great but can’t be stacked into deeper more abstract representa8ons (linear x linear = linear) • One of the big ideas from Hinton et al. 2006: layer-‐wise unsupervised feature learning
• Failure of training deep supervised nets before 2006 • Regulariza8on effect vs op8miza8on effect of unsupervised pre-‐training • Is op8miza8on difficulty due to • ill-‐condi8oning? • local minima? • both? • The jury is s8ll out, but we now have success stories of training deep supervised nets without unsupervised pre-‐training
& Bengio, AISTATS 2010) Study the ac8va8ons and gradients • wrt depth • as training progresses • for different ini8aliza8ons à big difference • for different non-‐lineari8es à big difference First demonstra8on that deep supervised nets can be successfully trained almost as well as with unsupervised pre-‐ training, by se€ng up the op8miza8on problem appropriately…
Deep Architectures for AI”. • See Hinton (2010) “A pracKcal guide to training Restricted Boltzmann Machines” Undirected Models: the Restricted Boltzmann Machine [Hinton et al 2006] • Probabilis8c model of the joint distribu8on of the observed variables (inputs alone or inputs and targets) x • Latent (hidden) variables h model high-‐order dependencies • Inference is easy, P(h|x) factorizes into product of P(hi | x) h1 h2 h3 x1 x2
from P(X1 ,X2 ,…Xn ) • Gibbs sampling • Iterate or randomly choose i in {1…n} • Sample Xi from P(Xi | X1 ,X2 ,…Xi-‐1 , Xi+1 ,…Xn ) can only make small changes at a 8me! à slow mixing Note how fixed point samples from the joint. Special case of Metropolis-‐Has8ngs. • Block Gibbs sampling (not always possible) • X’s organized in blocks, e.g. A=(X1 ,X2 ,X3 ), B=(X4 ,X5 ,X6 ), C=… • Do Gibbs on P(A,B,C,…), i.e. • Sample A from P(A|B,C) • Sample B from P(B|A,C) • Sample C from P(C|A,B), and iterate… • Larger changes à faster mixing 98 A B C x9 x8 x7 x1 x2 x3 x4 x5 x6
during training, density smeared out, mode bumps overlap • Later on, hard to cross empty voids between modes 100 Are we doomed if we rely on MCMC during training? Will we be able to train really large & complex models? Training updates Mixing vicious circle
proper choice of parameters) guarantees increasing likelihood • With enough hidden units, can perfectly model any discrete distribu8on • RBMs with variable # of hidden units = non-‐parametric (Le Roux & Bengio 2008)
Gibbs chain at observed x, run k Gibbs steps SML/Persistent CD: (PCD) run nega8ve Gibbs chain in background while weights slowly change Fast PCD: two sets of weights, one with a large learning rate only used for nega8ve phase, quickly exploring modes Herding: Determinis8c near-‐chaos dynamical system defines both learning and sampling Tempered MCMC: use higher temperature to escape modes
chain at observed x, run k Gibbs steps (Hinton 2002) Sampled x- negative phase Observed x+ positive phase h+ ~ P(h|x+) h-~ P(h|x-) k = 2 steps x+ x- Free Energy push down push up
values for the hidden and visible units: • Hinton et al 2006: binary-‐binary RBMs • Welling NIPS’2004: exponen8al family units • Ranzato & Hinton CVPR’2010: Gaussian RBM weaknesses (no condi8onal covariance), propose mcRBM • Ranzato et al NIPS’2010: mPoT, similar energy func8on • Courville et al ICML’2011: spike-‐and-‐slab RBM 112
nets’ structure = computa8onal graph for P(y|x) • Graphical model’s structure ≠ computa8onal graph for inference • Recurrent nets & graphical models è family of computa0onal graphs sharing parameters • Could we have a parametrized family of computaKonal graphs defining “the model”? 116
e.g. • With bobleneck, code = new coordinate system • Encoder and decoder can have 1 or more layers • Training deep auto-‐encoders notoriously difficult Simple Auto-Encoders … code= latent features … encoder decoder input reconstruc8on 117 r(x) x h
(Bengio & Delalleau 2009): • CD-‐2k es8mates the log-‐likelihood gradient from 2k diminishing terms of an expansion that mimics the Gibbs steps • reconstruc8on error gradient looks only at the first step, i.e., is a kind of mean-‐field approxima8on of CD-‐0.5
holes in ||r(x)-‐x||2 or –log P(x | h(x)) at examples • Vector r(x)-‐x points in direc8on of increasing prob., i.e. es8mate score = d log p(x) / dx: learn score vector field = local mean • Generalize (valleys) in between above holes to form manifolds • d r(x) / dx es8mates the local covariance and is linked to the Hessian d2 log p(x) / dx2 • A Markov Chain associated with AEs es0mates the data-‐ genera0ng distribu0on (Bengio et al, arxiv 1305.663, 2013) 119
beber than ordinary training of a deep neural network (Bengio et al. NIPS’2006). Has been used successfully on large labeled datasets, where unsupervised pre-‐training did not make as much of an impact.
training phase with RBMs or auto-‐encoders on MNIST • Supervised phase with or without unsupervised updates, with or without fine-‐tuning of hidden layers • Can train all RBMs at the same 8me, same results
• -‐ Σi xi log ri (x) + (1-‐xi ) log(1-‐ri (x)) (with 0<ri (x)<1) or log-‐likelihood reconstruc8on criterion, e.g., for a mul8nomial (one-‐hot) input • -‐ Σi xi log ri (x) (where Σi ri (x)=1, summing over subset of inputs associated with this mul8nomial variable) • In general: consider what are appropriate loss func8ons to predict each of the input variables, typically, reconstruc0on neg. log-‐likelihood –log P(x|h(x)) 123
a lower dimensional “manifold” (region of high density with only few opera8ons allowed which allow small changes while staying on the manifold) -‐ variable dimension locally? -‐ SoZ # of dimensions?
input during training only • Train to reconstruct the uncorrupted input KL(reconstruction | raw input) Hidden code (representation) Corrupted input Raw input reconstruction • Encoder & decoder: any parametriza8on • As good or beber than RBMs for unsupervised pre-‐training
higher probability direc8on (Alain & Bengio 2013) • Some DAEs correspond to a kind of Gaussian RBM with regularized Score Matching (Vincent 2011) [equivalent when noiseà0] • Compared to RBM: No par88on func8on issue, + can measure training criterion Corrupted input Corrupted input prior: examples concentrate near a lower dimensional “manifold” r(x)-‐x dlogp(x)/dx /
• Minimizing reconstruc8on error forces to keep varia8ons along manifold. • Regularizer wants to throw away all varia8ons. • With both: keep ONLY sensi8vity to varia8ons ON the manifold.
near 0): few responsive units represent the ac8ve subspace (local chart) Contractive Auto-Encoders (Rifai, Vincent, Muller, Glorot, Bengio ICML 2011; Rifai, Mesnil, Vincent, Bengio, Dauphin, Glorot ECML 2011; Rifai, Dauphin, Vincent, Bengio, Muller NIPS 2011) Each region/chart = subset of ac8ve hidden units Neighboring region: one of the units becomes ac8ve/inac8ve SHARED SET OF FILTERS ACROSS REGIONS, EACH USING A SUBSET
1. Semi-‐supervised hypothesis (P(x) related to P(y|x)) 2. Unsupervised manifold hypothesis (data concentrates near low-‐dim. manifolds) 3. Manifold hypothesis for classifica8on (low density between class manifolds) (Rifai et al NIPS 2011)
1. Es8mate local principal direc8ons of varia8on U(x) by CAE (principal singular vectors of dh(x)/dx) 2. Penalize f(x)=P(y|x) predictor by || df/dx U(x) || Makes f(x) insensi8ve to varia8ons on manifold at x, tangent plane characterized by U(x).
regularized Auto-‐Encoders • But no explaining away (compe88on between causes) • (Coates et al 2011): even when training filters as RBMs it helps to perform addi8onal explaining away (e.g. plug them into a Sparse Coding inference), to obtain beber-‐classifying features • RBMs would need lateral connec8ons to achieve similar effect • Auto-‐Encoders would need to have lateral recurrent connec8ons or deep recurrent structure 141
model: • One of the first unsupervised feature learning algorithms with non-‐linear feature extrac8on (but linear decoder) MAP inference recovers sparse h although P(h|x) not concentrated at 0 • Linear decoder, non-‐parametric encoder • Sparse Coding inference: convex but expensive op8miza8on 142
by a parametric encoder: Predic8ve Sparse Decomposi8on (Kavukcuoglu et al 2008) • Very successful applica8ons in machine vision with convolu8onal architectures 143
unsupervised deep Boltzmann machine helps a lot • Ini8alizing each layer of a supervised neural network as an RBM, auto-‐encoder, denoising auto-‐encoder, etc can help a lot • Helps most the layers further away from the target • Not just an effect of the unsupervised prior • Jointly training all the levels of a deep architecture is difficult because of the increased non-‐linearity / non-‐smoothness • Ini8alizing using a level-‐local learning algorithm is a useful trick • Providing intermediate-‐level targets can help tremendously (Gulcehre & Bengio ICLR 2013)
Salakhutdinov 2006) • Stack encoders / P(h|x) into deep encoder • Stack decoders / P(x|h) into deep decoder 148 x h3 h2 h1 x h3 h2 h1 h1 h2 x h2 h1 ^ ^ ^ W1 W2 W3 W1 W1 T W2 W2 T W3 W3 T
AISTATS 2009) • Halve the RBM weights because each layer now has inputs from below and from above • Posi8ve phase: (mean-‐field) varia8onal inference = recurrent AE • Nega8ve phase: Gibbs sampling (stochas8c units) • train by SML/PCD 151 x h3 h2 h1 W1 ½W1 W1 T ½W1 W2 ½W2 T W3 ½W1 T ½W1 T ½W2 ½W2 T ½W2 ½W3 T ½W3 ½W3 T
ICML 2012) • MCMC on top-‐level auto-‐encoder • ht+1 = encode(decode(ht ))+σ noise where noise is Normal(0, d/dh encode(decode(ht ))) • Then determinis8cally propagate down with decoders 152 x h3 h2 h1
that defines a transi0on operator for a Markov chain whose asympto0c distribu0on is implicitly es0mated by the model • Noise injected in input and hidden layers • Trained to max. reconstruc8on prob. of example at each step • Example structure inspired from the DBM Gibbs chain: 153 1" x 0" h 3" h 2" h 1" W 1" W 1" W 1" T" W 1" W 2" W 2" T" W 3" W 1" T" W 1" T" W 2" W 2" T" W 2" W 3" T" W 3" W 3" T" sample"x 1 " sample"x 2 " sample"x 3 " target" target" target" noise noise 3 to 5 steps (Bengio, Yao, Alain & Vincent, arxiv 2013; Bengio & Laufer, arxiv 2013)
es8mators of the data-‐genera8ng distribu8on through their Markov chain, so long as they consistently es8mate the condi8onal denoising distribu8on and the Markov chain converges. Making P✓n (X| ˜ X) match P(X| ˜ X) makes ⇡n(X) match P(X) truth denoising distr. sta8onary distr. truth
probability into a parametrized noise-‐dependent part and a noise-‐ independent part , we also get a consistent es8mator of the data genera8ng distribu8on, if the chain converges. 157 1" x 0" h 3" h 2" h 1" W 1" W 1" W 1" T" W 1" W 2" W 2" T" W 3" W 1" T" W 1" T" W 2" W 2" T" W 2" W 3" T" W 3" W 3" T" sample"x 1 " sample"x 2 " sample"x 3 " target" target" target" noise noise
denoising auto-‐encoder architecture, single hidden layer with noise only injected in input • Factored Bernouilli reconstruc8on prob. distr. • = parameter-‐less, salt-‐and-‐pepper noise on top of X • Generalizes (Alain & Bengio ICLR 2013): not just conKnuous r.v., any training criterion (as log-‐likelihood), not just Gaussian but any corrupKon (no need to be Kny to correctly esKmate distribuKon). 160 x0 W1 W1 W1 T W1 W1 T W1 T sample x1 sample x2 target sample x3
(Breuleux et al 2011, Rifai et al 2012, Bengio et al 2013): • Generate 10000 samples from model • Use them as training examples for Parzen density es8mator • Evaluate its log-‐likelihood on MNIST test data 162 Training examples
a GSN can provably sample from any condi8onal over subsets of its inputs, so long as we use the condi8onal associated with the reconstruc8on distribu8on and clamp the right-‐hand side variables. (Bengio & Laufer arXiv 2013) 163
“Prac8cal Recommenda8ons for Gradient-‐ Based Training of Deep Architectures” • Unsupervised pre-‐training • Stochas8c gradient descent and se€ng learning rates • Main hyper-‐parameters • Learning rate schedule • Early stopping • Minibatches • Parameter ini8aliza8on • Number of hidden units • L1 and L2 weight decay • Sparsity regulariza8on • Debugging • How to efficiently search for hyper-‐parameter configura8ons 167
update, SGD updates aZer only 1 or few examples: • L = loss func8on, zt = current example, θ = parameter vector, and εt = learning rate. • Ordinary gradient descent is a batch method, very slow, should never be used. 2nd order batch method are being explored as an alterna8ve but SGD with selected learning schedule remains the method to beat. Stochastic Gradient Descent (SGD) 168
the same for all parameters. • Collobert scales them by the inverse of square root of the fan-‐in of each neuron • Beber results can generally be obtained by allowing learning rates to decrease, typically in O(1/t) because of theore8cal convergence guarantees, e.g., with hyper-‐parameters ε0 and τ. • New papers on adap8ve learning rates procedures (Schaul 2012, 2013), Adagrad (Duchi et al 2011 ), ADADELTA (Zeiler 2012) 169
many different training runs for each value of hyper-‐parameter for #itera8ons) • Monitor valida8on error during training (aZer visi8ng # of training examples = a mul8ple of valida8on set size) • Keep track of parameters with best valida8on error and report them at the end • If error does not improve enough (with some pa8ence), stop. 170
networks (or possibly recursive ones), the gradient is a product of Jacobian matrices, each associated with a step in the forward computa8on. This can become very small or very large quickly [Bengio et al 1994], and the locality assump8on of gradient descent breaks down. • Two kinds of problems: • sing. values of Jacobians > 1 à gradients explode • or sing. values < 1 à gradients shrink & vanish 171
Mikolov is to clip gradients to a maximum NORM value. Makes a big difference in Recurrent Nets (Pascanu et al ICML 2013) Allows SGD to compete with HF op8miza8on on difficult long-‐term dependencies tasks. Helped to beat SOTA in text compression, language modeling, speech recogni8on. 174 xt-‐1 xt xt+1 zt-‐1 zt zt+1
output (or reconstruc8on) biases to op8mal value if weights were 0 (e.g. mean target or inverse sigmoid of mean target). • Ini8alize weights ~ Uniform(-‐r,r), r inversely propor8onal to fan-‐ in (previous layer size) and fan-‐out (next layer size): for tanh units (and 4x bigger for sigmoid units) (Glorot & Bengio AISTATS 2010) 178
the input, which is sparse and high-‐ dimensional; Language models have a huge output space (1 unit per word). … code= latent features … sparse input dense output probabilities cheap expensive 179 categories words within each category • (Dauphin et al, ICML 2011) Reconstruct the non-‐zeros in the input, and reconstruct as many randomly chosen zeros, + importance weights • (Collobert & Weston, ICML 2008) sample a ranking loss • Decompose output probabili8es hierarchically (Morin & Bengio 2005; Blitzer et al 2005; Mnih & Hinton 2007,2009; Mikolov et al 2011)
safely try new models. • Theano Library (python) does it symbolically. Other neural network packages (Torch, Lush) can compute gradients for any given run-‐8me value. (Bergstra et al SciPy’2010) 180
Common approach: manual + grid search • Grid search over hyperparameters: simple & wasteful • Random search: simple & efficient • Independently sample each HP, e.g. l.rate~exp(U[log(.1),log(.0001)]) • Each training trial is iid • If a HP is irrelevant grid search is wasteful • More convenient: ok to early-‐stop, con8nue further, etc. 181
2009; Bergstra et al NIPS 2011; Thornton et al arXiv 2012; Snoek et al NIPS 2012) • Iterate • Es8mate P(valid. err | hyper-‐params config x, D) • choose op8mis8c x, e.g. maxx P(valid. err < current min. err | x) • train with config x, observe valid. err. v, D ß D U {(x,v)} 182
Hyper-‐parameters (layer size, regulariza8on, possibly learning rate) • Use mul8-‐core machines, clusters and random sampling for cross-‐valida8on or sequen8al model-‐ based op8miza8on 184
• Only by a small constant factor, and much more compact than non-‐parametric (e.g. n-‐gram models or kernel machines) • Very fast during inference/test 8me (feed-‐forward pass is just a few matrix mul8plies) • Need more training data? • Can handle and benefit from more training data (esp. unlabeled), suitable for Big Data (Google trains nets with a billion connec8ons, [Le et al, ICML 2012; Dean et al NIPS 2012]) • Actually needs less labeled data 185
hypothesis: • Unsupervised component forces model close to P(x) • Representa8ons good for P(x) are good for P(y|x) • Op8miza8on hypothesis: • Unsupervised ini8aliza8on near beber local minimum of P(y|x) • Can reach lower local minimum otherwise not achievable by random ini8aliza8on • Easier to train each layer using a layer-‐local criterion (Erhan et al JMLR 2010)
in func8on space • Color = epoch • Top: trajectories w/o pre-‐training • Each trajectory converges in different local min. • No overlap of regions with and w/o pre-‐ training
in different local min. • With ISOMAP, try to preserve geometry: pretrained nets converge near each other (less variance) • Good answers = worse than a needle in a haystack (learning dynamics)
and NLP hinged on faster compu8ng, GPUs, and large datasets • A 100-‐fold speedup is possible without wai8ng another 10yrs? • Challenge of distributed training • Challenge of condi8onal computa8on 192
a small fraction of parameters / example • Deep nets vs decision trees • Hard mixtures of experts • Condi8onal computa8on for deep nets: sparse distributed gaters selec8ng combinatorial subsets of a deep net • Challenges: • Back-‐prop through hard decisions • Gated architectures explora8on • Symmetry breaking to reduce ill-‐condi8oning 193
• Large minibatches + 2nd order methods • Asynchronous SGD (Bengio et al 2003, Le et al ICML 2012, Dean et al NIPS 2012) • Bobleneck: sharing weights/updates among nodes • New ideas: • Low-‐resolu8on sharing only where needed • Specialized condi8onal computa8on (each computer specializes in updates to some cluster of gated experts, and prefers examples which trigger these experts) 194
underfi€ng • Marginal u0lity of wider MLPs decreases quickly below memoriza8on baseline • Current limita8ons: local minima or ill-‐condi8oning? • Adap8ve learning rates and stochas8c 2nd order methods • Condi8onal comp. & sparse gradients à beber condi8oning: when some gradients are 0, many cross-‐deriva8ves are also 0. 195
trained models • Early during training, density smeared out, mode bumps overlap • Later on, hard to cross empty voids between modes 197 Are we doomed if we rely on MCMC during training? Will we be able to train really large & complex models? Training updates Mixing vicious circle
and stacked Contrac8ve Auto-‐Encoders: 1. MCMC sampling from top layer model 2. Propagate top-‐level representa8ons to input-‐level repr. • Deeper nets visit more modes (classes) faster 198 x h2 h1 1-‐layer (RBM) 2-‐layer (CAE) (Bengio et al ICML 2013)
set between them when viewed in the learned representa8on-‐space, making the empirical distribu8on more uniform and unfolding manifolds Linear interpola8on at layer 1 Linear interpola8on at layer 2 3’s manifold 9’s manifold Linear interpola8on in pixel space
abstrac8ons è disentangling • E.g. reverse video bit, class bits in learned representa8ons: easy to Gibbs sample between modes at abstract level • Hypotheses tested and not rejected: • more abstract/disentangled representa8ons unfold manifolds and fill more the space • can be exploited for beber mixing between modes 200 Pixel space 9’s manifold 3’s manifold Representa8on space 9’s manifold 3’s manifold
complex inputs (e.g. in NLP: sense ambiguity, parsing, seman8c role) • Almost any inference mechanism can be combined with deep learning • See [Bobou, LeCun, Bengio 97], [Graves 2012] • Complex inference can be hard (exponen8ally) and needs to be approximate à learn to perform inference 201
output models • P(h|x) intractable because of many important modes • MAP, Varia8onal, MCMC approxima8ons limited to 1 or few modes • Approximate inference can hurt learning (Kulesza & Pereira NIPS’2007) • Mode mixing harder as training progresses (Bengio et al ICML 2013) 202 Training updates Mixing vicious circle
h • BAD! Exact inference? Nope. Just Pain. too many possible configura8ons of h • WORSE! Each learning step usually requires inference and/or sampling from P(h, x) 203
inference • Loopy belief prop. (Ross et al CVPR 2011, Stoyanov et al 2011) • Varia8onal mean-‐field (Goodfellow et al, ICLR 2013) • MAP (Kavukcuoglu et al 2008, Gregor & LeCun ICML 2010) 2. OpKmize parameters wrt criterion of interest, possibly decoupling from the genera8ve model’s parameters Learning can compensate for the inadequacy of approximate inference, taking advantage of specifics of the data distribu8on 206
Foreign speech uberance example, y=answer to ques8on: • 10 word segments • 100 plausible candidates per word • 106 possible segmenta8ons • Most configura8ons (999999/1000000) implausible • è 1020 high-‐probability modes • All known approximate inference scheme may break down if the posterior has a huge number of modes (fails MAP & MCMC) and not respec8ng a varia8onal approxima8on (fails varia8onal) 207
if there are poten8ally many true latent variables involved • Exploits structure in P(y|x) that persist even aZer summing h • But how do we generalize this idea to full joint-‐distribu8on learning and answering any ques8on about these variables, not just one? 208
by backprop (Bengio & Laufer, arxiv 1306.1091) • Avoid any explicit latent variables whose marginaliza0on is intractable, instead train a stochas0c computa0onal graph that generates the right {condi0onal} distribu0on. 209 1" x 0" h 3" h 2" h 1" W 1" W 1" W 1" T" W 1" W 2" W 2" T" W 3" W 1" T" W 1" T" W 2" W 2" T" W 2" W 3" T" W 3" W 3" T" sample"x 1 " sample"x 2 " sample"x 3 " target" target" target" noise noise 3 to 5 steps
auto-‐encoder is a consistent es8mator of the data genera8ng distribu8on (if the chain converges) • Same thing for Genera8ve Stochas8c Networks (so long as the reconstruc8on probability has enough expressive power to learn the required condi8onal distribu8on). 210 1" x 0" h 3" h 2" h 1" W 1" W 1" W 1" T" W 1" W 2" W 2" T" W 3" W 1" T" W 1" T" W 2" W 2" T" W 2" W 3" T" W 3" W 3" T" sample"x 1 " sample"x 2 " sample"x 3 " target" target" target" noise noise
-‐ figure out the underlying structure of the data -‐ avoid curse of dimensionality -‐ mix beber between modes • How to obtained beber disentangling???? 213
deep learning is to allow learning higher levels of abstrac8on • Higher-‐level abstrac8ons disentangle the factors of varia8on, which allows much easier generaliza8on and transfer 214
agent learns, it performs an approximate op8miza8on with respect to some endogenous objec8ve. 217 Hypothesis 2 • When the brain of a single biological agent learns, it relies on approximate local descent in order to gradually improve itself.
deeper computa8ons (going through more areas or more computa8onal steps in sequence over the same areas). 218 Hypothesis 4 • Learning of a single human learner is limited by effecKve local minima. Theore8cal and experimental results on deep learning suggest: Possibly due to ill-‐condi8oning, but behaves like local min
discover high-‐level abstrac8ons by chance because these are represented by a deep sub-‐network in the brain. 219 Hypothesis 6 • A human brain can learn high-‐level abstrac8ons if guided by the signals produced by other humans, which act as hints or indirect supervision for these high-‐level abstrac8ons. Suppor8ng evidence: (Gulcehre & Bengio ICLR 2013)
an efficient evolu8onary operator, allowing rapid search in the space of memes, that helps humans build up beber high-‐level internal representa8ons of their world. How could language/education/ culture possibly help find the better local minima associated with more useful abstractions? More than random search: poten8al exponen8al speed-‐ up by divide-‐and-‐conquer combinatorial advantage: can combine solu8ons to independently solved sub-‐ problems
explana8ons for current x) • Minutes, hours: learning (local descent, like current DL) • Years, centuries: cultural evolu0on (global op8miza8on, recombina8on of ideas from other humans) 223
Stanford deep learning tutorials with simple programming assignments and reading list hbp://deeplearning.stanford.edu/wiki/ • ACL 2012 Deep Learning for NLP tutorial hbp://www.socher.org/index.php/DeepLearningTutorial/ • ICML 2012 Representa8on Learning tutorial hbp://www.iro.umontreal.ca/~bengioy/talks/deep-‐learning-‐ tutorial-‐2012.html • IPAM 2012 Summer school on Deep Learning hbp://www.iro.umontreal.ca/~bengioy/talks/deep-‐learning-‐tutorial-‐ aaai2013.html • More reading: Paper references in separate pdf, on my web page 224
library hbp://deeplearning.net/soZware/theano • Can do automa8c, symbolic differen8a8on • Senna: POS, Chunking, NER, SRL • by Collobert et al. hbp://ronan.collobert.com/senna/ • State-‐of-‐the-‐art performance on many tasks • 3500 lines of C, extremely fast and using very lible memory • Torch ML Library (C++ + Lua) hbp://www.torch.ch/ • Recurrent Neural Network Language Model hbp://www.fit.vutbr.cz/~imikolov/rnnlm/ • Recursive Neural Net and RAE models for paraphrase detec8on, sen8ment analysis, rela8on classifica8on www.socher.org 225
researchers from a wide variety of fields (no need to understand RKHS). • To make deep learning more accessible: release off-‐ the-‐shelf learning packages that handle hyper-‐ parameter op8miza8on, exploi8ng mul8-‐core or cluster at disposal of user. • Spearmint (Snoek) • HyperOpt (Bergstra) 226
• Int. Conf. on Learning Representa8on 2013 a huge success! • Industrial strength applica8ons in place (Google, MicrosoZ) • Room for more research: • Scaling computa8on even more • Beber op8miza8on • Ge€ng rid of intractable inference (in the works!) • Coaxing the models into more disentangled abstrac8ons • Learning to reason from incrementally added facts 227