Numerical computation
Directional derivative
- The directional derivative in direction $\mu$ (a unit vector) is the slope of the function $f$ in direction $\mu$. Formally, it is the derivative of the function $f(x+\alpha\mu)$ with respect to $\alpha$, evaluated at $\alpha = 0$. You can verify this by writing out the derivative of the function of $\alpha$ directly.
- Using the chain rule, we can see that $\frac{\partial}{\partial\alpha}f(x+\alpha\mu)\big|_{\alpha=0} = \mu^{T}\nabla_{x}f(x)$.
To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative: $\min_{\mu,\,\mu^{T}\mu=1}\mu^{T}\nabla_{x}f(x) = \min_{\mu,\,\mu^{T}\mu=1}\|\mu\|_{2}\,\|\nabla_{x}f(x)\|_{2}\cos\theta$,
where $\theta$ is the angle between $\mu$ and the gradient. Substituting in $\|\mu\|_{2}=1$ and ignoring factors that do not depend on $\mu$, this simplifies to $\min_{\mu}\cos\theta$, which is minimized when $\mu$ points in the opposite direction of the gradient.
This is known as the method of steepest descent, or gradient descent.
Steepest descent proposes a new point $x' = x - \epsilon\nabla_{x}f(x)$, where $\epsilon$ is the learning rate.
The step size $\epsilon$ can be chosen in different ways (a small sketch follows the list):
- Set $\epsilon$ to a small constant.
- Evaluate for several values of $\epsilon$ and choose the one that results in the smallest objective function value.
- Line search.
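A minimal numpy sketch of gradient descent with a crude "evaluate several values of $\epsilon$" step-size rule; the quadratic objective and candidate step sizes are illustrative choices, not from the original notes:

```python
import numpy as np

# Toy objective f(x) = 1/2 x^T A x with known gradient A x (illustrative choice).
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, 5.0])
candidate_eps = [0.001, 0.01, 0.05, 0.1]   # evaluate several values of epsilon

for step in range(50):
    g = grad(x)
    # pick the epsilon that gives the smallest objective value along -g
    eps = min(candidate_eps, key=lambda e: f(x - e * g))
    x = x - eps * g

print(x, f(x))   # should be close to the minimum at the origin
```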
Beyond the gradient: Jacobian and Hessian Matrices
Second derivative measures curvature.
If the gradient is 1, we can make a step of size $\epsilon$ along the negative gradient. If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than $\epsilon$.
The Hessian matrix H(f)(x) is defined such that $H(f)(x)_{i,j} = \frac{\partial^{2}}{\partial x_{i}\partial x_{j}}f(x)$.
Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
The second derivative in a specific direction represented by a unit vector d is given by $d^{T}Hd$. This can be easily verified by taking the second derivative of $f(x+\alpha d)$ with respect to $\alpha$ and evaluating at $\alpha = 0$.
When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue. This can be verified from the definition of an eigenvalue, $Hd = \lambda d$.
For other directions of $d$, the directional second derivative is a weighted average of all the eigenvalues, with weights between 0 and 1, and eigenvectors that have a smaller angle with d receiving more weight. This can be verified by the decomposition of H times unit vector d.
The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative.
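A small numpy check of these statements; the particular symmetric "Hessian" below is an arbitrary illustrative matrix, not from the notes:

```python
import numpy as np

# An arbitrary real symmetric "Hessian" for illustration.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)   # real eigenvalues, orthonormal eigenvectors

# Directional second derivative d^T H d for an eigenvector equals its eigenvalue.
d = eigvecs[:, 0]
print(d @ H @ d, eigvals[0])

# For a generic unit vector, d^T H d is a weighted average of the eigenvalues,
# so it lies between the smallest and the largest eigenvalue.
d = np.array([1.0, 1.0]) / np.sqrt(2.0)
print(eigvals[0] <= d @ H @ d <= eigvals[-1])
```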
To summarize, the second derivative tells us how well we can expect a gradient descent step to perform.
Second-order Taylor series approximation to the function f(x) around the current point $x^{(0)}$: $f(x) \approx f(x^{(0)}) + (x - x^{(0)})^{T}g + \frac{1}{2}(x - x^{(0)})^{T}H(x - x^{(0)})$
Where g is the gradient and H is the Hessian.
If we use a learning rate of $\epsilon$, then the new point x will be given by $x^{(0)} - \epsilon g$. Substituting this into our approximation, we obtain $f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^{T}g + \frac{1}{2}\epsilon^{2}g^{T}Hg$. There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When $g^{T}Hg$ is zero or negative, the Taylor series approximation predicts that increasing $\epsilon$ forever will decrease f forever.
In practice, the Taylor series is unlikely to remain accurate for large $\epsilon$, so one must resort to more heuristic choices of $\epsilon$ in this case. When $g^{T}Hg$ is positive, solving for the optimal step size that decreases the Taylor series approximation the most yields $\epsilon^{*} = \frac{g^{T}g}{g^{T}Hg}$. In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue $\lambda_{max}$ (if g aligns with that eigenvector, then $Hg = \lambda_{max}g$), this optimal step size is given by $1/\lambda_{max}$. The eigenvalues of the Hessian thus determine the scale of the learning rate.
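For a quadratic objective the optimal step size $\epsilon^{*} = \frac{g^{T}g}{g^{T}Hg}$ can be checked numerically; the quadratic below is an illustrative example, not from the notes:

```python
import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 1.0]])           # illustrative positive definite Hessian
x0 = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ H @ x
g = H @ x0                            # gradient at x0

eps_star = (g @ g) / (g @ H @ g)      # optimal step from the second-order expansion
# eps_star gives a lower objective than slightly smaller or larger steps
for eps in [0.5 * eps_star, eps_star, 1.5 * eps_star]:
    print(eps, f(x0 - eps * g))
```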
The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point. In the univariate case this is trivial: we can draw a picture of $x^2$ or $-x^2$ to help us understand.
In multiple dimensions, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum or a saddle point.
When the Hessian matrix is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate case.
When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that x is a local maximum on one cross section of f but a local minimum on another cross section.
The test is inconclusive whenever all the nonzero eigenvalues have the same sign but at least one eigenvalue is zero.
Condition number of the Hessian matrix
In multiple dimensions, there is a different second derivative for each direction at a single point.
The condition number of the Hessian at this point measures how much the second derivatives differ from each other.
When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative. So it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.
Poor condition number also makes choosing a good step size difficult. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature.
At the beginning the step size can be fairly large, but later, as the curvature becomes larger and larger, the step size has to become smaller and smaller.
Newton’s method
We need to solve this condition number problem.
This issue can be resolved by using information from the Hessian matrix to guide the search. The simplest method is Newton’s method.
Newton’s method is based on using a second-order Taylor series expansion to approximate f(x) near some point $x^{(0)}$: $f(x) \approx f(x^{(0)}) + (x - x^{(0)})^{T}\nabla_{x}f(x^{(0)}) + \frac{1}{2}(x - x^{(0)})^{T}H(f)(x^{(0)})(x - x^{(0)})$
If we then solve for the critical point of this function, we obtain $x^{*} = x^{(0)} - H(f)(x^{(0)})^{-1}\nabla_{x}f(x^{(0)})$.
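A one-line numpy sketch of the Newton update; for the toy quadratic used here (an illustrative choice, not from the notes) the update jumps to the minimum in a single step:

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 2.0]])            # Hessian of the toy quadratic f(x) = 1/2 x^T H x
x0 = np.array([3.0, -1.0])
g = H @ x0                            # gradient at x0

x_new = x0 - np.linalg.solve(H, g)    # Newton step; solve instead of explicit inverse
print(x_new)                          # lands exactly at the minimum (the origin)
```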
Numerical Precision: a deep learning super skill
Often deep learning algorithms “sort of work”
Often deep learning algorithms “explode”
Culprit is often loss of numerical precision
Rounding and truncation errors:
- In a digital computer, we use float32 or similar schemes to represent real numbers.
- A real number x is rounded to $x + \delta$ for some small $\delta$.
- Overflow: large x replaced by inf
- Underflow: small x replaced by 0 (very harmful)
Exp function:
- Overflows for large x
- Doesn’t need to be very large
- In float32, exp(x) overflows for x around 89
- Never use large x
- Underflows for very negative x, possibly catastrophic if exp(x) in denominator.
Subtraction:
- Dangerous when the two arguments are close in value (catastrophic cancellation: the leading digits cancel and mostly rounding error remains).
Log and sqrt:
Log exp:
Which is the better hack?
The first one is better, because it avoids evaluating the derivative of sqrt at x = 0, which divides by zero.
Log(sum(exp))
Why does the logsumexp trick work?
Algebraically equivalent to the original version: $\log\sum_{i}\exp(z_{i}) = \max_{j}z_{j} + \log\sum_{i}\exp(z_{i} - \max_{j}z_{j})$. The shifted exponentials cannot overflow, and at least one of them is $\exp(0)=1$, so the sum cannot underflow to 0 either. (Note: the corresponding slide's last equation is missing one log.)
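A minimal sketch of the trick, comparing the naive computation against subtracting the max first:

```python
import numpy as np

def logsumexp(z):
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))   # shift by the max before exponentiating

z = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(z))))   # naive: exp overflows to inf -> result is inf
print(logsumexp(z))                # stable: ~1002.41
```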
Softmax
Sigmoid
Cross-entropy
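A sketch of how the same idea stabilizes softmax and the sigmoid cross-entropy. These are standard reformulations written as an illustration, not any particular library's implementation:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                      # shift so the largest logit is 0
    e = np.exp(z)
    return e / np.sum(e)

def sigmoid_cross_entropy(z, y):
    # -log sigmoid((2y-1) z) written via softplus, avoiding exp overflow:
    # softplus(x) = log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    x = (1 - 2 * y) * z
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

print(softmax(np.array([1000.0, 1001.0, 1002.0])))                          # no overflow
print(sigmoid_cross_entropy(np.array([100.0, -100.0]), np.array([1, 0])))   # ~0 loss
```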
Recommendation: use built-in library implementations of these operations whenever possible.
Bug hunting strategies:
Machine learning basics
Different tasks of machine learning
Classification with missing inputs
- Learn a set of functions, each function corresponds to classifying x with a different subset of its inputs missing.
- A more efficient way is to learn a probability distribution over all the relevant variables, then solve the classification task by marginalizing out the missing variables.
Transcription
- Optical character recognition
- Google street view uses deep learning to process address numbers
- Speech recognition
- Machine translation
- A sequence of symbols in some language converted to a sequence of symbols in another language
- Structured output
- The output values are tightly interrelated; examples follow:
- Mapping a natural language sentence into a tree that describes its grammatical structure by tagging nodes of the tree as verbs, nouns, adverbs, and so on
- Image captioning: the words produced by an image captioning program must form a valid sentence
- Anomaly detection
- Credit card fraud detection
- Synthesis and sampling
- The machine is asked to generate new examples that are similar to those in the training data
- Video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel.
- Given the input, generate the specific kind of output. For example, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence.
- Denoising
- Predict the clean example x from its corrupted version $\tilde{x}$.
- Density estimation
- Learn a function $p_{model}: \mathbb{R}^n \to \mathbb{R}$.
The performance measure P
- Accuracy: the proportion of examples for which the model produces the correct output
- Log-probability the model assigns to some examples
- Sometimes hard to choose a performance measure that corresponds well to the desired behavior of the system.
- When doing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes, or if it rarely makes very large mistakes?
The experience E
- Unsupervised learning
- Supervised learning
- Reinforcement learning algorithms interact with an environment, so they do not just experience a fixed dataset.
Generalization
- It requires our algorithm to perform well on new, previously unseen inputs, not just those on which our model was trained.
- Test error is generalization error
Occam’s razor
- This principle states that among competing hypotheses that explain known observations equally well, we should choose the “simplest” one.
Vapnik-Chervonenkis dimension
- Measure the capacity of a binary classifier.
- Defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.
Regularization
- Expressing preferences for one function over another is a more general way of controlling a model’s capacity than including or excluding members from the hypothesis space.
- We can think of excluding a function from a hypothesis space as expressing an infinitely strong preference against that function.
Validation set and hyperparameters
- What is hyperparameters?
- For example, in polynomial regression, the degree of the polynomial acts as a capacity hyperparameter.
- Why use validation set?
- Train hyperparameters
- Why not train hyperparameters using training data?
- Such hyperparameters would always choose the maximum model capacity, resulting in overfitting.
Statistics theory basics
- See my other blog post: statistics theory
Maximum likelihood estimation
Dividing by the number of examples to express the objective as an expectation over the empirical distribution does not change the solution, because rescaling the cost function does not change the argmax.
One way to interpret MLE is to view it as minimizing the dissimilarity between the empirical distribution $\hat{p}_{data}$ and the model distribution $p_{model}$.
The dissimilarity between the two can be measured by the KL divergence.
Cross entropy:
Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by model.
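A tiny illustration of this view using a toy example of my own: for a Bernoulli model, the parameter value that minimizes the negative log-likelihood (i.e. the cross-entropy with the empirical distribution) is simply the empirical frequency:

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # toy binary dataset

def neg_log_likelihood(theta):
    # cross-entropy between the empirical distribution and Bernoulli(theta)
    return -np.mean(data * np.log(theta) + (1 - data) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmin([neg_log_likelihood(t) for t in thetas])]
print(best, data.mean())                            # both are 0.75
```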
Stochastic gradient descent
- Idea: gradient descent requires computing $\nabla_{\theta}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\nabla_{\theta}L(x^{(i)}, y^{(i)}, \theta)$; as the training set grows larger, the time to take a single gradient step becomes long. The insight of SGD is to estimate this expectation using a small set of samples.
- On each step, sample a minibatch of $m'$ examples uniformly from the training set. The estimate of the gradient is formed as $g = \frac{1}{m'}\nabla_{\theta}\sum_{i=1}^{m'}L(x^{(i)}, y^{(i)}, \theta)$, and the SGD update is $\theta \leftarrow \theta - \epsilon g$ (see the sketch below).
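A minimal numpy SGD loop for linear least squares, just to show the minibatch gradient estimate; the model, data, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w, eps, batch_size = np.zeros(3), 0.1, 32
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)   # sample a minibatch
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ w - yb) / batch_size                  # minibatch gradient estimate
    w -= eps * g                                               # SGD update
print(w)   # close to true_w
```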
Challenges motivating deep learning
- The curse of dimensionality:
- In one dimension, when we generalize to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same cell as the new input
- When dimension is high, sometimes we cannot find the training data located in that space.
- For example, suppose each variable takes the values 0 or 1. With 100 variables, we would need on the order of $2^{100}$ training data points to have a whole picture of the space.
Local constancy and smoothness regularization
Smoothness prior:
This prior states that the function we learn should not change very much within a small region
Many simpler algorithms rely exclusively on this prior to generalize well, and as a result, they fail to scale to the statistical challenges involved in solving AI-level tasks.
K-nearest neighbors family of learning algorithms is an extreme example of the local constancy approach.
Decision trees also suffer from the limitations of exclusively smoothness-based learning, because:
- They break the input space into as many regions as there are leaves and use a separate parameter (or sometimes more) in each region.
- If a target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree.
If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well?
- Answer is yes
- The key insight is that a very large number of regions, such as $O(2^k)$, can be defined with $O(k)$ examples, so long as we introduce some dependencies between the regions through additional assumptions about the underlying data-generating distribution.
- In this way, we can generalize nonlocally.
- Many different deep learning algorithms provide implicit or explicit assumptions of nonlocality.
Manifold learning
A very nice answer on zhihu: https://www.zhihu.com/question/24015486
Essentially, it is nonlinear dimensionality reduction.
Deep Feedforward Networks
Computational graph and backpropagation
- Blog reference: https://colah.github.io/posts/2015-08-Backprop/
- A forward pass computes the values of all nodes.
- A reverse-mode pass stores the partial derivative of the cost with respect to every node (see the sketch below).
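A tiny sketch of these two passes on a toy graph of my own choosing (c = a*b, e = c + a), in the style of the colah post:

```python
# Toy computational graph: c = a * b, e = c + a. One forward pass computes
# every node's value; one reverse-mode pass stores d(e)/d(node) for every node.
a, b = 2.0, 3.0

# forward pass
c = a * b
e = c + a

# reverse-mode pass (accumulate derivatives from the output back to the inputs)
de_de = 1.0
de_dc = de_de * 1.0                 # e = c + a
de_da_via_e = de_de * 1.0
de_db = de_dc * a                   # c = a * b
de_da_via_c = de_dc * b
de_da = de_da_via_c + de_da_via_e   # sum contributions over all paths to a

print(e, de_da, de_db)              # 8.0, 4.0, 2.0
```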
Learn conditional statistics
When we use a cross-entropy loss, we are learning the full distribution p(y|x); often we actually only need to learn one conditional statistic of y given x, say E[y|x].
If we are using the squared loss, the optimal predictor is $f^{*}(x) = E[y\mid x]$, so we can train the model with gradient descent to estimate this statistic directly.
Here we only need the information in E[y|x], not the full distribution p(y|x), so it seems we are saving some modeling effort.
The problem is that cost functions of this kind can be hard to train with: some output units saturate and produce very small gradients when combined with these cost functions.
Sigmoid units for Bernoulli Output distribution:
Intuition for defining a probability distribution over y using values of z:
We first obtain an unnormalized probability by assuming that $\log\tilde{P}(y\mid z) = yz$. The left-hand side can take any value in $(-\infty, \infty)$, since an unnormalized probability only needs to satisfy $\tilde{P}(y\mid z) > 0$, and $yz$ also takes values in $(-\infty, \infty)$, so this assumption is reasonable.
Another reason for choosing this assumption is that the exponentiation it implies undoes the log in the cross-entropy loss.
So: $P(y\mid z) = \frac{\exp(yz)}{\sum_{y'=0}^{1}\exp(y'z)} = \sigma((2y-1)z)$, and the cross-entropy loss becomes $J = -\log P(y\mid z) = \zeta((1-2y)z)$.
The function $\zeta(x) = \log(1+\exp(x))$ appearing in the last equation is called the softplus.
Softmax units for multinoulli output distributions
Regularization for deep learning
L2 regularization
Aka weight decay
Consider a model with the following total objective function: $\tilde{J}(w; X, y) = \frac{\alpha}{2}w^{T}w + J(w; X, y)$
with the corresponding parameter gradient $\nabla_{w}\tilde{J}(w; X, y) = \alpha w + \nabla_{w}J(w; X, y)$.
To take a single gradient step to update the weights, we perform this update: $w \leftarrow w - \epsilon(\alpha w + \nabla_{w}J(w; X, y))$
Written another way, the update is: $w \leftarrow (1-\epsilon\alpha)w - \epsilon\nabla_{w}J(w; X, y)$
We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step. We now examine what happens over the entire course of training.
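A small sketch of this update rule, showing the $(1-\epsilon\alpha)$ multiplicative shrink; the loss and constants are illustrative choices:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_J, eps=0.1, alpha=0.01):
    # w <- (1 - eps*alpha) * w - eps * grad_J(w): shrink, then take the usual step
    return (1 - eps * alpha) * w - eps * grad_J(w)

# Illustrative quadratic loss J(w) = 1/2 ||w - w_target||^2
w_target = np.array([2.0, -3.0])
grad_J = lambda w: w - w_target

w = np.zeros(2)
for _ in range(1000):
    w = sgd_step_with_weight_decay(w, grad_J)
print(w)   # converges to w_target / 1.01: pulled slightly toward zero
```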
L1 regularization
http://www.deeplearningbook.org/contents/regularization.html
Dataset Augmentation
Image classification, for example rotate the image
Injecting noise to the neural network: dropout (injecting noise to the hidden units)
Noise applied to the weights:
Injecting noise at the output Targets
Semi-supervised learning
- Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data).
- Intuitively, the learning problem can be seen as an exam and labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam.
- https://sakura-gh.github.io/ML-notes/ML-notes-html/15_Semi-supervised-Learning.html
Multitask learning
Early stopping
- The phenomenon that training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
- Consider the training time as a hyperparameter; the early stopping algorithm tunes this hyperparameter.
- Run the training process and periodically record the model's performance on the validation set; if the validation performance stops improving, we stop the training.
- Early stopping is a kind of regularization method because, for a simple linear model with a quadratic error function, it has mathematically the same effect as L2 regularization.
parameter sharing
- From knowledge of the domain and the model architecture, we may know that there should be some dependencies between the model parameters, so we force sets of parameters to be equal.
- A typical example is the CNN, where the same kernel parameters are used at every spatial location.
Dropout
- Dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks
- Intuition: cannot rely on any one feature, so have to spread out weights, resulting in shrinking weights like L2 norm.
- Frequently used in computer vision where input is huge.
- Dropout trains not just a bagged ensemble of models, but an ensemble of models that share hidden units. This means each hidden unit must be able to perform well regardless of which other hidden units are in the model. Hidden units must be prepared to be swapped and interchanged between models.
- The noise is multiplicative. If the noise were additive with fixed scale, then a relu hidden unit $h_i$ with added noise $\epsilon$ could simply learn to have $h_i$ become very large in order to make the added noise $\epsilon$ insignificant by comparison.
Sparse Representations
L1 penalization induces a sparse parametrization, meaning that many of the parameters become zero.
Representational sparsity, on the other hand, describes a representation where many of the elements of the representations are zero.
Just as an L1 penalty on the parameters induces parameter sparsity, an L1 penalty on the elements of the representation induces representational sparsity.
Bagging and other ensemble methods
Bagging is a technique for reducing generalization error by combining several models. The idea is to train several different models separately, then have all the models vote on the output for the test examples.
Consider, for example, a set of k regression models, each making an error $\epsilon_{i}$ on each example, with variances $E[\epsilon_{i}^{2}] = v$ and covariances $E[\epsilon_{i}\epsilon_{j}] = c$. The expected squared error of the ensemble average is $\frac{1}{k}v + \frac{k-1}{k}c$: when the errors are perfectly correlated ($c = v$), averaging does not help, but when they are uncorrelated ($c = 0$), the expected squared error shrinks linearly with k.
Adversarial Training
https://www.youtube.com/watch?v=CIfsB_EYsVI&t=1348s
You want to craft inputs on which the network performs very poorly (i.e., that fool the network). This turns out to be an easy task.
It is because attacking a linear model is simple.
Modern deep nets are very piecewise linear.
The fast gradient sign method is a very popular way to generate adversarial examples:
- $\tilde{x} = x + \epsilon\,\mathrm{sign}(\nabla_{x}J(\theta, x, y))$
- The sign of the gradient is used here: if $\partial J(x)/\partial x_{i}$ is positive, its value after sign() is 1 (see the sketch below).
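A minimal sketch of the fast gradient sign method for a logistic-regression "network" in numpy; the model, weights, and $\epsilon$ are illustrative assumptions (a real attack back-propagates through the full network):

```python
import numpy as np

# Illustrative logistic-regression "network": p(y=1|x) = sigmoid(w.x + b)
w, b = np.array([2.0, -3.0, 1.5]), 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, y = np.array([0.5, -0.2, 0.3]), 1            # a correctly classified example
# Cross-entropy loss J = -log p(y|x); its gradient w.r.t. x is (p - y) * w
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

eps = 0.25
x_adv = x + eps * np.sign(grad_x)               # fast gradient sign perturbation
print(p, sigmoid(w @ x_adv + b))                # confidence drops on the adversarial input
```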
Adversarial examples are not noise
Cross-technique transferability allows you to use adversarial examples developed on your own model to fool someone else's network.
If you impose this adversarial attack on autonomous vehicles, you can fool them into perceiving pedestrians as navigable road, which is extremely dangerous!
Training on adversarial examples can help.
Another technique is called virtual adversarial training.
Adversarial-example techniques also relate to the idea of a universal engineering machine (model-based optimization): we could make new inventions by finding inputs that maximize a model's predicted performance.
Conclusion:
- Attacking is easy.
- Defending is difficult.
Open-source library cleverhans:
Deep learning training tips
First test the result on the training data set.
deeper does not imply better
Vanishing gradient problem
near input:
- smaller gradients
- learn very slowly
- almost random
near output:
- larger gradients
- learn very fast
- already converge
use relu, not sigmoid
- ReLU: the network behaves like a thinner linear network, so the gradients do not shrink as they propagate
- sigmoid: vanishing gradient
- some variations of ReLU: leaky ReLU, parametric ReLU, maxout (a learnable activation function)
- maxout:
- the activation function in a maxout network can be any piecewise linear convex function
- the number of pieces depends on how many elements are in a group
- how to train: we effectively train a thin, linear network; different examples use different thin, linear networks
Adaptive learning rate
adagrad: we give a smaller learning rate when the historical gradients are large. The assumption: directions that were steep stay steep, and directions that were flat stay flat.
rmsprop:
Recently seen gradients get a larger weight, and gradients from the distant past get a smaller weight.
Hard to find optimal network parameters
- very slow at the plateau
- stuck at saddle point
- stuck at local minima
Momentum
Intuition from the physical world: how about putting the momentum phenomenon into gradient descent?
Momentum: the movement of the last step, minus the gradient at the present step (inertia)
Movement = negative of gradient + Momentum
Adam
Combining rmsprop and momentum
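A minimal sketch of the Adam-style update, combining a momentum-like moving average of the gradient with an RMSProp-like moving average of its square; the constants follow commonly used defaults and the toy objective is illustrative:

```python
import numpy as np

grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])   # gradient of an ill-conditioned quadratic

w = np.array([5.0, 5.0])
m = np.zeros(2)          # momentum: moving average of gradients
v = np.zeros(2)          # rmsprop: moving average of squared gradients
beta1, beta2, lr, delta = 0.9, 0.999, 0.1, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + delta)
print(w)   # has moved from (5, 5) to near the minimum at the origin
```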
Regularization
- new loss function to be minimized
- L1 regularization: always subtracts a fixed amount (can delete weights, driving them to zero)
- L2 regularization: always scales the weights down by a constant factor
Early stopping
Dropout
training: each time, before updating the parameters:
- each neuron has probability p% of being dropped out
Intuitive reason:
- when people team up, if everyone expects their partners to do the work, nothing gets done in the end.
- However, if you know your partner might drop out, you will do better.
- When testing, no one actually drops out, so we obtain even better results.
Why should the weights be multiplied by (1-p)% at testing time, where p% is the dropout rate?
Assume the dropout rate is 50%: during training each unit is present only half the time, so the expected input to a downstream unit is half of what it would be with all units active. Multiplying the weights by 50% at testing time keeps the expected input the same as during training (see the sketch below).
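A numeric sketch of the weight-scaling argument with a 50% dropout rate; the layer activations and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([1.0, 2.0, 3.0, 4.0])      # activations of a hidden layer
w = np.array([0.5, -1.0, 0.25, 2.0])    # weights into one downstream unit
p_drop = 0.5

# Expected pre-activation during training, averaging over random dropout masks
samples = []
for _ in range(20000):
    mask = rng.random(h.shape) > p_drop  # each unit kept with probability 1 - p_drop
    samples.append((mask * h) @ w)
print(np.mean(samples))                  # ~ (1 - p_drop) * (h @ w)

# Test time: keep every unit but multiply the weights by (1 - p_drop)
print(h @ (w * (1 - p_drop)))            # matches the training-time expectation
```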
Sequence Modeling : Recurrent and Recursive Nets
Unfolding computational graphs
Recurrent Neural Network
Some examples of important design patterns for RNN:
Recurrent networks that produce an output at each time step and have recurrent connections between hidden units.
Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step.
Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output.
Forward propagation equations for the RNN in figure 10.3
We assume the hyperbolic tangent activation function
Apply the softmax operation to obtain $\hat{y}$, which is a normalized probability distribution over the output.
Begin with a specification of the initial state $h^{(0)}$. Then, from t = 1 to t = $\tau$, we apply the following update equations: $a^{(t)} = b + Wh^{(t-1)} + Ux^{(t)}$, $h^{(t)} = \tanh(a^{(t)})$, $o^{(t)} = c + Vh^{(t)}$, $\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)})$.
This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values is then just the sum of the losses over all the time steps.
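A minimal numpy version of these forward equations; the random weights and the toy one-hot input sequence are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 4, 8, 3, 5
U = 0.1 * rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W = 0.1 * rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights
V = 0.1 * rng.normal(size=(n_out, n_hidden))     # hidden-to-output weights
b, c = np.zeros(n_hidden), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.eye(n_in)[rng.integers(0, n_in, size=tau)]  # toy one-hot input sequence
h = np.zeros(n_hidden)                              # initial state h^(0)
for t in range(tau):
    a = b + W @ h + U @ x[t]     # a^(t) = b + W h^(t-1) + U x^(t)
    h = np.tanh(a)               # h^(t)
    o = c + V @ h                # o^(t)
    y_hat = softmax(o)           # normalized probabilities over outputs
    print(t, y_hat)
```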
Procedure:
We start the recursion with the nodes immediately preceding the final loss: $\frac{\partial L}{\partial L^{(t)}} = 1$.
Then, for the gradient on the outputs at time step t (assuming a softmax output with negative log-likelihood loss), $(\nabla_{o^{(t)}}L)_{i} = \hat{y}_{i}^{(t)} - \mathbf{1}_{i=y^{(t)}}$.
Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes.
However, the $\nabla_{W}$ operator used in calculus takes into account the contribution of W to the value of f due to all edges in the computational graph. To resolve this ambiguity, we introduce dummy variables $W^{(t)}$ that are defined to be copies of W, with $W^{(t)}$ used only at time step t.
Teacher Forcing and Networks with Output Recurrence: basically used for figure 10.4
The network with recurrent connections only from the output (figure 10.4) is less powerful:
- Because it lacks hidden-to-hidden recurrent connections.
- it requires that the output units capture all of the information about the past that the network will use to predict the future.
- Because the output units are trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input.
The advantage of eliminating hidden-to-hidden recurrence is that
- For any loss function based on comparing the prediction at time t to the training target at time t, all the time steps are decoupled.
- Training can then be parallelized, with the gradient for each step t computed in isolation.
- There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.
Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.
When training, use the ground-truth $y^{(t)}$ from the training set, rather than the model's output $o^{(t)}$, as the value fed back at the next step.
When testing, use the model's own output $o^{(t)}$.
Teacher forcing is originally motivated by avoiding back-propagation through time.
- The disadvantage of teacher forcing arises when the model is deployed: since we then feed the model's own outputs back in, the network sees quite different inputs from those it saw at training time. One way to mitigate the problem is to train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths.
Recurrent Networks as directed graphical models
- Let us consider a sequence of data points Y. The joint distribution of these observations is $P(\mathbb{Y}) = P(y^{(1)},\ldots,y^{(\tau)}) = \prod_{t=1}^{\tau}P(y^{(t)}\mid y^{(t-1)},\ldots,y^{(1)})$.
- The negative log-likelihood of a set of values according to such a model is $L = \sum_{t}L^{(t)}$, where $L^{(t)} = -\log P(y^{(t)}\mid y^{(t-1)},\ldots,y^{(1)})$.
- It is common to make the Markov assumption that the graphical model should only contain edges from $\{y^{(t-k)},\ldots,y^{(t-1)}\}$ to $y^{(t)}$; however, in some cases, we believe that the distribution over $y^{(t)}$ depends on a value of $y^{(i)}$ from the distant past.
- Let us consider the case where the RNN models only a sequence of scalar random variables Y , with no additional inputs x. The RNN then defines a directed graphical model over the y variables.
- To model such relationship, we do feed the actual y values back into the network, so the directed graphical model contains edges from all $y^{(i)}$ values in the past to the current $y^{(t)}$ value.
One way to interpret an RNN as a graphical model is to view the RNN as a complete graph, able to represent direct dependencies between any pair of y.
If we consider the graphical model structure of an RNN that results from regarding the hidden units $h^{(t)}$ as random variables, we see that the RNN is actually a very efficient parametrization of the joint distribution.
The price RNNs pay for their reduced number of parameters is that optimizing the parameters may be difficult.
To complete the view of an RNN as a graphical model, we must describe how to draw samples from the model
- We need to sample from the conditional distribution at each time step
- The RNN must have some mechanism for determining the length of the sequence; some options are listed below:
- By adding a special symbol corresponding to the end of a sequence. When the symbol is generated, the sampling process stops
- Introduce an extra Bernoulli output to the model that represents the decision to either continue generation or halt generation at each time step
- Add an extra output to the model that predicts the integer $\tau$ itself.
Modeling Sequences Conditioned on Context with RNNs
- Previously, we have discussed RNNs that take a sequence of vectors $x^{(t)}$ for $t = 1,\ldots,\tau$ as input. Another option is to take only a single vector x as input. A first and most common approach is illustrated in figure 10.9
- Rather than receiving only a single vector x as input, the RNN may receive a sequence of vectors $x^{(t)}$ as input. The RNN described in equation 10.8 corresponds to a conditional distribution $P(y^{(1)},\ldots,y^{(\tau)}\mid x^{(1)},\ldots,x^{(\tau)})$ that makes a conditional independence assumption, namely that this distribution factorizes as $\prod_{t}P(y^{(t)}\mid x^{(1)},\ldots,x^{(t)})$.
To remove the conditional independence assumption, we can add connections from the output at time t to the hidden unit at time t+1, as shown in figure 10.10. The model can then represent arbitrary probability distributions over the y sequence.
- The model described in Figure 10.10 still has one restriction, which is that the length of both sequences must be the same.
Bidirectional RNNs
Bidirectional RNNs combine an RNN that moves forward through time beginning from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence. See Figure 10.11
- The idea can be naturally extended to two-dimensional input, such as images, by having four RNNs, each going in one of the four directions: up, down, left, right.
- Application: speech recognition, handwriting recognition, bioinformatics
Encoder-decoder sequence-to-sequence architectures
How RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length.
The idea is very simple
- An encoder or reader or input RNN processes the input sequence. The encoder emits the context C
- A decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence Y
- The two RNNs are trained jointly to maximize the average of $\log P(y^{(1)},\ldots,y^{(n_{y})}\mid x^{(1)},\ldots,x^{(n_{x})})$ over all the pairs of x and y sequences in the training set
One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence.
To mitigate the above problem, one can use an attention mechanism that learns to associate elements of the sequence C with elements of the output sequence.
Deep Recurrent Networks
The computation in an RNN can be decomposed into three blocks of parameters: input-to-hidden, hidden-to-hidden, and hidden-to-output. Experimental evidence strongly suggests introducing depth in each of these operations.
Recursive Neural Networks
The Challenge of Long-Term Dependencies
- The vanishing and exploding gradient problem: consider the repeated linear map $h^{(t)} = W^{T}h^{(t-1)}$, so that $h^{(t)} = (W^{t})^{T}h^{(0)}$. If W has an eigendecomposition $W = Q\Lambda Q^{T}$ with orthogonal Q, then $h^{(t)} = Q^{T}\Lambda^{t}Qh^{(0)}$: eigenvalues with magnitude less than one decay to zero and eigenvalues with magnitude greater than one explode, so any component of $h^{(0)}$ that is not aligned with the largest eigenvector will eventually be discarded.
Echo State Networks
Leaky Units and Other Strategies for Multiple Time scales
The LSTM and other gated RNNs
LSTM:
Optimization for Long-Term Dependencies
Clipping Gradients
One option is to clip the parameter gradient from a minibatch element-wise just before the parameter update. Another is to clip the norm of the gradient g just before the parameter update (see the sketch below).
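A short sketch of both options; the threshold value is an arbitrary illustrative choice:

```python
import numpy as np

def clip_elementwise(g, threshold=1.0):
    # Option 1: clip each component of the gradient independently
    return np.clip(g, -threshold, threshold)

def clip_by_norm(g, threshold=1.0):
    # Option 2: rescale the whole gradient if its norm exceeds the threshold,
    # which preserves the direction of the gradient
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)

g = np.array([0.3, -5.0, 2.0])          # an "exploding" minibatch gradient
print(clip_elementwise(g))              # [ 0.3 -1.   1. ]
print(clip_by_norm(g))                  # same direction, norm 1
```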
Explicit Memory
- Neural networks excel at storing implicit knowledge; however, they struggle to memorize facts.
- Implicit knowledge: how to walk, or how a dog looks different from a cat
- Explicit knowledge: I have a meeting with Susan at 11:00 am on Sunday
Applications
- Handwriting recognition using bidirectional LSTMs: data is collected from an interactive whiteboard, with sensors recording the (x, y) coordinates of the pen at regularly sampled time steps.