
Machine Learning: Linear regression with one variable

Posted by Evictor

Model representation

Notation:

  • m = number of training examples
  • x's = "input" variable / features
  • y's = "output" variable / "target" variable
  • (x, y) = one training example
  • $(x^{(i)}, y^{(i)})$ = the i-th training example
  • $h$ = hypothesis, the function produced by the learning algorithm that maps from an input x to a predicted y
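As a quick illustration of the notation, here is a minimal Python sketch with a made-up toy dataset (the values are placeholders, not data from this post):

```python
# Toy training set -- the values are illustrative only.
x = [1.0, 2.0, 3.0, 4.0]        # x's: "input" variable / feature
y = [1.5, 3.1, 4.4, 6.2]        # y's: "output" / "target" variable

m = len(x)                      # m = number of training examples (4 here)
i = 2                           # the course notation indexes examples from 1
x_i, y_i = x[i - 1], y[i - 1]   # (x^(i), y^(i)) = the i-th training example
print(m, (x_i, y_i))            # -> 4 (2.0, 3.1)
```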

Cost function

  • Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
  • $\theta_0, \theta_1$: parameters; but how do we choose them?
  • Minimize the modeling error, i.e. the cost function (a NumPy sketch follows this list):
  • $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, with the goal $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
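A minimal NumPy sketch of the hypothesis and the squared-error cost above (the function names and toy data are mine, for illustration only):

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x (univariate linear regression)."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1 / (2m)) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(x)
    errors = hypothesis(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# Toy data (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 3.1, 4.4, 6.2])
print(cost(0.0, 1.5, x, y))   # J for one particular choice of theta0, theta1
```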

Gradient descent

  • Repeat until convergence: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (update $j = 0$ and $j = 1$ simultaneously)
  • $\alpha$ = learning rate
  • Batch gradient descent: each step of gradient descent uses all the training examples.
  • Stochastic gradient descent (SGD): use one example in each iteration.
  • Mini-batch gradient descent: use a small subset of the examples in each iteration (a NumPy sketch of these variants follows this list).
  • Momentum:
    Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
  • AdaGrad:
    AdaGrad is an algorithm for gradient-based optimization that adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. For this reason, it is well-suited for dealing with sparse data.
  • Adam:
    Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_t$, like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum (the per-step updates are sketched below).
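The update rule and the batch / stochastic / mini-batch variants can be sketched in a few lines of NumPy. This reuses the toy data from the cost-function sketch; the parameter defaults and function names are mine, not from the post:

```python
import numpy as np

def gradients(theta0, theta1, x, y):
    """Partial derivatives of J(theta0, theta1) for the squared-error cost."""
    m = len(x)
    errors = theta0 + theta1 * x - y            # h_theta(x^(i)) - y^(i)
    return np.sum(errors) / m, np.sum(errors * x) / m

def gradient_descent(x, y, alpha=0.01, iters=1000, batch_size=None, seed=0):
    """batch_size=None -> batch GD, 1 -> SGD, k -> mini-batch GD."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iters):
        if batch_size is None:                  # batch: use all m examples
            xb, yb = x, y
        else:                                   # SGD / mini-batch: sample a subset
            idx = rng.choice(m, size=batch_size, replace=False)
            xb, yb = x[idx], y[idx]
        d0, d1 = gradients(theta0, theta1, xb, yb)
        # Simultaneous update: theta_j := theta_j - alpha * dJ/dtheta_j
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 3.1, 4.4, 6.2])
print(gradient_descent(x, y))                   # batch gradient descent
print(gradient_descent(x, y, batch_size=1))     # stochastic gradient descent
print(gradient_descent(x, y, batch_size=2))     # mini-batch gradient descent
```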
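For the momentum, AdaGrad, and Adam descriptions above, the per-step updates look roughly like this: a sketch of the standard update rules applied to a generic gradient, with the usual default hyperparameters (not values specified in this post):

```python
import numpy as np

def momentum_step(theta, grad, v, alpha=0.01, gamma=0.9):
    """Momentum: accumulate a velocity term that dampens oscillations."""
    v = gamma * v + alpha * grad
    return theta - v, v

def adagrad_step(theta, grad, G, alpha=0.01, eps=1e-8):
    """AdaGrad: per-parameter rate shrinks with the accumulated squared gradient."""
    G = G + grad ** 2
    return theta - alpha * grad / (np.sqrt(G) + eps), G

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decaying averages of past gradients (m_t) and squared gradients (v_t)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v
```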