Base Optimizer class
tflearn.optimizers.Optimizer (learning_rate, use_locking, name)
A basic class to create optimizers to be used with TFLearn estimators.
First, the Optimizer class is initialized with the given parameters, but no Tensor is created. In a second step, invoking the get_tensor method will actually build the TensorFlow Optimizer and return it.
This way, a user can easily specify an optimizer with non-default parameters and learning rate decay, while TFLearn estimators build the optimizer and a step tensor by themselves.
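The two-step pattern can be sketched in plain Python (the class and attribute names below are illustrative only, not actual TFLearn internals):

```python
class DeferredOptimizer:
    """Holds hyperparameters at construction time; builds the real
    optimizer object only when it is first requested."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate
        self.tensor = None        # no Tensor created yet
        self.built = False

    def build(self, step_tensor=None):
        # In TFLearn, this is where the actual tf.train optimizer would
        # be created; a dict stands in for it here.
        self.tensor = {"lr": self.learning_rate, "step": step_tensor}
        self.built = True

    def get_tensor(self):
        # Build lazily on first access.
        if not self.built:
            self.build()
        return self.tensor


opt = DeferredOptimizer(learning_rate=0.01)
tensor = opt.get_tensor()     # the optimizer is only built here
```

An estimator that owns a step variable can instead call `build(step_tensor)` itself before reading `tensor`, which is the hook that learning rate decay needs.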
Arguments
- learning_rate: `float`. Learning rate.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. The optimizer name.
Attributes
- tensor: `Optimizer`. The optimizer tensor.
- has_decay: `bool`. True if the optimizer has a learning rate decay.
Methods
build (step_tensor=None)
This method creates the optimizer with the specified parameters. It must be implemented for every Optimizer subclass.
Arguments
- step_tensor: `tf.Tensor`. A variable holding the training step. Only necessary when the optimizer has a learning rate decay.
get_tensor ()
A method to retrieve the optimizer tensor.
Returns
The `Optimizer` tensor.
Stochastic Gradient Descent
tflearn.optimizers.SGD (learning_rate=0.001, lr_decay=0.0, decay_step=100, staircase=False, use_locking=False, name='SGD')
SGD Optimizer accepts learning rate decay. When training a model, it is often recommended to lower the learning rate as the training progresses. The function returns the decayed learning rate. It is computed as:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
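The schedule can be verified numerically in plain Python (no TensorFlow required; the `staircase` variant, described under Arguments, uses integer division):

```python
def decayed_lr(learning_rate, decay_rate, global_step, decay_steps,
               staircase=False):
    # staircase=True decays at discrete intervals (integer division);
    # otherwise the learning rate decays smoothly every step.
    if staircase:
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return learning_rate * decay_rate ** exponent

print(decayed_lr(0.01, 0.96, 100, 100))                  # one full decay period
print(decayed_lr(0.01, 0.96, 150, 100, staircase=True))  # still one discrete step
```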
Examples
# With TFLearn estimators.
sgd = SGD(learning_rate=0.01, lr_decay=0.96, decay_step=100)
regression = regression(net, optimizer=sgd)
# Without TFLearn estimators (returns tf.Optimizer).
sgd = SGD(learning_rate=0.01).get_tensor()
Arguments
- learning_rate: `float`. Learning rate.
- lr_decay: `float`. The learning rate decay to apply.
- decay_step: `int`. Apply decay every provided steps.
- staircase: `bool`. If True decay the learning rate at discrete intervals.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. Optional name prefix for the operations created when applying gradients. Defaults to "GradientDescent".
RMSprop
tflearn.optimizers.RMSProp (learning_rate=0.001, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, name='RMSProp')
Maintain a moving (discounted) average of the square of gradients. Divide gradient by the root of this average.
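As a single-parameter sketch of that update (pure Python and purely illustrative; the exact epsilon placement inside the TensorFlow kernel may differ slightly):

```python
def rmsprop_step(w, grad, avg_sq, learning_rate=0.001, decay=0.9,
                 epsilon=1e-10):
    # Discounted moving average of squared gradients...
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    # ...then divide the gradient by the root of this average.
    w = w - learning_rate * grad / (avg_sq ** 0.5 + epsilon)
    return w, avg_sq

w, avg_sq = 1.0, 0.0
for _ in range(3):
    grad = 2.0 * w                     # gradient of f(w) = w**2
    w, avg_sq = rmsprop_step(w, grad, avg_sq, learning_rate=0.1)
```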
Examples
# With TFLearn estimators.
rmsprop = RMSProp(learning_rate=0.1, decay=0.999)
regression = regression(net, optimizer=rmsprop)
# Without TFLearn estimators (returns tf.Optimizer).
rmsprop = RMSProp(learning_rate=0.01, decay=0.999).get_tensor()
# or
rmsprop = RMSProp(learning_rate=0.01, decay=0.999)()
Arguments
- learning_rate: `float`. Learning rate.
- decay: `float`. Discounting factor for the history/coming gradient.
- momentum: `float`. Momentum.
- epsilon: `float`. Small value to avoid zero denominator.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. Optional name prefix for the operations created when applying gradients. Defaults to "RMSProp".
Adam
tflearn.optimizers.Adam (learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
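For reference, the Adam update for a single parameter can be sketched as follows (pure Python, illustrative only; epsilon is added after the bias-corrected root, as in the paper):

```python
def adam_step(w, grad, m, v, t, learning_rate=0.001, beta1=0.9,
              beta2=0.999, epsilon=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moments (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return w - learning_rate * m_hat / (v_hat ** 0.5 + epsilon), m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2.0 * w                     # gradient of f(w) = w**2
    w, m, v = adam_step(w, grad, m, v, t, learning_rate=0.1)
```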
Examples
# With TFLearn estimators
adam = Adam(learning_rate=0.001, beta1=0.99)
regression = regression(net, optimizer=adam)
# Without TFLearn estimators (returns tf.Optimizer)
adam = Adam(learning_rate=0.01).get_tensor()
Arguments
- learning_rate: `float`. Learning rate.
- beta1: `float`. The exponential decay rate for the 1st moment estimates.
- beta2: `float`. The exponential decay rate for the 2nd moment estimates.
- epsilon: `float`. A small constant for numerical stability.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. Optional name prefix for the operations created when applying gradients. Defaults to "Adam".
References
Adam: A Method for Stochastic Optimization. Diederik Kingma, Jimmy Ba. ICLR 2015.
Momentum
tflearn.optimizers.Momentum (learning_rate=0.001, momentum=0.9, lr_decay=0.0, decay_step=100, staircase=False, use_locking=False, name='Momentum')
Momentum Optimizer accepts learning rate decay. When training a model, it is often recommended to lower the learning rate as the training progresses. The function returns the decayed learning rate. It is computed as:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
Examples
# With TFLearn estimators
momentum = Momentum(learning_rate=0.01, lr_decay=0.96, decay_step=100)
regression = regression(net, optimizer=momentum)
# Without TFLearn estimators (returns tf.Optimizer)
mm = Momentum(learning_rate=0.01, lr_decay=0.96).get_tensor()
Arguments
- learning_rate: `float`. Learning rate.
- momentum: `float`. Momentum.
- lr_decay: `float`. The learning rate decay to apply.
- decay_step: `int`. Apply decay every provided steps.
- staircase: `bool`. If True decay the learning rate at discrete intervals.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. Optional name prefix for the operations created when applying gradients. Defaults to "Momentum".
AdaGrad
tflearn.optimizers.AdaGrad (learning_rate=0.001, initial_accumulator_value=0.1, use_locking=False, name='AdaGrad')
Examples
# With TFLearn estimators
adagrad = AdaGrad(learning_rate=0.01, initial_accumulator_value=0.01)
regression = regression(net, optimizer=adagrad)
# Without TFLearn estimators (returns tf.Optimizer)
adagrad = AdaGrad(learning_rate=0.01).get_tensor()
Arguments
- learning_rate: `float`. Learning rate.
- initial_accumulator_value: `float`. Starting value for the accumulators, must be positive.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. Optional name prefix for the operations created when applying gradients. Defaults to "AdaGrad".
References
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Duchi, E. Hazan & Y. Singer. Journal of Machine Learning Research 12 (2011) 2121-2159.
Ftrl Proximal
tflearn.optimizers.Ftrl (learning_rate=3.0, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl')
The Ftrl-proximal algorithm, short for Follow-the-Regularized-Leader, is described in the paper below. It can give a good performance vs. sparsity tradeoff.
Ftrl-proximal uses its own global base learning rate and can behave like AdaGrad with learning_rate_power=-0.5, or like gradient descent with learning_rate_power=0.0.
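The role of learning_rate_power can be seen from the per-coordinate step-size scaling alone (a deliberate simplification of the full FTRL-proximal update; `accum` stands for one coordinate's accumulated squared gradients):

```python
def per_coordinate_lr(base_lr, accum, learning_rate_power):
    # The effective step size scales with the accumulator raised to
    # learning_rate_power.
    return base_lr * accum ** learning_rate_power

# power = -0.5: steps shrink like 1/sqrt(accum), as in AdaGrad.
print(per_coordinate_lr(3.0, 100.0, -0.5))
# power = 0.0: the accumulator has no effect, as in plain gradient descent.
print(per_coordinate_lr(3.0, 100.0, 0.0))
```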
Examples
# With TFLearn estimators.
ftrl = Ftrl(learning_rate=0.01, learning_rate_power=-0.1)
regression = regression(net, optimizer=ftrl)
# Without TFLearn estimators (returns tf.Optimizer).
ftrl = Ftrl(learning_rate=0.01).get_tensor()
Arguments
- learning_rate: `float`. Learning rate.
- learning_rate_power: `float`. Must be less than or equal to zero.
- initial_accumulator_value: `float`. The starting value for accumulators. Only positive values are allowed.
- l1_regularization_strength: `float`. Must be greater than or equal to zero.
- l2_regularization_strength: `float`. Must be greater than or equal to zero.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. Optional name prefix for the operations created when applying gradients. Defaults to "Ftrl".
Links
Ad Click Prediction: a View from the Trenches
AdaDelta
tflearn.optimizers.AdaDelta (learning_rate=0.001, rho=0.1, epsilon=1e-08, use_locking=False, name='AdaDelta')
Construct a new Adadelta optimizer.
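A single-parameter sketch of the update described in the paper below (pure Python, illustrative only; defaults mirror the signature above):

```python
def adadelta_step(w, grad, avg_sq_grad, avg_sq_delta, rho=0.1, epsilon=1e-8):
    # Decayed average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Step scaled by RMS(previous updates) / RMS(gradients).
    delta = -(((avg_sq_delta + epsilon) ** 0.5)
              / ((avg_sq_grad + epsilon) ** 0.5)) * grad
    # Decayed average of squared updates.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2
    return w + delta, avg_sq_grad, avg_sq_delta

w, g2, d2 = 1.0, 0.0, 0.0
for _ in range(5):
    grad = 2.0 * w                     # gradient of f(w) = w**2
    w, g2, d2 = adadelta_step(w, grad, g2, d2)
```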
Arguments
- learning_rate: A `Tensor` or a floating point value. The learning rate.
- rho: A `Tensor` or a floating point value. The decay rate.
- epsilon: A `Tensor` or a floating point value. A constant epsilon used to better condition the grad update.
- use_locking: If `True` use locks for update operations.
- name: Optional name prefix for the operations created when applying gradients. Defaults to "Adadelta".
References
ADADELTA: An Adaptive Learning Rate Method, Matthew D. Zeiler, 2012.
Links
http://arxiv.org/abs/1212.5701
ProximalAdaGrad
tflearn.optimizers.ProximalAdaGrad (learning_rate=0.001, initial_accumulator_value=0.1, use_locking=False, name='AdaGrad')
Examples
# With TFLearn estimators
proxi_adagrad = ProximalAdaGrad(learning_rate=0.01, l2_regularization_strength=0.01, initial_accumulator_value=0.01)
regression = regression(net, optimizer=proxi_adagrad)
# Without TFLearn estimators (returns tf.Optimizer)
adagrad = ProximalAdaGrad(learning_rate=0.01).get_tensor()
Arguments
- learning_rate: `float`. Learning rate.
- initial_accumulator_value: `float`. Starting value for the accumulators, must be positive.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. Optional name prefix for the operations created when applying gradients. Defaults to "AdaGrad".
References
Efficient Learning using Forward-Backward Splitting. J. Duchi, Yoram Singer, 2009.
Nesterov
tflearn.optimizers.Nesterov (learning_rate=0.001, momentum=0.9, lr_decay=0.0, decay_step=100, staircase=False, use_locking=False, name='Nesterov')
The main difference between classical momentum and Nesterov momentum is this: in classical momentum you first correct the velocity and then make a big step according to that velocity (and then repeat), while in Nesterov momentum you first make a step in the velocity direction and then correct the velocity vector based on the new location (then repeat). See Sutskever et al., 2013.
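The two update rules can be compared on a one-dimensional quadratic (pure Python, illustrative; the only difference is where `grad_fn` is evaluated):

```python
def classical_momentum_step(w, v, grad_fn, lr=0.01, mu=0.9):
    # Correct the velocity with the gradient at the CURRENT point, then step.
    v = mu * v - lr * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    # Step in the velocity direction first, then correct the velocity with
    # the gradient at the LOOK-AHEAD point.
    v = mu * v - lr * grad_fn(w + mu * v)
    return w + v, v

grad_fn = lambda w: 2.0 * w            # gradient of f(w) = w**2
wc, vc = 1.0, 0.0                      # classical momentum state
wn, vn = 1.0, 0.0                      # Nesterov momentum state
for _ in range(10):
    wc, vc = classical_momentum_step(wc, vc, grad_fn)
    wn, vn = nesterov_step(wn, vn, grad_fn)
```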
Examples
# With TFLearn estimators
nesterov = Nesterov(learning_rate=0.01, lr_decay=0.96, decay_step=100)
regression = regression(net, optimizer=nesterov)
# Without TFLearn estimators (returns tf.Optimizer)
nesterov = Nesterov(learning_rate=0.01, lr_decay=0.96).get_tensor()
Arguments
- learning_rate: `float`. Learning rate.
- momentum: `float`. Momentum.
- lr_decay: `float`. The learning rate decay to apply.
- decay_step: `int`. Apply decay every provided steps.
- staircase: `bool`. If True decay the learning rate at discrete intervals.
- use_locking: `bool`. If True use locks for update operation.
- name: `str`. Optional name prefix for the operations created when applying gradients. Defaults to "Momentum".