Base Optimizer class

tflearn.optimizers.Optimizer (learning_rate, use_locking, name)

A base class for creating optimizers to be used with TFLearn estimators. First, the Optimizer class is initialized with the given parameters, but no tensor is created. In a second step, invoking the get_tensor method actually builds the TensorFlow optimizer tensor and returns it.

This way, a user can easily specify an optimizer with non-default parameters and learning rate decay, while TFLearn estimators will build the optimizer and a step tensor by themselves.

Arguments

  • learning_rate: float. Learning rate.
  • use_locking: bool. If True use locks for update operation.
  • name: str. The optimizer name.

Attributes

  • tensor: Optimizer. The optimizer tensor.
  • has_decay: bool. True if optimizer has a learning rate decay.

Methods

build (step_tensor=None)

This method creates the optimizer with specified parameters. It must be implemented for every Optimizer.

Arguments
  • step_tensor: tf.Tensor. A variable holding the training step. Only necessary when optimizer has a learning rate decay.

get_tensor ()

A method to retrieve the optimizer tensor.

Returns

The optimizer tensor.
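The two-step pattern above (configure at construction, build on demand) can be sketched in plain Python. The class below is purely illustrative, not TFLearn's actual implementation; the tuple standing in for the TensorFlow optimizer tensor is a placeholder.

```python
# Illustrative sketch of the Optimizer lifecycle: __init__ stores the
# configuration, build() creates the "tensor", get_tensor() builds lazily.
class LazyOptimizer:
    def __init__(self, learning_rate, use_locking=False, name="Optimizer"):
        self.learning_rate = learning_rate
        self.use_locking = use_locking
        self.name = name
        self.tensor = None        # not built yet
        self.has_decay = False

    def build(self, step_tensor=None):
        # In TFLearn this step would create the real tf.train optimizer;
        # a tuple stands in for it here.
        self.tensor = ("optimizer", self.name, self.learning_rate)

    def get_tensor(self):
        if self.tensor is None:   # build on first access
            self.build()
        return self.tensor

opt = LazyOptimizer(learning_rate=0.01, name="SGD")
print(opt.get_tensor())           # built only when requested
```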


Stochastic Gradient Descent

tflearn.optimizers.SGD (learning_rate=0.001, lr_decay=0.0, decay_step=100, staircase=False, use_locking=False, name='SGD')

The SGD optimizer supports learning rate decay. When training a model, it is often recommended to lower the learning rate as training progresses. With decay enabled, the decayed learning rate is computed as:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)

Examples

# With TFLearn estimators.
sgd = SGD(learning_rate=0.01, lr_decay=0.96, decay_step=100)
regression = regression(net, optimizer=sgd)

# Without TFLearn estimators (returns tf.Optimizer).
sgd = SGD(learning_rate=0.01).get_tensor()

Arguments

  • learning_rate: float. Learning rate.
  • lr_decay: float. The learning rate decay to apply.
  • decay_step: int. Apply decay every provided steps.
  • staircase: bool. If True decay learning rate at discrete intervals.
  • use_locking: bool. If True use locks for update operation.
  • name: str. Optional name prefix for the operations created when applying gradients. Defaults to "GradientDescent".

RMSprop

tflearn.optimizers.RMSProp (learning_rate=0.001, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, name='RMSProp')

Maintains a moving (discounted) average of the square of gradients, and divides the gradient by the root of this average.

Examples

# With TFLearn estimators.
rmsprop = RMSProp(learning_rate=0.1, decay=0.999)
regression = regression(net, optimizer=rmsprop)

# Without TFLearn estimators (returns tf.Optimizer).
rmsprop = RMSProp(learning_rate=0.01, decay=0.999).get_tensor()
# or
rmsprop = RMSProp(learning_rate=0.01, decay=0.999)()

Arguments

  • learning_rate: float. Learning rate.
  • decay: float. Discounting factor for the history/coming gradient.
  • momentum: float. Momentum.
  • epsilon: float. Small value to avoid zero denominator.
  • use_locking: bool. If True use locks for update operation.
  • name: str. Optional name prefix for the operations created when applying gradients. Defaults to "RMSProp".

Adam

tflearn.optimizers.Adam (learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')

The default value of 1e-8 for epsilon might not be a good default in general; for example, when training an Inception network on ImageNet, a current good choice is 1.0 or 0.1.

Examples

# With TFLearn estimators
adam = Adam(learning_rate=0.001, beta1=0.99)
regression = regression(net, optimizer=adam)

# Without TFLearn estimators (returns tf.Optimizer)
adam = Adam(learning_rate=0.01).get_tensor()

Arguments

  • learning_rate: float. Learning rate.
  • beta1: float. The exponential decay rate for the 1st moment estimates.
  • beta2: float. The exponential decay rate for the 2nd moment estimates.
  • epsilon: float. A small constant for numerical stability.
  • use_locking: bool. If True use locks for update operation.
  • name: str. Optional name prefix for the operations created when applying gradients. Defaults to "Adam".

References

Adam: A Method for Stochastic Optimization. Diederik Kingma, Jimmy Ba. ICLR 2015.

Links

Paper


Momentum

tflearn.optimizers.Momentum (learning_rate=0.001, momentum=0.9, lr_decay=0.0, decay_step=100, staircase=False, use_locking=False, name='Momentum')

The Momentum optimizer supports learning rate decay. When training a model, it is often recommended to lower the learning rate as training progresses. With decay enabled, the decayed learning rate is computed as:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)

Examples

# With TFLearn estimators
momentum = Momentum(learning_rate=0.01, lr_decay=0.96, decay_step=100)
regression = regression(net, optimizer=momentum)

# Without TFLearn estimators (returns tf.Optimizer)
mm = Momentum(learning_rate=0.01, lr_decay=0.96).get_tensor()

Arguments

  • learning_rate: float. Learning rate.
  • momentum: float. Momentum.
  • lr_decay: float. The learning rate decay to apply.
  • decay_step: int. Apply decay every provided steps.
  • staircase: bool. If True decay learning rate at discrete intervals.
  • use_locking: bool. If True use locks for update operation.
  • name: str. Optional name prefix for the operations created when applying gradients. Defaults to "Momentum".

AdaGrad

tflearn.optimizers.AdaGrad (learning_rate=0.001, initial_accumulator_value=0.1, use_locking=False, name='AdaGrad')

Examples

# With TFLearn estimators
adagrad = AdaGrad(learning_rate=0.01, initial_accumulator_value=0.01)
regression = regression(net, optimizer=adagrad)

# Without TFLearn estimators (returns tf.Optimizer)
adagrad = AdaGrad(learning_rate=0.01).get_tensor()

Arguments

  • learning_rate: float. Learning rate.
  • initial_accumulator_value: float. Starting value for the accumulators; must be positive.
  • use_locking: bool. If True use locks for update operation.
  • name: str. Optional name prefix for the operations created when applying gradients. Defaults to "AdaGrad".

References

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Duchi, E. Hazan & Y. Singer. Journal of Machine Learning Research 12 (2011) 2121-2159.

Links

Paper


Ftrl Proximal

tflearn.optimizers.Ftrl (learning_rate=3.0, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, use_locking=False, name='Ftrl')

The Ftrl-proximal algorithm, short for Follow-the-Regularized-Leader, is described in the paper below.

It can give a good performance vs. sparsity tradeoff.

Ftrl-proximal uses its own global base learning rate and can behave like Adagrad with learning_rate_power=-0.5, or like gradient descent with learning_rate_power=0.0.

Examples

# With TFLearn estimators.
ftrl = Ftrl(learning_rate=0.01, learning_rate_power=-0.1)
regression = regression(net, optimizer=ftrl)

# Without TFLearn estimators (returns tf.Optimizer).
ftrl = Ftrl(learning_rate=0.01).get_tensor()

Arguments

  • learning_rate: float. Learning rate.
  • learning_rate_power: float. Must be less than or equal to zero.
  • initial_accumulator_value: float. The starting value for accumulators. Only positive values are allowed.
  • l1_regularization_strength: float. Must be greater than or equal to zero.
  • l2_regularization_strength: float. Must be greater than or equal to zero.
  • use_locking: bool. If True use locks for update operation.
  • name: str. Optional name prefix for the operations created when applying gradients. Defaults to "Ftrl".

Links

Ad Click Prediction: a View from the Trenches


AdaDelta

tflearn.optimizers.AdaDelta (learning_rate=0.001, rho=0.1, epsilon=1e-08, use_locking=False, name='AdaDelta')

Construct a new Adadelta optimizer.

Arguments

  • learning_rate: A Tensor or a floating point value. The learning rate.
  • rho: A Tensor or a floating point value. The decay rate.
  • epsilon: A Tensor or a floating point value. A constant epsilon used to better condition the grad update.
  • use_locking: If True use locks for update operations.
  • name: Optional name prefix for the operations created when applying gradients. Defaults to "Adadelta".
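For intuition, the accumulator updates from the paper can be sketched in pure Python on the toy objective f(x) = x². This is a minimal sketch of the algorithm, not TFLearn's implementation; the function name and parameter values are illustrative.

```python
import math

def adadelta_minimize(x, rho=0.9, epsilon=1e-6, steps=200):
    """Minimize f(x) = x**2 with toy AdaDelta updates (Zeiler, 2012)."""
    acc_grad, acc_update = 0.0, 0.0
    for _ in range(steps):
        g = 2.0 * x                                   # gradient of x**2
        # Decaying average of squared gradients.
        acc_grad = rho * acc_grad + (1 - rho) * g * g
        # Step scaled by the ratio of RMS(updates) to RMS(gradients).
        step = -(math.sqrt(acc_update + epsilon)
                 / math.sqrt(acc_grad + epsilon)) * g
        # Decaying average of squared updates.
        acc_update = rho * acc_update + (1 - rho) * step * step
        x += step
    return x

print(abs(adadelta_minimize(1.0)) < 1.0)  # the iterate moves toward 0
```

Note how, unlike the other optimizers here, the effective step size is driven by the two accumulators rather than directly by a global learning rate.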

References

ADADELTA: An Adaptive Learning Rate Method, Matthew D. Zeiler, 2012.

Links

http://arxiv.org/abs/1212.5701


ProximalAdaGrad

tflearn.optimizers.ProximalAdaGrad (learning_rate=0.001, initial_accumulator_value=0.1, use_locking=False, name='AdaGrad')

Examples

# With TFLearn estimators
proxi_adagrad = ProximalAdaGrad(learning_rate=0.01, l2_regularization_strength=0.01, initial_accumulator_value=0.01)
regression = regression(net, optimizer=proxi_adagrad)

# Without TFLearn estimators (returns tf.Optimizer)
adagrad = ProximalAdaGrad(learning_rate=0.01).get_tensor()

Arguments

  • learning_rate: float. Learning rate.
  • initial_accumulator_value: float. Starting value for the accumulators; must be positive.
  • use_locking: bool. If True use locks for update operation.
  • name: str. Optional name prefix for the operations created when applying gradients. Defaults to "AdaGrad".

References

Efficient Learning using Forward-Backward Splitting. J. Duchi, Yoram Singer, 2009.

Links

Paper


Nesterov

tflearn.optimizers.Nesterov (learning_rate=0.001, momentum=0.9, lr_decay=0.0, decay_step=100, staircase=False, use_locking=False, name='Nesterov')

The main difference between classical momentum and Nesterov momentum is: in classical momentum, you first correct your velocity and then make a big step according to that velocity (and then repeat), but in Nesterov momentum, you first make a step in the velocity direction and then correct the velocity vector based on the new location (then repeat). See Sutskever et al., 2013.
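The two update rules can be sketched side by side in plain Python on f(x) = x², whose gradient is 2x. The helper functions and the lr/mu values below are illustrative, not TFLearn API.

```python
def grad(x):
    return 2.0 * x  # gradient of f(x) = x**2

def momentum_step(x, v, lr=0.1, mu=0.9):
    # Classical: correct velocity using the gradient at the current point.
    v = mu * v - lr * grad(x)
    return x + v, v

def nesterov_step(x, v, lr=0.1, mu=0.9):
    # Nesterov: evaluate the gradient at the look-ahead point x + mu*v.
    v = mu * v - lr * grad(x + mu * v)
    return x + v, v

x, v = 1.0, 0.0
for _ in range(50):
    x, v = nesterov_step(x, v)
print(abs(x) < 0.1)  # converges toward the minimum at 0
```

The only difference between the two functions is where the gradient is evaluated, which is exactly the distinction described above.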

Examples

# With TFLearn estimators
nesterov = Nesterov(learning_rate=0.01, lr_decay=0.96, decay_step=100)
regression = regression(net, optimizer=nesterov)

# Without TFLearn estimators (returns tf.Optimizer)
nesterov = Nesterov(learning_rate=0.01, lr_decay=0.96).get_tensor()

Arguments

  • learning_rate: float. Learning rate.
  • momentum: float. Momentum.
  • lr_decay: float. The learning rate decay to apply.
  • decay_step: int. Apply decay every provided steps.
  • staircase: bool. If True decay learning rate at discrete intervals.
  • use_locking: bool. If True use locks for update operation.
  • name: str. Optional name prefix for the operations created when applying gradients. Defaults to "Momentum".