casimir.optim.IncrementalFirstOrderOracle
Base class for incremental first order oracles (IFO).
For a function \(f(w)\) defined as \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)\), this class is an interface to implement methods to compute \(f_i(w)\) and its gradient \(\nabla f_i(w)\), as well as the batch function value \(f(w)\) and the batch gradient \(\nabla f(w)\), provided each \(f_i\) is smooth. If \(f_i\) is not smooth, use the subclass SmoothedIncrementalFirstOrderOracle instead.

A concrete implementation must at least override the methods function_value, gradient and __len__. Optionally, it may also override evaluation_function for domain-specific evaluation metrics such as classification accuracy. By default, the methods batch_function_value and batch_gradient are implemented as a loop that averages the outputs of function_value and gradient, respectively. If a more efficient implementation exists, derived classes may override these methods as well.

The input idx to function_value and gradient must not exceed \(n\); this is not explicitly checked here.
batch_function_value(model)
Return the batch function value \(f(w)\), where \(w\) is model.

batch_gradient(model)
Return the batch gradient \(\nabla f(w)\), where \(w\) is model.

evaluation_function(model)
Return domain-specific task metrics (default None).

function_value(model, idx)
Return the function value \(f_i(w)\), where \(w\) is model and \(i\) is idx.

gradient(model, idx)
Return the gradient \(\nabla f_i(w)\), where \(w\) is model and \(i\) is idx.
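As a concrete illustration, a least-squares oracle might be implemented as follows. This is a minimal sketch: the class LeastSquaresIFO and its constructor are hypothetical, and it assumes the base class needs no constructor arguments; only the overridden method names come from the interface documented above.

    import numpy as np
    from casimir.optim import IncrementalFirstOrderOracle

    class LeastSquaresIFO(IncrementalFirstOrderOracle):
        """Hypothetical oracle for f_i(w) = 0.5 * (x_i^T w - y_i)^2."""

        def __init__(self, X, y):
            self.X = X  # (n, d) design matrix
            self.y = y  # (n,) targets

        def function_value(self, model, idx):
            residual = self.X[idx] @ model - self.y[idx]
            return 0.5 * residual ** 2

        def gradient(self, model, idx):
            residual = self.X[idx] @ model - self.y[idx]
            return residual * self.X[idx]

        def __len__(self):
            return self.X.shape[0]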
casimir.optim.SmoothedIncrementalFirstOrderOracle(smoothing_coefficient=None)
Base class for smoothed incremental first order oracles.

For a function \(f(w)\) defined as \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)\), where each \(f_i\) is non-smooth but smoothable, this class is an interface to implement methods to compute \(f_{i, \mu}(w)\) and its gradient \(\nabla f_{i, \mu}(w)\), as well as the batch function value \(\big(f(w), f_{\mu}(w)\big)\) and the batch gradient \(\nabla f_{\mu}(w)\). Here, \(g_\mu\) denotes a smooth surrogate to a non-smooth function \(g\), parameterized by a smoothing coefficient \(\mu\).

When \(\mu\) is None (i.e., no smoothing), the implementation must serve as an IFO for the original non-smooth function \(f\), in which case \(\nabla f_i\) refers to a subgradient of \(f_i\).
Note: This class contains a field smoothing_coefficient to represent \(\mu\), which can be mutated by optimization algorithms or other functions that use adaptive smoothing schemes.
function_value(model, idx)
Return the pair \(\big( f_i(w), f_{i, \mu}(w) \big)\). If smoothing_coefficient is None, i.e., no smoothing, simply return \(f_i(w)\). Here, \(i\) represents the index idx.

gradient(model, idx)
Return the gradient \(\nabla f_{i, \mu}(w)\) if smoothing_coefficient is not None, or \(\nabla f_i(w)\) if smoothing_coefficient is None.
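For instance, a smoothed oracle for the absolute loss \(f_i(w) = |x_i^\top w - y_i|\) could use a Huber-style surrogate. The class below is a hypothetical sketch (the smoothing scheme is not taken from the library); it only relies on the smoothing_coefficient field and the function_value/gradient contract described above.

    import numpy as np
    from casimir.optim import SmoothedIncrementalFirstOrderOracle

    class SmoothedAbsoluteLossIFO(SmoothedIncrementalFirstOrderOracle):
        """Hypothetical smoothed oracle for f_i(w) = |x_i^T w - y_i|."""

        def __init__(self, X, y, smoothing_coefficient=None):
            super().__init__(smoothing_coefficient)
            self.X, self.y = X, y

        def function_value(self, model, idx):
            z = self.X[idx] @ model - self.y[idx]
            mu = self.smoothing_coefficient
            if mu is None:  # no smoothing: return f_i(w) only
                return abs(z)
            # Huber-style smooth surrogate f_{i, mu}(w)
            f_mu = z * z / (2 * mu) if abs(z) <= mu else abs(z) - mu / 2
            return abs(z), f_mu

        def gradient(self, model, idx):
            z = self.X[idx] @ model - self.y[idx]
            mu = self.smoothing_coefficient
            if mu is None:  # subgradient of the non-smooth loss
                return np.sign(z) * self.X[idx]
            return np.clip(z / mu, -1.0, 1.0) * self.X[idx]

        def __len__(self):
            return self.X.shape[0]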
casimir.optim.optimize_ifo(initial_model, train_ifo, dev_ifo=None, test_ifo=None, algorithm='SGD', reg_penalties=None, num_passes=100, termination_criterion=None, seed=25, logging=True, verbose=True, optim_options=None)
Minimize a convex function with access to a (smoothed) incremental first order oracle using a primal algorithm.

Functions that can be handled are of the form \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) + r(w)\), where a (smoothed) incremental first order oracle gives access to each \(f_i\) and \(r(w)\) is a (sum of) regularization penalties. The argument algorithm controls which optimization algorithm is used.
Returns: The tuple (final iterate, logs), where logs is a list of lists, one for each epoch. Each list contains the epoch number, train function value, dev function value, dev evaluation metric, test function value, test evaluation metric and the time taken to perform this epoch. The dev (test) statistics are printed only if dev_ifo (test_ifo) is not None.
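A typical call, reusing the hypothetical LeastSquaresIFO sketched earlier (the data, dimensions and the optim_options key are illustrative; the full list of accepted options was given in the original parameter table):

    import numpy as np
    from casimir.optim import optimize_ifo

    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 10), rng.randn(100)
    train_ifo = LeastSquaresIFO(X, y)   # hypothetical oracle from the sketch above
    w0 = np.zeros(10)

    final_model, logs = optimize_ifo(
        w0, train_ifo,
        algorithm='SGD',                # default algorithm
        num_passes=20,
        optim_options={'initial_learning_rate': 0.1},
    )
    for epoch_log in logs:
        print(epoch_log)                # epoch number, train value, time taken, ...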

casimir.optim.Optimizer(initial_model, reg_penalties=None)
Base class for optimizers using (smoothed) incremental first order oracles.

Optimizer classes store the state of the variables to be optimized. Any subclass must override the methods start_epoch, step and end_epoch.
end_epoch()
Perform optional computation to end the epoch and return the current iterate to be used for logging.
Returns: model, denoting the optimizer state at the current time.

start_epoch(train_ifo, epoch)
Prepare for the start of an epoch. Example uses are the warm start in CasimirSVRG or the batch gradient computation in SVRG. Adaptive smoothing algorithms may update the smoothing parameter here.
Param: train_ifo: (Smoothed) Incremental First Order Oracle for the training set. This object may be mutated by adaptive smoothing algorithms.

step(train_ifo, idx, iteration)
Take a step based on the sample indexed by idx.
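The three hooks imply the epoch structure sketched below. This is an illustrative outline of how a driver such as optimize_ifo might call an Optimizer subclass, not the library's actual loop.

    import random

    def run_epochs(optimizer, train_ifo, num_passes):
        """Illustrative driver for an Optimizer subclass (not the library's code)."""
        n = len(train_ifo)
        model = None
        for epoch in range(num_passes):
            optimizer.start_epoch(train_ifo, epoch)        # e.g. batch gradient for SVRG
            for iteration in range(n):
                idx = random.randrange(n)                  # sample an index uniformly
                optimizer.step(train_ifo, idx, iteration)  # one stochastic step on f_idx
            model = optimizer.end_epoch()                  # iterate used for logging
            print(epoch, train_ifo.batch_function_value(model))
        return model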
casimir.optim.CasimirSVRGOptimizer(initial_model, reg_penalties, learning_rate=None, grad_lipschitz_parameter=None, initial_smoothing_coefficient=None, smoothing_coefficient_update_rule='const', initial_moreau_coefficient=None, moreau_coefficient_update_rule='const', warm_start='prox-center')
Implement Casimir (Catalyst with smoothing) or the Catalyst outer loop with SVRG as the inner loop.

This optimizer can be invoked by calling optimize_ifo() with argument algorithm='casimir_svrg' or algorithm='catalyst_svrg'. By default, initial_smoothing_coefficient and initial_moreau_coefficient are both set to None, and moreau_coefficient_update_rule is set to 'const'.


The following named parameters can be passed through optim_options of optimize_ifo():
Allowed values of smoothing_coefficient_update_rule: 'const', 'linear', 'expo'.
Allowed values of warm_start: 'prox-center', 'prev-iterate', 'extrapolation'.
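For example, assuming optim_options forwards these keyword arguments to the CasimirSVRGOptimizer constructor (an assumption; the accepted keys were listed in the original parameter table), a run could be configured as:

    final_model, logs = optimize_ifo(
        w0, train_ifo,                  # as in the earlier sketches
        algorithm='casimir_svrg',
        num_passes=50,
        optim_options={
            'learning_rate': 0.1,
            'initial_smoothing_coefficient': 10.0,
            'smoothing_coefficient_update_rule': 'expo',
            'warm_start': 'prox-center',
        },
    )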
end_epoch()
Remove the prox penalty, compute \(\alpha, \beta\) and update the averaged iterate.
start_epoch(train_ifo, epoch)
Update the Moreau coefficient, smoothing coefficient, learning rate, warm start and batch gradient.

step(train_ifo, idx, iteration)
Take a single inner SVRG step and update the averaged iterate.
casimir.optim.SGDOptimizer(initial_model, reg_penalties=None, learning_rate_scheme='const', averaging='wavg', initial_learning_rate=1.0, learning_rate_time_factor=100.0)
Implement the stochastic (sub)gradient method with various learning rate and averaging schemes.

This optimizer can be invoked by calling optimize_ifo() with argument algorithm='sgd'.
The following named parameters can be passed through optim_options of optimize_ifo():
Allowed values of the parameter learning_rate_scheme and the corresponding learning rates \(\eta_t\) at iteration \(t\) are:

- 'const' (default): \(\eta_t = \eta_0\),
- 'pegasos': \(\eta_t = 1/(\lambda t)\),
- 'linear': \(\eta_t = \eta_0 / (1 + t/t_0)\).

Here, \(\eta_0, t_0\) are the parameters described above and \(\lambda\) is the strong convexity coefficient.
Allowed values of the parameter averaging and the corresponding averaged iterates used are:

- 'none': no averaging, use the final iterate,
- 'uavg': uniform average \(\bar w_T = \frac{1}{T+1}\sum_{t=0}^{T} w_t\),
- 'wavg' (default): weighted average \(\bar w_T = \frac{2}{T(T+1)} \sum_{t=0}^{T} t\, w_t\).

end_epoch()
Return the averaged iterate.
start_epoch(train_ifo, epoch)
Do nothing.

step(train_ifo, idx, iteration)
Perform a single SGD step and update the averaged iterates.
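The schedules above can be restated directly in code; the sketch below uses hypothetical values of \(\eta_0\), \(t_0\) and \(\lambda\) purely to illustrate the formulas.

    def learning_rate(scheme, t, eta0=1.0, t0=100.0, lam=0.01):
        """Restate the SGD learning rate schedules listed above."""
        if scheme == 'const':
            return eta0
        if scheme == 'pegasos':
            return 1.0 / (lam * t)          # requires t >= 1
        if scheme == 'linear':
            return eta0 / (1.0 + t / t0)
        raise ValueError(scheme)

    print([round(learning_rate('linear', t), 3) for t in (0, 100, 1000)])
    # -> [1.0, 0.5, 0.091]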
casimir.optim.SVRGOptimizer(initial_model, reg_penalties, learning_rate=1.0, smoothing_coefficient=None)
Implement Stochastic Variance Reduced Gradient (SVRG) with optional smoothing.

This optimizer can be invoked by calling optimize_ifo() with argument algorithm='svrg'. The following named parameters can be passed through optim_options of optimize_ifo(): learning_rate and smoothing_coefficient.
end_epoch()
Copy the averaged iterate to the main iterate and return the averaged iterate.

start_epoch(train_ifo, epoch)
Update the smoothing coefficient, reset the averaged iterate and update the batch gradient.

step(train_ifo, idx, iteration)
Take a single SVRG step and update the averaged iterate.
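As a reminder of what a single inner step computes, a generic SVRG update (a sketch, not the library's exact implementation) combines the per-example gradient at the current iterate, the same example's gradient at the reference point cached by start_epoch, and the batch gradient of that reference point:

    def svrg_step(train_ifo, w, w_ref, batch_grad_ref, idx, learning_rate):
        """Generic variance-reduced step; batch_grad_ref = train_ifo.batch_gradient(w_ref)."""
        v = train_ifo.gradient(w, idx) - train_ifo.gradient(w_ref, idx) + batch_grad_ref
        return w - learning_rate * v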
casimir.optim.block_coordinate_frank_wolfe_optimize(initial_model, train_ifo, dev_ifo=None, test_ifo=None, reg_penalties=None, num_passes=100, termination_criterion=None, seed=25, logging=True, verbose=True)
Implement the Block Coordinate Frank-Wolfe (BCFW) algorithm for structured prediction.

This algorithm does not fit the incremental first order oracle model: it requires the task loss in addition to first order information about the objective function. From an implementation point of view, it requires an IFO with the linear_minimization_oracle method implemented.

Implemented here is Algorithm 4 of Lacoste-Julien et al., Block-Coordinate Frank-Wolfe Optimization for Structural SVMs (2012).
Returns:  The tuple (final iterate, logs). 
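A call mirrors optimize_ifo; the oracle must additionally implement linear_minimization_oracle (its signature is not reproduced here, and struct_ifo below is a hypothetical object constructed elsewhere):

    from casimir.optim import block_coordinate_frank_wolfe_optimize

    # struct_ifo: an oracle for a structured prediction task that implements
    # the linear_minimization_oracle method (hypothetical, built elsewhere).
    final_model, logs = block_coordinate_frank_wolfe_optimize(
        w0, struct_ifo, num_passes=50, seed=25)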
casimir.optim.termination_gradient_norm(model, train_ifo, gradient_norm_tolerance=1e-05)
Terminate optimization after the gradient norm falls below a certain tolerance.
Returns: True if the gradient norm is smaller than the tolerance, False otherwise.
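Assuming this function matches the termination_criterion hook of optimize_ifo (an inference from the shared (model, train_ifo) arguments), it can be passed directly, with functools.partial used here only to change the tolerance:

    from functools import partial
    from casimir.optim import optimize_ifo, termination_gradient_norm

    final_model, logs = optimize_ifo(
        w0, train_ifo,                  # as in the earlier sketches
        num_passes=200,
        termination_criterion=partial(termination_gradient_norm,
                                      gradient_norm_tolerance=1e-4),
    )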
casimir.optim.L2Penalty(regularization_parameter, prox_center=None)
Class representing the L2 regularization penalty.

For a prox-center \(z\) and regularization parameter \(\lambda\), the penalty takes the form \(r(w) = \frac{\lambda}{2}\|w - z\|^2_2\).
function_value(model)
Return the function value of the regularization penalty.

gradient(model)
Return the gradient of the penalty at model.

strong_convexity()
Strong convexity coefficient w.r.t. the L2 norm.
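Penalties of this kind are supplied to optimize_ifo through reg_penalties; the sketch below assumes reg_penalties expects a list of penalty objects (an inference from the plural name and the "sum of regularization penalties" phrasing above):

    from casimir.optim import L2Penalty, optimize_ifo

    # lambda = 0.1 in r(w) = (lambda / 2) * ||w - z||_2^2
    penalty = L2Penalty(regularization_parameter=0.1)
    final_model, logs = optimize_ifo(
        w0, train_ifo,                  # as in the earlier sketches
        reg_penalties=[penalty],
        num_passes=50,
    )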