Adaptive Subgradient Method#
The adaptive subgradient method (AdaGrad) [Duchi2011] follows the algorithmic framework of an iterative solver with the algorithm-specific transformation \(T\), set of intrinsic parameters \(S_t\) defined for the learning rate \(\eta\), and algorithm-specific vector \(U\) and power \(d\) of Lebesgue space defined as follows:
\(T(\theta_{t - 1}, g(\theta_{t - 1}), S_{t - 1})\):
\(G_{t, i} = G_{t - 1, i} + g_i^2(\theta_{t - 1})\), where \(g_i(\theta_{t - 1})\) is the \(i\)-th coordinate of the gradient \(g(\theta_{t - 1})\)
\(\theta_t = \theta_{t - 1} - \frac {\eta}{\sqrt{G_t + \varepsilon}} g(\theta_{t - 1})\), where
\[\frac {\eta}{\sqrt{G_t + \varepsilon}} g(\theta_{t - 1}) = \{\frac {\eta}{\sqrt{G_{t, 1} + \varepsilon}} g_1(\theta_{t - 1}), \ldots, \frac {\eta}{\sqrt{G_{t, p} + \varepsilon}} g_p(\theta_{t - 1})\}\]
Convergence check: \(U = g(\theta_{t - 1}), d = 2\)
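The per-coordinate update above can be sketched on plain arrays. This is a hedged illustration of the mathematics only, independent of the oneDAL API; the function and variable names (`adagradStep`, `theta`, `G`) are made up for this example:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One AdaGrad step: accumulate squared gradient coordinates into G,
// then scale each gradient coordinate by eta / sqrt(G_i + eps).
void adagradStep(std::vector<double>& theta, std::vector<double>& G,
                 const std::vector<double>& grad, double eta, double eps) {
    for (std::size_t i = 0; i < theta.size(); ++i) {
        G[i] += grad[i] * grad[i];                    // G_{t,i} = G_{t-1,i} + g_i^2
        theta[i] -= eta / std::sqrt(G[i] + eps) * grad[i];
    }
}
```

The per-coordinate scaling is why AdaGrad needs no hand-tuned per-parameter step sizes: coordinates with historically large gradients receive smaller effective steps. The convergence check then compares the \(L_2\) norm of \(g(\theta_{t - 1})\) (that is, \(U\) with \(d = 2\)) against the accuracy threshold of the iterative solver.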
Computation#
The adaptive subgradient (AdaGrad) method is a special case of an iterative solver. For parameters, input, and output of iterative solvers, see Computation for Iterative Solver.
Algorithm Input#
In addition to the input of the iterative solver, the AdaGrad method accepts the following optional input:
| OptionalDataID | Input |
|---|---|
| `gradientSquareSum` | A numeric table of size \(p \times 1\) with the values of \(G_t\). Each value is an accumulated sum of squares of coordinate values of a corresponding gradient. |
Algorithm Parameters#
In addition to parameters of the iterative solver, the AdaGrad method has the following parameters:
| Parameter | Default Value | Description |
|---|---|---|
| `algorithmFPType` | `float` | The floating-point type that the algorithm uses for intermediate computations. Can be `float` or `double`. |
| `method` | `defaultDense` | Default performance-oriented computation method. |
| `batchIndices` | `NULL` | A numeric table of size \(\text{nIterations} \times \text{batchSize}\) for the `defaultDense` method that represents 32-bit integer indices of terms in the objective function. If no indices are provided, the implementation generates random indices. |
| `batchSize` | \(128\) | The number of batch indices used to compute the stochastic gradient. If `batchSize` equals the number of terms in the objective function, no random sampling is performed, and all terms are used to compute the gradient. The algorithm ignores this parameter if the `batchIndices` parameter is provided. |
| `learningRate` | A numeric table of size \(1 \times 1\) that contains the default step length equal to \(0.01\) | A numeric table of size \(1 \times 1\) that contains the value of learning rate \(\eta\). **Note:** This parameter can be an object of any class derived from `NumericTable`, except for `PackedTriangularMatrix`, `PackedSymmetricMatrix`, and `CSRNumericTable`. |
| `degenerateCasesThreshold` | \(1\mathrm{e}{-08}\) | Value \(\varepsilon\) needed to avoid degenerate cases when computing square roots. |
| `engine` | `SharedPtr<engines::mt19937::Batch>()` | Pointer to the random number generator engine that is used internally for generation of 32-bit integer indices of terms in the objective function. |
Algorithm Output#
In addition to the output of the iterative solver, the AdaGrad method calculates the following optional result:
| OptionalDataID | Output |
|---|---|
| `gradientSquareSum` | A numeric table of size \(p \times 1\) with the values of \(G_t\). Each value is an accumulated sum of squares of coordinate values of a corresponding gradient. |