I've been looking at a specific optimization algorithm in the nonconvex setting, and I'm trying to analyze its convergence rate. Here's the quick setup:
Suppose we have a nonconvex, continuously differentiable function $f \colon \mathbb{R}^n \rightarrow \mathbb{R}$ that's $L$-smooth and bounded from below by $f^*$. Consider the algorithm defined by these updates (with constant step-size $\eta$ and diminishing parameter $\beta_t = 1/t$): $$ \begin{split} z_{t+1} & = z_t - \eta \nabla f(y_t), \\ y_{t+1} & = (1 - \beta_{t+1})y_t + \beta_{t+1} z_{t+1}. \end{split} $$ This method resembles "primal averaging" or momentum-based methods, and I'm particularly interested in the convergence of the squared gradient norm $\|\nabla f(y_t)\|^2$.
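For concreteness, here is a minimal numerical sketch of the iteration as I've been simulating it. The test function, step size, dimension, and the initialization $y_1 = z_1$ are illustrative choices of mine, not part of the setup above:

```python
import numpy as np

# Illustrative nonconvex, L-smooth test function (my choice, not part of the setup):
# f(x) = 0.5*||x||^2 + sum(cos(x_i)); its Hessian is I - diag(cos(x)), so L <= 2.
def f(x):
    return 0.5 * np.dot(x, x) + np.sum(np.cos(x))

def grad_f(x):
    return x - np.sin(x)

def run(T, eta=0.1, n=10, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    y = z.copy()                                   # assume y_1 = z_1
    grad_sq = []
    for t in range(1, T + 1):
        g = grad_f(y)
        grad_sq.append(float(np.dot(g, g)))        # record ||grad f(y_t)||^2
        beta_next = 1.0 / (t + 1)                  # beta_{t+1} = 1/(t+1)
        z = z - eta * g                            # z_{t+1} = z_t - eta * grad f(y_t)
        y = (1.0 - beta_next) * y + beta_next * z  # y_{t+1} = (1 - beta_{t+1}) y_t + beta_{t+1} z_{t+1}
    return np.minimum.accumulate(grad_sq)          # running min_{s <= t} ||grad f(y_s)||^2

best = run(T=10_000)
for T in (100, 1_000, 10_000):
    print(T, best[T - 1], T * best[T - 1])         # if T * best stays bounded, that suggests a 1/T rate
```

This is how I've been judging the empirical rate: if $T \cdot \min_{1\le t\le T}\|\nabla f(y_t)\|^2$ stays bounded as $T$ grows, that points to a $1/T$ rate rather than $1/\log(T)$.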
So far, I've been able to analyze the setup using the following potential (Lyapunov) function: $$ V_t = f(y_t) + \frac{\beta_t - L \beta_t^2 \eta}{2\eta(1 - \beta_t)^2}\|z_t - y_t\|^2\,. $$ With careful telescoping and standard assumptions, this yields a $1/\log(T)$ convergence rate for the squared gradient norm, i.e., $$ \min_{1\le t\le T}\|\nabla f(y_t)\|^2 \le \frac{\text{Const}}{\log(T)}\,. $$ That said, I have strong empirical evidence that the algorithm actually achieves a $1/T$ rate for the squared gradient norm in the $L$-smooth nonconvex setting.
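To make explicit where the logarithm comes from, here is my paraphrase of the telescoping (the constant $c > 0$ and the exact per-step inequality are a sketch, and I'm assuming $\eta$ is small enough that the coefficient in $V_t$ is nonnegative, so $V_t \ge f(y_t) \ge f^*$). The per-step descent inequality has the form $$ V_{t+1} \le V_t - c\,\eta\,\beta_{t+1}\,\|\nabla f(y_t)\|^2\,, $$ so summing over $t = 1,\dots,T$ and lower-bounding each gradient term by the minimum gives $$ c\,\eta\,\Big(\min_{1\le t\le T}\|\nabla f(y_t)\|^2\Big)\sum_{t=1}^{T}\beta_{t+1} \;\le\; V_1 - V_{T+1} \;\le\; V_1 - f^*\,. $$ Since $\beta_{t+1} = 1/(t+1)$, the weight sum is $\sum_{t=1}^{T}\frac{1}{t+1} = \Theta(\log T)$, and dividing through gives the $1/\log(T)$ bound. So the logarithm enters entirely through the harmonic weights sitting in front of the gradient norms.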
Question: Is it possible (under this setup, or perhaps with minor additional assumptions) to rigorously derive a stronger rate of the form $$ \min_{1\le t\le T}\|\nabla f(y_t)\|^2 \le \frac{\text{Const}}{T}\,, $$ or at least a rate better than the current $1/\log(T)$ result? If so, what adjustments to the existing Lyapunov analysis might enable this tighter bound?
Any insights, pointers, or suggested adjustments to the Lyapunov analysis would be greatly appreciated!