I've been looking at a specific optimization algorithm in the nonconvex setting, and I'm trying to analyze its convergence rate. Here's the quick setup:
Suppose we have a nonconvex, continuously differentiable function $f \colon \mathbb{R}^n \rightarrow \mathbb{R}$ that's $L$-smooth and bounded from below by $f^*$. Consider the algorithm defined by these updates (with constant step-size $\eta$ and diminishing parameter $\beta_t = 1/t$): $$ \begin{split} z_{t+1} & = z_t - \eta \nabla f(y_t), \\ y_{t+1} & = (1 - \beta_{t+1})y_t + \beta_{t+1} z_{t+1}. \end{split} $$ This method resembles "primal averaging" or momentum-based methods, and I'm particularly interested in the convergence of the squared gradient norm $\|\nabla f(y_t)\|^2$.
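For concreteness, here is a minimal numerical sketch of the iteration as I've been simulating it. The test function, step size, dimension, and the initialization $y_1 = z_1$ are illustrative choices of mine, not part of the setup above:

```python
import numpy as np

# Illustrative nonconvex, L-smooth test function (my choice, not part of the setup):
# f(x) = 0.5*||x||^2 + sum(cos(x_i)); its Hessian is I - diag(cos(x)), so L <= 2.
def f(x):
    return 0.5 * np.dot(x, x) + np.sum(np.cos(x))

def grad_f(x):
    return x - np.sin(x)

def run(T, eta=0.1, n=10, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    y = z.copy()                                   # assume y_1 = z_1
    grad_sq = []
    for t in range(1, T + 1):
        g = grad_f(y)
        grad_sq.append(float(np.dot(g, g)))        # record ||grad f(y_t)||^2
        beta_next = 1.0 / (t + 1)                  # beta_{t+1} = 1/(t+1)
        z = z - eta * g                            # z_{t+1} = z_t - eta * grad f(y_t)
        y = (1.0 - beta_next) * y + beta_next * z  # y_{t+1} = (1 - beta_{t+1}) y_t + beta_{t+1} z_{t+1}
    return np.minimum.accumulate(grad_sq)          # running min_{s <= t} ||grad f(y_s)||^2

best = run(T=10_000)
for T in (100, 1_000, 10_000):
    print(T, best[T - 1], T * best[T - 1])         # if T * best stays bounded, that suggests a 1/T rate
```

This is how I've been judging the empirical rate: if $T \cdot \min_{1\le t\le T}\|\nabla f(y_t)\|^2$ stays bounded as $T$ grows, that points to a $1/T$ rate rather than $1/\log(T)$.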
So far, I've been able to analyze the setup using the following potential (Lyapunov) function: $$ V_t = f(y_t) + \frac{\beta_t - L \beta_t^2 \eta}{2\eta(1 - \beta_t)^2}\|z_t - y_t\|^2\,. $$ With careful telescoping and standard assumptions, this yields a $1/\log(T)$ convergence rate for the squared gradient norm, i.e., $$ \min_{1\le t\le T}\|\nabla f(y_t)\|^2 \le \frac{\text{Const}}{\log(T)}\,. $$ That said, I have strong empirical evidence that the algorithm actually achieves a $1/T$ rate for the squared gradient norm in the $L$-smooth nonconvex setting.
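To make explicit where the logarithm comes from, here is my paraphrase of the telescoping (the constant $c > 0$ and the exact per-step inequality are a sketch, and I'm assuming $\eta$ is small enough that the coefficient in $V_t$ is nonnegative, so $V_t \ge f(y_t) \ge f^*$). The per-step descent inequality has the form $$ V_{t+1} \le V_t - c\,\eta\,\beta_{t+1}\,\|\nabla f(y_t)\|^2\,, $$ so summing over $t = 1,\dots,T$ and lower-bounding each gradient term by the minimum gives $$ c\,\eta\,\Big(\min_{1\le t\le T}\|\nabla f(y_t)\|^2\Big)\sum_{t=1}^{T}\beta_{t+1} \;\le\; V_1 - V_{T+1} \;\le\; V_1 - f^*\,. $$ Since $\beta_{t+1} = 1/(t+1)$, the weight sum is $\sum_{t=1}^{T}\frac{1}{t+1} = \Theta(\log T)$, and dividing through gives the $1/\log(T)$ bound. So the logarithm enters entirely through the harmonic weights sitting in front of the gradient norms.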
Question: Is it possible (under this setup, or perhaps with minor additional assumptions) to rigorously derive a stronger rate of the form $$ \min_{1\le t\le T}\|\nabla f(y_t)\|^2 \le \frac{\text{Const}}{T}\,, $$ or at least a rate better than the current $1/\log(T)$ result? If so, what adjustments to the existing Lyapunov analysis might enable this tighter bound?
Any insights, pointers, or suggested adjustments to the Lyapunov analysis would be greatly appreciated!