CS289ML: Notes on convergence of gradient descent

• A differentiable function f : ℝᵈ → ℝ is L-smooth (that is, its gradient is L-Lipschitz) if for all x, y ∈ ℝᵈ, ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂. The functions x² and eˣ are convex, and so is the loss function from least squares regression (LSR), f(x) = (1/n) ∑_{j=1}^{n} (⟨aⱼ, x⟩ − bⱼ)². For what L is this loss L-smooth? Let A be the n × d matrix whose rows are the vectors aⱼ and let b be the n-dimensional vector whose entries are the bⱼ's. Then we can also write f(x) = (1/n)‖Ax − b‖₂², so that ∇f(x) = (2/n)Aᵀ(Ax − b). Therefore, for x, y ∈ ℝᵈ,
  ‖∇f(x) − ∇f(y)‖₂ = (2/n)‖Aᵀ(Ax − b) − Aᵀ(Ay − b)‖₂ = (2/n)‖AᵀA(x − y)‖₂ ≤ (2/n)‖AᵀA‖₂ · ‖x − y‖₂,
so the LSR loss is L-smooth with L = (2/n)‖AᵀA‖₂, i.e., 2/n times the spectral norm of AᵀA.
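As a quick numerical sanity check (not part of the derivation above), the short Python sketch below draws a random A and b, sets L = (2/n)‖AᵀA‖₂, and verifies ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂ on sampled pairs of points; the dimensions and random data are assumptions chosen only for illustration.

```python
import numpy as np

# Random least-squares instance (illustrative choices, not from the notes).
rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_f(x):
    # Gradient of f(x) = (1/n) * ||Ax - b||^2, i.e. (2/n) * A^T (Ax - b).
    return (2.0 / n) * A.T @ (A @ x - b)

# Smoothness constant from the derivation: L = (2/n) * ||A^T A||_2 (spectral norm).
L = (2.0 / n) * np.linalg.norm(A.T @ A, 2)

# Check the Lipschitz bound on the gradient for sampled pairs (x, y).
for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    lhs = np.linalg.norm(grad_f(x) - grad_f(y))
    rhs = L * np.linalg.norm(x - y)
    assert lhs <= rhs + 1e-9
print("L =", L, "satisfies the gradient Lipschitz bound on all sampled pairs.")
```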