L1 Regularized Logistic Regression

We implemented a subgradient-descent-like method for L1-regularized logistic regression and a nonlinear conjugate gradient method for logistic regression regularized with the Huber loss function. We investigated various aspects of these algorithms and applied them to several datasets (obtained from the University of California, Irvine repository and from the CS229 class).

Purpose

To implement logistic regression with L1 and L1-like regularization using different implementation strategies, and to investigate their properties.

Notation

The function to be minimized is of the form f(θ) + λ ||θ||n, where f(θ) is the negative log-likelihood of logistic regression, ||θ||n is the regularization term (n = 1 for the L1 norm), and λ (≥ 0) is the weight of the regularization term. When noise is added, a fraction p of the training labels is flipped.

Methods

(A) Subgradient-descent-like algorithm

Motivation: The L1 norm is not differentiable at zero, so the usual gradient descent can oscillate when one of the parameters is close to zero. To overcome this problem, we modify the update rule for the weights θ; a code sketch follows the rule below.

Update rule: For a small positive ε (≈ 0.01) and learning rate α,
If |θi| > ε, perform the usual gradient-descent update [class notes, CS229].
If |θi| ≤ ε and |∂f/∂θi| ≤ λ, leave θi unchanged.
If |θi| ≤ ε and ∂f/∂θi > λ, then θi := θi − α (∂f/∂θi − λ).
If |θi| ≤ ε and ∂f/∂θi < −λ, then θi := θi − α (∂f/∂θi + λ).
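The following is a minimal sketch of the modified update rule, not the authors' original code. It assumes the standard logistic model with labels y in {0, 1}, so that f(θ) is the negative log-likelihood defined above; the names (X, y, lam, alpha, eps, l1_subgradient_step) and the use of NumPy are illustrative, and the "usual update" for |θi| > ε is taken to include the λ·sign(θi) derivative of the regularizer.

```python
# Minimal sketch of the subgradient-descent-like update (assumptions noted above).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood_grad(theta, X, y):
    """Gradient of f(theta), the negative log-likelihood of logistic regression,
    for a design matrix X (m x d) and labels y in {0, 1}."""
    p = sigmoid(X @ theta)
    return X.T @ (p - y)

def l1_subgradient_step(theta, X, y, lam, alpha, eps=0.01):
    """One pass of the modified update rule over every coordinate of theta."""
    grad = neg_log_likelihood_grad(theta, X, y)
    new_theta = theta.copy()
    for i in range(theta.size):
        g = grad[i]
        if abs(theta[i]) > eps:
            # Usual gradient-descent step on f(theta) + lam * |theta_i|
            # (the L1 term is differentiable here, with derivative lam * sign(theta_i)).
            new_theta[i] = theta[i] - alpha * (g + lam * np.sign(theta[i]))
        elif abs(g) <= lam:
            # Near zero and the regularizer dominates: keep theta_i where it is.
            pass
        elif g > lam:
            new_theta[i] = theta[i] - alpha * (g - lam)
        else:  # g < -lam
            new_theta[i] = theta[i] - alpha * (g + lam)
    return new_theta

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(float)
theta = np.zeros(10)
for _ in range(500):
    theta = l1_subgradient_step(theta, X, y, lam=1.0, alpha=0.01)
```

Holding θi fixed when |θi| ≤ ε and |∂f/∂θi| ≤ λ is what prevents the oscillation around zero described in the motivation.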
