A Complete Understanding of LASSO Regression


Contributed by: Dinesh Kumar

We hope you’ve had a chance to go through our previous articles on regression and ridge regression. Building on that foundation, it’s now time to explore lasso regression.

Like its predecessors, lasso regression is a powerful tool for predictive modeling, but it comes with its unique twist—penalizing the absolute size of the regression coefficients.

Understanding all three methods together not only deepens your knowledge of statistical techniques but also equips you with a versatile toolkit for tackling real-world data challenges.

This article will delve into why lasso regression is essential and how it complements the techniques we’ve previously discussed. If you’re looking to enhance your skills further, enrolling in a machine learning course can provide you with the structured learning needed to master techniques like lasso regression and many others. So, let’s dive in and connect these dots to see why each method is crucial and why continuing this learning journey through our blog is invaluable.

Introduction to LASSO Regression

LASSO regression, also known as L1 regularization, is a popular technique used in statistical modeling and machine learning to estimate the relationships between variables and make predictions. LASSO stands for Least Absolute Shrinkage and Selection Operator.

The primary goal of LASSO regression is to find a balance between model simplicity and accuracy. It achieves this by adding a penalty term to the traditional linear regression model, which encourages sparse solutions where some coefficients are forced to be exactly zero.

This feature makes LASSO particularly useful for feature selection, as it can automatically identify and discard irrelevant or redundant variables.

What is Lasso Regression?

Lasso regression is a regularization technique. It is applied on top of ordinary regression methods to obtain more accurate predictions. This model uses shrinkage, where data values are shrunk towards a central point, such as the mean.

The lasso procedure encourages simple, sparse models (i.e., models with fewer parameters). This particular type of regression is well-suited for models showing high levels of multicollinearity, or when you want to automate certain parts of model selection, such as variable selection/parameter elimination.

Lasso regression uses the L1 regularization technique (discussed later in this article). It is especially useful when we have many features, because it automatically performs feature selection.

Step-by-Step Explanation of How LASSO Regression Works


  1. Linear Regression Model
    LASSO regression starts with the standard linear regression model, which assumes a linear relationship between the independent variables (features) and the dependent variable (target). The linear regression equation can be represented as follows:
    y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

 Where:

  • y is the dependent variable (target).
  • β₀, β₁, β₂, …, βₚ are the coefficients (parameters) to be estimated.
  • x₁, x₂, …, xₚ are the independent variables (features).
  • ε represents the error term.
  2. L1 Regularization
    LASSO regression introduces an additional penalty term based on the absolute values of the coefficients. The L1 regularization term is the sum of the absolute values of the coefficients multiplied by a tuning parameter λ:

    L₁ = λ * (|β₁| + |β₂| + … + |βₚ|)

    Where:
    • λ is the regularization parameter that controls the amount of regularization applied.
    • β₁, β₂, …, βₚ are the coefficients.
  3. Objective Function
    The objective of LASSO regression is to find the values of the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, while also minimizing the L1 regularization term:

    RSS + L₁ 

    Where:
    RSS is the residual sum of squares, which measures the error between the predicted values and the actual values.
  4. Shrinking Coefficients
    By adding the L1 regularization term, LASSO regression can shrink the coefficients towards zero. When λ is sufficiently large, some coefficients are driven to exactly zero. 

    This property of LASSO makes it useful for feature selection, as the variables with zero coefficients are effectively removed from the model.
  5. Tuning Parameter λ
    The choice of the regularization parameter λ is crucial in LASSO regression. A larger λ value increases the amount of regularization, leading to more coefficients being pushed towards zero.

    Conversely, a smaller λ value reduces the regularization effect, allowing more variables to have non-zero coefficients.
  6. Model Fitting
    To estimate the coefficients in LASSO regression, an optimization algorithm is used to minimize the objective function. Coordinate descent is commonly employed: it iteratively updates each coefficient while holding the others fixed (a minimal sketch follows this list).
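
To make these steps concrete, here is a minimal sketch of lasso coordinate descent in Python. It is illustrative rather than production code: the function names (soft_threshold, lasso_coordinate_descent), the synthetic data, and the λ value are our own choices, and the objective assumed is the one defined above, 0.5 * RSS + λ * (|β₁| + … + |βₚ|).

import numpy as np

def soft_threshold(rho, lam):
    # Closed-form solution of the one-coefficient lasso subproblem:
    # shift rho toward zero by lam, and return exactly zero in between.
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    # Minimizes 0.5 * ||y - X @ beta||^2 + lam * ||beta||_1 by cyclically
    # updating one coefficient while holding the others fixed.
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Partial residual with feature j's contribution removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return beta

# Demo: only 3 of 10 features carry signal; the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)
print(np.round(lasso_coordinate_descent(X, y, lam=50.0), 2))
# Noise coefficients land at exactly 0; signal coefficients are shrunk.

The soft-thresholding step is precisely what produces exact zeros: whenever |ρ| ≤ λ for a feature, its coefficient is set to zero outright rather than merely reduced.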

LASSO regression offers a powerful framework for both prediction and feature selection, especially when dealing with high-dimensional datasets where the number of features is large.

By striking a balance between simplicity and accuracy, LASSO can provide interpretable models while effectively managing the risk of overfitting.

It’s worth noting that LASSO is just one type of regularization technique, and there are other variants, such as Ridge regression (L2 regularization) and Elastic Net (which combines the L1 and L2 penalties).

Lasso Meaning

The word “LASSO” stands for Least Absolute Shrinkage and Selection Operator. It is a statistical formula for the regularization of data models and feature selection.

Regularization

Regularization is an important concept used to avoid overfitting, especially when the training and test data differ considerably.

Regularization is implemented by adding a “penalty” term to the best fit derived from the training data, in order to achieve lower variance on the test data. It also restricts the influence of the predictor variables on the output variable by compressing their coefficients.

In regularization, we normally keep the same number of features but reduce the magnitude of the coefficients. This is done using regression techniques that build a regularization penalty into the fitting procedure. So, let us discuss them.


Lasso Regularization Techniques

There are two main regularization techniques, namely Ridge Regression and Lasso Regression. They both differ in the way they assign a penalty to the coefficients. In this blog, we will try to understand more about Lasso Regularization technique.

L1 Regularization

If a regression model uses the L1 regularization technique, it is called Lasso Regression; if it uses the L2 regularization technique, it is called Ridge Regression. We will study more about these in the later sections.

L1 regularization adds a penalty equal to the absolute value of the magnitude of each coefficient. This type of regularization can result in sparse models with few coefficients: some coefficients may become exactly zero and be eliminated from the model.

Larger penalties result in coefficient values closer to zero, which is ideal for producing simpler models. L2 regularization, on the other hand, does not eliminate coefficients or produce sparse models. Thus, Lasso regression is easier to interpret than Ridge.
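
The sparsity contrast between the two penalties is easy to check empirically. The sketch below is a minimal illustration using scikit-learn; the alpha value of 0.1 and the synthetic data are our own arbitrary choices. It fits Lasso and Ridge on the same data and counts coefficients that are exactly zero.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Only the first 3 of 20 features carry signal.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Zero coefficients (Lasso):", int((lasso.coef_ == 0).sum()))  # most noise features
print("Zero coefficients (Ridge):", int((ridge.coef_ == 0).sum()))  # 0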

While there are ample resources available online to help you understand the subject, getting certified in it sets you apart.

Check out Great Learning’s best artificial intelligence course online to upskill in the domain. This course will help you learn from a top-ranking global school to build job-ready AIML skills. This 12-month program offers a hands-on learning experience with top faculty and mentors. On completion, you will receive a Certificate from The University of Texas at Austin, and Great Lakes Executive Learning.

Mathematical equation of Lasso Regression

Residual Sum of Squares + λ * (Sum of the absolute value of the magnitude of coefficients)

Where,

  • λ denotes the amount of shrinkage.
  • λ = 0 implies all features are considered; this is equivalent to ordinary linear regression, where only the residual sum of squares is used to build the predictive model.
  • λ = ∞ implies no feature is considered; as λ approaches infinity, more and more features are eliminated.
  • The bias increases as λ increases.
  • The variance increases as λ decreases (a short sketch of this trade-off follows this list).
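
To see this trade-off in action, the sketch below sweeps the regularization strength and counts the surviving features. It is a minimal illustration: scikit-learn calls the parameter alpha rather than λ, and the grid of values and synthetic data are our own choices.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 15))
# Only 2 of 15 features carry signal.
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=200)

for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    print(f"alpha={alpha:>6}: {int((model.coef_ != 0).sum())} non-zero coefficients")

The number of non-zero coefficients falls as alpha grows, mirroring the limits listed above: at λ = 0 lasso reduces to ordinary least squares, and as λ → ∞ every coefficient is eliminated.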

Lasso Regression in Python

For this example code, we will consider a dataset from MachineHack’s Predicting Restaurant Food Cost Hackathon.

About the Data Set

The task here is about predicting the average price for a meal. The data consists of the following features.

Size of training set: 12,690 records

Size of test set: 4,231 records

Columns/Features

TITLE: The restaurant’s category, which helps identify what it serves and whom it is suitable for.

RESTAURANT_ID: A unique ID for each restaurant.

CUISINES: The variety of cuisines that the restaurant offers.

TIME: The open hours of the restaurant.

CITY: The city in which the restaurant is located.

LOCALITY: The locality of the restaurant.

RATING: The average rating of the restaurant by customers.

VOTES: The overall votes received by the restaurant.

COST: The average cost of a two-person meal.

After completing all the steps up to (but excluding) feature scaling, we can proceed to building a Lasso regression model. In older versions of scikit-learn, scaling could be delegated to the regressor itself via Lasso(normalize=True); that parameter was removed in scikit-learn 1.2, so the code below scales the features in a pipeline instead.

Lasso regression example

import numpy as np

Creating New Train and Validation Datasets

from sklearn.model_selection import train_test_split

# new_data_train is the preprocessed training set from the earlier steps
data_train, data_val = train_test_split(new_data_train, test_size=0.2, random_state=2)

Classifying Predictors and Target

#Classifying Independent and Dependent Features
#_______________________________________________
#Dependent Variable
Y_train = data_train.iloc[:, -1].values
#Independent Variables
X_train = data_train.iloc[:, 0:-1].values
#Independent Variables for Test Set
X_test = data_val.iloc[:, 0:-1].values

Evaluating the Model with RMSLE

def score(y_pred, y_true):
    # Score = 1 - RMSLE (root mean squared log error, base 10 here)
    error = np.square(np.log10(y_pred + 1) - np.log10(y_true + 1)).mean() ** 0.5
    return 1 - error

actual_cost = np.asarray(data_val['COST'])

Building the Lasso Regressor

#Lasso Regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#Lasso(normalize=True) was removed in scikit-learn 1.2; scaling the
#features inside a pipeline is the recommended replacement (results may
#differ slightly from runs that used the old normalize option)
lasso_reg = make_pipeline(StandardScaler(), Lasso())
#Fitting the training data to the Lasso regressor
lasso_reg.fit(X_train, Y_train)
#Predicting for X_test
y_pred_lass = lasso_reg.predict(X_test)
#Printing the score with RMSLE
print("\n\nLasso SCORE : ", score(y_pred_lass, actual_cost))

Output

0.7335508027883148

The Lasso regression attained a score of roughly 0.73 (where the score is 1 − RMSLE) on the validation set.
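
One caveat: the snippet above uses the default regularization strength, while in practice λ should be tuned. Below is a minimal sketch using scikit-learn's LassoCV, which selects alpha by cross-validation; it reuses X_train and Y_train from above, and the cv value is an arbitrary choice.

from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# LassoCV searches a grid of alpha values with internal cross-validation.
lasso_cv = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=2))
lasso_cv.fit(X_train, Y_train)
print("Chosen alpha:", lasso_cv[-1].alpha_)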

Lasso Regression in R

We’ll use the Boston Housing dataset, which is a classic dataset used for regression tasks. It includes information about various factors that might influence the median value of homes in different Boston neighborhoods.

Columns/Features:

crim: Per capita crime rate by town.

zn: Proportion of residential land zoned for lots over 25,000 sq. ft.

indus: Proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: Nitrogen oxides concentration (parts per 10 million).

rm: Average number of rooms per dwelling.

age: Proportion of owner-occupied units built prior to 1940.

dis: Weighted distances to five Boston employment centers.

rad: Index of accessibility to radial highways.

tax: Full-value property tax rate per $10,000.

ptratio: Pupil-teacher ratio by town.

b: 1000(Bk – 0.63)^2, where Bk is the proportion of Black residents by town.

lstat: Percentage lower status of the population.

medv: Median value of owner-occupied homes in $1000s (target variable).

Code Implementation

# Load necessary libraries
library(MASS)    # provides the Boston Housing dataset
library(glmnet)  # Lasso Regression
library(caret)   # data partitioning

# Load the Boston Housing dataset
data(Boston)
head(Boston)

# Data preprocessing
X <- as.matrix(Boston[, -14])  # Features (all columns except medv)
Y <- Boston$medv               # Target variable

# Splitting data into training and test sets
set.seed(123)
train_index <- createDataPartition(Y, p = 0.8, list = FALSE)
train_data <- X[train_index, ]
test_data <- X[-train_index, ]
train_label <- Y[train_index]
test_label <- Y[-train_index]

# Building the Lasso Regression model (alpha = 1 selects the Lasso penalty)
lasso_model <- glmnet(train_data, train_label, alpha = 1)

# Predicting on the test set at a fixed lambda of 0.01
predictions <- predict(lasso_model, newx = test_data, s = 0.01)

# Evaluating the model with Root Mean Squared Error
rmse <- sqrt(mean((predictions - test_label)^2))

# Printing the results
cat("Lasso Regression RMSE:", rmse, "\n")

# Plotting the coefficient paths against lambda
plot(lasso_model, xvar = "lambda", main = "Lasso Coefficients Path")

Explanation

1. Loading Libraries and Dataset:
We load glmnet for Lasso Regression, caret for data partitioning, and MASS, which provides the Boston Housing dataset.
The dataset is loaded using data(Boston).

2. Data Preprocessing:
We separate the data into a feature matrix X (all columns except the target medv) and a target vector Y.

3. Splitting Data:
We split the data into training and test sets using createDataPartition from caret, keeping 80% for training.

4. Building Lasso Regression Model:
We create a Lasso Regression model using glmnet with alpha = 1 (to enforce Lasso regularization).

5. Predicting on Test Set:
We make predictions on the test set using the trained Lasso model, evaluated at a fixed λ of 0.01 via the s argument.

6. Evaluating the Model:
We evaluate the model’s performance using Root Mean Squared Error (RMSE) between predicted and actual house prices.

7. Output:
The RMSE value is printed as a measure of model accuracy.

8. Visualizing Coefficients:
Lastly, we plot the Lasso coefficient path over different values of lambda (regularization parameter), which helps in understanding which features are important in predicting house prices.

Output

Lasso Regression RMSE: 4.61534

This indicates the root mean squared error of the Lasso Regression model on predicting house prices in the test set.

Difference Between Ridge Regression and Lasso Regression

Ridge Regression | Lasso Regression
The penalty term is the sum of the squares of the coefficients (L2 regularization). | The penalty term is the sum of the absolute values of the coefficients (L1 regularization).
Shrinks the coefficients but doesn't set any coefficient to zero. | Can shrink some coefficients to exactly zero, effectively performing feature selection.
Helps to reduce overfitting by shrinking large coefficients. | Helps to reduce overfitting by shrinking coefficients and discarding less important features.
Works well when most features carry some signal. | Works well when only a few features carry signal (a sparse underlying model).
Applies proportional shrinkage to the coefficients. | Performs "soft thresholding" of the coefficients.

In short, Ridge is a shrinkage model, and Lasso is a feature selection model. Ridge tries to balance the bias-variance trade-off by shrinking the coefficients, but it does not select any feature and keeps all of them. Lasso tries to balance the bias-variance trade-off by shrinking some coefficients to zero.

In this way, Lasso can be seen as an optimizer for feature selection.
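
The contrast in the table is easiest to see in the special case of an orthonormal design, where both estimators reduce to simple closed-form transformations of the ordinary least squares coefficient b. The sketch below is our own illustration of the two shrinkage operators; lam denotes the regularization strength.

import numpy as np

def lasso_shrink(b, lam):
    # Soft thresholding: move b toward zero by lam, clipping at zero.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ridge_shrink(b, lam):
    # Proportional shrinkage: rescale b; it never reaches exactly zero.
    return b / (1.0 + lam)

b = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(lasso_shrink(b, 1.0))  # [-2. -0.  0.  0.  3.]   -> exact zeros appear
print(ridge_shrink(b, 1.0))  # [-1.5 -0.25 0.1 0.5 2.] -> shrunk, none zero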

Interpretations and Generalizations

Interpretations:

  1. Geometric Interpretations
  2. Bayesian Interpretations
  3. Convex relaxation Interpretations
  4. Making λ easier to interpret with an accuracy-simplicity tradeoff

Generalizations

  1. Elastic Net
  2. Group Lasso
  3. Fused Lasso
  4. Adaptive Lasso
  5. Prior Lasso
  6. Quasi-norms and bridge regression


Conclusion


LASSO regression emerges as a crucial technique for statistical modeling and machine learning, striking a balance between model simplicity and accuracy.

With its ability to promote sparsity through feature selection, LASSO regression aids in identifying relevant variables and managing overfitting, particularly in high-dimensional datasets. To deepen your understanding of LASSO and other essential skills, explore Great Learning’s free Python courses, which provide flexible learning options and opportunities to enhance your expertise.

For those aspiring to secure lucrative careers in the field of artificial intelligence and machine learning, the PG Program in Artificial Intelligence & Machine Learning offers comprehensive training, weekly online mentorship by experts, project assistance, and networking opportunities, helping you to develop a robust skill set and advance your career prospects.

What is Lasso regression used for?

Lasso regression is used for automatic variable elimination and feature selection.

What is lasso and ridge regression?

Lasso regression can shrink coefficients all the way to zero, while ridge regression is a model-tuning method used to analyse data suffering from multicollinearity; ridge shrinks coefficients but never sets them exactly to zero.

What is Lasso Regression in machine learning?

In machine learning, lasso regression is a regularized linear regression technique: it adds an L1 penalty to the loss function, which shrinks some coefficients to exactly zero and thereby performs automatic feature selection.

Why does Lasso shrink coefficients to zero?

The L1 regularization performed by Lasso causes the regression coefficients of the less contributing variables to shrink to zero or near zero.

Is lasso better than Ridge?

Neither is universally better. Lasso is often preferred when only a subset of features is expected to matter, since it shrinks the coefficients of the remaining features to zero; ridge tends to perform better when most features contribute to the outcome.

How does Lasso regression work?

Lasso regression uses shrinkage, where the data values are shrunk towards a central point such as the mean value.

What is the Lasso penalty?

The Lasso penalty shrinks or reduces the coefficient value towards zero. The less contributing variable is therefore allowed to have a zero or near-zero coefficient.

Is lasso L1 or L2?

A regression model using the L1 regularization technique is called Lasso Regression, while a model using L2 is called Ridge Regression. The difference between the two lies in the penalty term.

Is lasso supervised or unsupervised?

Lasso is a supervised regularization method used in machine learning.
