LASSO Regularization & Variable Selection via Cross-Validation¶

Cameron Batts — LASSO · StandardScaler Pipeline · GridSearchCV · OLS Comparison

In [1]:
import numpy as np

from sklearn.datasets import make_regression
from sklearn.linear_model import (
    Lasso,
    LinearRegression
)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

Let's simulate some data for a regression problem.

In [2]:
n = 1000  # number of observations in the data
p = 100   # number of predictors in the data
k = 10    # number of relevant predictors in the data
In [3]:
X, y, coef = make_regression(n_samples=n,
                             n_features=p,
                             n_informative=k,
                             noise=0.005,  # add a little gaussian noise to the data
                             coef=True,
                             random_state=42)

Keep in mind that we generate the data here, therefore we know the "truth" about the data generating process: only 10 predictors are relevant out of the $p = 100$ predictors available in the data.

In fact, here are the true coefficients of the data generating process (only the 10 non-zero coefficients are printed):

In [4]:
print('\n'.join([f'beta_{i}: {coef[i]}' for i in range(p) if coef[i] != 0.0]))
beta_6: 8.88918861576965
beta_15: 19.365117777553486
beta_47: 48.24477094388765
beta_56: 82.84660305904961
beta_66: 1.8559304095025264
beta_73: 29.506960083972224
beta_80: 68.28011328758366
beta_85: 78.46795175040386
beta_87: 25.37933359399561
beta_90: 57.00013284018336

We would expect that given a sample of $n = 1000$ data points from this data generating process, a LASSO model will be able to select exactly 10 predictors and discard the remaining ones (i.e. set their $\widehat{\beta}$ estimates to 0).

Let's verify empirically whether this is in fact the case.

In [5]:
lasso_pipeline =  make_pipeline(StandardScaler(), 
    Lasso())

Here we create a grid of 50 (by default) equally spaced (in the log scale) candidate values for the regularization parameter $\alpha$.

In [6]:
alpha_candidates = np.logspace(-4.0, -2.0)
In [7]:
print(len(alpha_candidates))
50
In [8]:
alpha_search = GridSearchCV(lasso_pipeline,
             {'lasso__alpha': alpha_candidates},
             cv=2).fit(X, y)

Here we count how many of the model coefficients for the best LASSO model (i.e. the one corresponding to the optimal value of $\alpha$ selected via cross-validation) are non-zero.

In [9]:
try:
    print(np.sum(alpha_search.best_estimator_[1].coef_ != 0))
except NameError:
    print('The object `alpha_search` does not exist!')
10

In particular, which coefficients are non-zero?

In [10]:
try:
    print('\n'.join([f'beta_hat_{i}: {alpha_search.best_estimator_[1].coef_[i]}'
           for i in range(p)
           if alpha_search.best_estimator_[1].coef_[i] != 0.0]))
except NameError:
    print('The object `alpha_search` does not exist!')
beta_hat_6: 8.687958221359063
beta_hat_15: 19.726802408850176
beta_hat_47: 48.34738538280529
beta_hat_56: 84.70297559430581
beta_hat_66: 1.9054249244538664
beta_hat_73: 28.963729026956305
beta_hat_80: 62.93734556951447
beta_hat_85: 76.56912272269763
beta_hat_87: 25.280346286439105
beta_hat_90: 58.5058168712546

For reference, here are the true $\beta$ values that these coefficients are trying to estimate:

In [11]:
print('\n'.join([f'beta_{i}: {coef[i]}' for i in range(p) if coef[i] != 0.0]))
beta_6: 8.88918861576965
beta_15: 19.365117777553486
beta_47: 48.24477094388765
beta_56: 82.84660305904961
beta_66: 1.8559304095025264
beta_73: 29.506960083972224
beta_80: 68.28011328758366
beta_85: 78.46795175040386
beta_87: 25.37933359399561
beta_90: 57.00013284018336

Let's see what would happen if we fitted a conventional multiple regression model on these data instead.

In [12]:
linear_regression = LinearRegression().fit(X, y)
In [13]:
np.sum(linear_regression.coef_ != 0)
Out[13]:
np.int64(100)

The Lasso model is the perfered method here and learned best compared to the standard linear regression model as was evident when we looked at the amount of non-zero coefficients where the Linear Regression had 100 and the Lasso only had 10. This signifies that the Lasso method is better at filtering out the noise compared to the Linear regression which tried to overfit and include more noise which dilutes the accurate data/alpha in this case.