LASSO Regularization & Variable Selection via Cross-Validation¶
Cameron Batts — LASSO · StandardScaler Pipeline · GridSearchCV · OLS Comparison
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import (
Lasso,
LinearRegression
)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
Let's simulate some data for a regression problem.
n = 1000 # number of observations in the data
p = 100 # number of predictors in the data
k = 10 # number of relevant predictors in the data
X, y, coef = make_regression(n_samples=n,
n_features=p,
n_informative=k,
noise=0.005, # add a little gaussian noise to the data
coef=True,
random_state=42)
Keep in mind that we generate the data here, therefore we know the "truth" about the data generating process: only 10 predictors are relevant out of the $p = 100$ predictors available in the data.
In fact, here are the true coefficients of the data generating process (only the 10 non-zero coefficients are printed):
print('\n'.join([f'beta_{i}: {coef[i]}' for i in range(p) if coef[i] != 0.0]))
beta_6: 8.88918861576965 beta_15: 19.365117777553486 beta_47: 48.24477094388765 beta_56: 82.84660305904961 beta_66: 1.8559304095025264 beta_73: 29.506960083972224 beta_80: 68.28011328758366 beta_85: 78.46795175040386 beta_87: 25.37933359399561 beta_90: 57.00013284018336
We would expect that given a sample of $n = 1000$ data points from this data generating process, a LASSO model will be able to select exactly 10 predictors and discard the remaining ones (i.e. set their $\widehat{\beta}$ estimates to 0).
Let's verify empirically whether this is in fact the case.
lasso_pipeline = make_pipeline(StandardScaler(),
Lasso())
Here we create a grid of 50 (by default) equally spaced (in the log scale) candidate values for the regularization parameter $\alpha$.
alpha_candidates = np.logspace(-4.0, -2.0)
print(len(alpha_candidates))
50
alpha_search = GridSearchCV(lasso_pipeline,
{'lasso__alpha': alpha_candidates},
cv=2).fit(X, y)
Here we count how many of the model coefficients for the best LASSO model (i.e. the one corresponding to the optimal value of $\alpha$ selected via cross-validation) are non-zero.
try:
print(np.sum(alpha_search.best_estimator_[1].coef_ != 0))
except NameError:
print('The object `alpha_search` does not exist!')
10
In particular, which coefficients are non-zero?
try:
print('\n'.join([f'beta_hat_{i}: {alpha_search.best_estimator_[1].coef_[i]}'
for i in range(p)
if alpha_search.best_estimator_[1].coef_[i] != 0.0]))
except NameError:
print('The object `alpha_search` does not exist!')
beta_hat_6: 8.687958221359063 beta_hat_15: 19.726802408850176 beta_hat_47: 48.34738538280529 beta_hat_56: 84.70297559430581 beta_hat_66: 1.9054249244538664 beta_hat_73: 28.963729026956305 beta_hat_80: 62.93734556951447 beta_hat_85: 76.56912272269763 beta_hat_87: 25.280346286439105 beta_hat_90: 58.5058168712546
For reference, here are the true $\beta$ values that these coefficients are trying to estimate:
print('\n'.join([f'beta_{i}: {coef[i]}' for i in range(p) if coef[i] != 0.0]))
beta_6: 8.88918861576965 beta_15: 19.365117777553486 beta_47: 48.24477094388765 beta_56: 82.84660305904961 beta_66: 1.8559304095025264 beta_73: 29.506960083972224 beta_80: 68.28011328758366 beta_85: 78.46795175040386 beta_87: 25.37933359399561 beta_90: 57.00013284018336
Let's see what would happen if we fitted a conventional multiple regression model on these data instead.
linear_regression = LinearRegression().fit(X, y)
np.sum(linear_regression.coef_ != 0)
np.int64(100)
The Lasso model is the perfered method here and learned best compared to the standard linear regression model as was evident when we looked at the amount of non-zero coefficients where the Linear Regression had 100 and the Lasso only had 10. This signifies that the Lasso method is better at filtering out the noise compared to the Linear regression which tried to overfit and include more noise which dilutes the accurate data/alpha in this case.