AdaBoost Ensemble for Bank Marketing Campaign ROI

Cameron Batts — AdaBoostClassifier · One-Hot Encoding · Custom Value Function

In [1]:
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
In [2]:
bank_data = pd.read_csv("data/bank_data.csv")
bank_data.head()
Out[2]:
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
0 58;"management";"married";"tertiary";"no";2143...
1 44;"technician";"single";"secondary";"no";29;"...
2 33;"entrepreneur";"married";"secondary";"no";2...
3 47;"blue-collar";"married";"unknown";"no";1506...
4 33;"unknown";"single";"unknown";"no";1;"no";"n...
In [3]:
bank_data = pd.read_csv("data/bank_data.csv", sep=";")
bank_data.head()
Out[3]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
In [4]:
try:
    bank_data.drop(['day', 'month', 'duration', 'pdays', 'poutcome'], axis=1, inplace=True)
except NameError:
    print('The object `bank_data` does not exist!')

Here, we encode the response variable (whether or not a customer subscribed to a term deposit with the bank following a marketing campaign) as a binary indicator.

In [5]:
try:
    bank_data["y"] = bank_data['y'].apply(lambda y: 1 if y == 'yes' else 0)
    bank_data.head()
except NameError:
    print('The object `bank_data` does not exist!')
In [6]:
bank_data = pd.get_dummies(bank_data)
bank_data.head()
Out[6]:
age balance campaign previous y job_admin. job_blue-collar job_entrepreneur job_housemaid job_management ... education_unknown default_no default_yes housing_no housing_yes loan_no loan_yes contact_cellular contact_telephone contact_unknown
0 58 2143 1 0 0 False False False False True ... False True False False True True False False False True
1 44 29 1 0 0 False False False False False ... False True False False True True False False False True
2 33 2 1 0 0 False False True False False ... False True False False True False True False False True
3 47 1506 1 0 0 False True False False False ... True True False False True True False False False True
4 33 1 1 0 0 False False False False False ... True True False True False True False False False True

5 rows × 33 columns
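
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are illustrative assumptions, not rows from the bank data. Numeric columns pass through unchanged, while each text column is expanded into one indicator column per category.

```python
import pandas as pd

# Hypothetical miniature of the bank data: one numeric column,
# one categorical column, and the already-encoded response.
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
    "y": [0, 0, 1],
})

# Only the object-typed column is expanded; numeric columns are kept as they are.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
# ['age', 'y', 'marital_married', 'marital_single']
```

This is why the response `y` must be converted to 0/1 before the call: left as text, it would itself be split into indicator columns.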

Let's now separate the predictors from the response variable.

In [7]:
try:
    X = bank_data.drop('y', axis=1)
    Y = bank_data['y']
except NameError:
    print('The object `bank_data` does not exist!')
In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42, test_size=0.25)
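
One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split` so that the share of subscribers is roughly the same in the training and test sets. This can matter when the positive class is rare. The sketch below uses made-up labels (`X_toy`, `y_toy`) purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives out of 100 observations.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks the splitter to preserve the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```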

We want to use the available data to build a predictive model that can assist us in making our next marketing campaign more efficient.

Here are some quantities to keep in mind. Our department estimates that:

  • a marketing contact with a potential customer costs around 10 Euros on average

  • a successful contact (i.e. the customer subscribes to a term deposit) generates on average 100 Euros of profit for the bank (say, present value net of the cost of marketing).

Accordingly, we estimate that:

  • the value associated with a true negative prediction from our model is 10 Euros (it saves us the waste of 10 Euros associated with the marketing contact)

  • the value associated with a false positive prediction from our model is -10 Euros

  • the value associated with a false negative prediction from our model is -100 Euros

  • the value associated with a true positive prediction from our model is +100 Euros.

Let's encode this information in a "value function" that we will use later.

In [9]:
def value_function(y_true, y_pred, tn_value=10, fp_value=-10, fn_value=-100, tp_value=100):
    """Average value per customer, weighting each confusion-matrix cell by its payoff."""
    # For binary labels, the (sum, difference) of prediction and truth identifies each cell.
    sum_ = y_pred + y_true
    diff_ = y_pred - y_true
    tn_contrib = tn_value * np.mean((sum_ == 0) & (diff_ == 0))
    fp_contrib = fp_value * np.mean((sum_ == 1) & (diff_ == 1))
    fn_contrib = fn_value * np.mean((sum_ == 1) & (diff_ == -1))
    tp_contrib = tp_value * np.mean((sum_ == 2) & (diff_ == 0))
    return tn_contrib + fp_contrib + fn_contrib + tp_contrib
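
As a sanity check on the arithmetic, the sketch below recomputes the same quantity in plain Python on a made-up five-customer example. The helper name `value_per_customer` and the toy labels are illustrative assumptions, not part of the original notebook. It is meant to mirror `value_function`, which averages the payoff over all customers rather than summing it.

```python
def value_per_customer(y_true, y_pred, tn_value=10, fp_value=-10, fn_value=-100, tp_value=100):
    """Average payoff per customer, looked up from each (actual, predicted) pair."""
    payoff = {(0, 0): tn_value, (0, 1): fp_value, (1, 0): fn_value, (1, 1): tp_value}
    total = sum(payoff[(actual, predicted)] for actual, predicted in zip(y_true, y_pred))
    return total / len(y_true)


# Toy labels (made up for illustration): 2 TN, 1 FP, 1 FN, 1 TP.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]

toy_value = value_per_customer(y_true, y_pred)
print(toy_value)  # (2*10 - 10 - 100 + 100) / 5 = 2.0
```

Note how a single missed subscriber (-100) cancels out the one correctly identified subscriber (+100), so false negatives dominate the score under these payoffs.
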
In [10]:
ada_boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(random_state=42, max_depth=5),
    n_estimators=2000,
    learning_rate=0.80,
    random_state=42)
ada_boost.fit(X_train, Y_train)
Out[10]:
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=5,
                                                    random_state=42),
                   learning_rate=0.8, n_estimators=2000, random_state=42)

We will also fit a conventional decision tree for reference.

In [11]:
try:
    tree = DecisionTreeClassifier(random_state=42).fit(X_train, Y_train)
except NameError:
    print('The objects `X_train, Y_train` do not exist!')
In [12]:
ada_value = value_function(Y_test.values, ada_boost.predict(X_test))
tree_value = value_function(Y_test.values, tree.predict(X_test))

print('AdaBoost value function:', ada_value)
print('Decision Tree value function:', tree_value)

# The decision tree scored higher, so it looks like it's doing better here than AdaBoost.
AdaBoost value function: -0.7431655312748837
Decision Tree value function: 0.9926568167743075
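
To put scores like these in context, it can help to compare them with two trivial strategies: contacting nobody (predict 0 for everyone) and contacting everybody (predict 1 for everyone). The sketch below works out their value-function scores from the share of subscribers alone. The function name and the 10% share are assumptions for illustration; the actual share can be read from `Y_test.mean()`.

```python
def trivial_strategy_values(positive_rate, tn_value=10, fp_value=-10, fn_value=-100, tp_value=100):
    """Value-function score of contacting nobody vs. everybody, given the subscriber share."""
    negative_rate = 1 - positive_rate
    # Predicting 0 for everyone yields only true negatives and false negatives.
    contact_nobody = tn_value * negative_rate + fn_value * positive_rate
    # Predicting 1 for everyone yields only false positives and true positives.
    contact_everybody = fp_value * negative_rate + tp_value * positive_rate
    return contact_nobody, contact_everybody


# Hypothetical subscriber share of 10%; substitute Y_test.mean() for the real figure.
nobody, everybody = trivial_strategy_values(0.10)
print(round(nobody, 2), round(everybody, 2))
```

Under that assumed share, the two baselines come out at roughly -1 and +1 per customer, which gives a rough yardstick for judging whether a model adds value beyond a blanket policy.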

Let's now quantify the monetary impact on our marketing campaign of using the AdaBoostClassifier as opposed to the DecisionTreeClassifier.

First off, for the evaluation of a given model, we will assume that in our marketing campaign we will only contact customers that are predicted as subscribers by our model.

With this in mind, let's create a "marketing campaign profit function".

In [13]:
def marketing_profits(model, X, Y, fp_value=-10, tp_value=100):
    # Predict once and reuse the result; only predicted subscribers are contacted.
    contacted = model.predict(X) > 0
    tp_contrib = np.sum(contacted & (Y > 0)) * tp_value
    fp_contrib = np.sum(contacted & (Y < 1)) * fp_value
    return tp_contrib + fp_contrib
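
A small worked example may make the profit logic easier to follow. The sketch below is a plain-Python stand-in (the name `campaign_profit` and the toy labels are assumptions for illustration) that takes predictions directly instead of a fitted model. Unlike the value function, it returns a total rather than a per-customer average, and customers who are not contacted contribute nothing.

```python
def campaign_profit(y_true, y_pred, fp_value=-10, tp_value=100):
    """Total profit when only predicted subscribers (y_pred == 1) are contacted."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for actual, predicted in pairs if predicted == 1 and actual == 1)
    false_positives = sum(1 for actual, predicted in pairs if predicted == 1 and actual == 0)
    return true_positives * tp_value + false_positives * fp_value


# Toy labels (made up): two customers are contacted, one of whom subscribes.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]

toy_profit = campaign_profit(y_true, y_pred)
print(toy_profit)  # 1 * 100 + 1 * (-10) = 90
```

With these payoffs, contacting a group is profitable on average whenever more than 10/110 (about 9%) of the contacted customers subscribe, since 100q - 10(1 - q) > 0 exactly when q > 1/11.
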
In [14]:
ada_prof = marketing_profits(ada_boost, X_test, Y_test)
tree_prof = marketing_profits(tree, X_test, Y_test)
percent_diff = ((ada_prof - tree_prof) / tree_prof) * 100
print('Percent change in profits using AdaBoost vs Decision Tree:', percent_diff, '%')
Percent change in profits using AdaBoost vs Decision Tree: -41.72692471288813 %
In [15]:
def _value_function(y, y_pred, **kwargs):
    return value_function(y, y_pred, **kwargs)


value_function_wrapper = make_scorer(_value_function)
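
For readers unfamiliar with `make_scorer`, the sketch below shows what the resulting object does: it is called as `scorer(estimator, X, y)`, runs the estimator's `predict` on `X`, and passes the true and predicted labels to the metric. Extra keyword arguments given to `make_scorer` are forwarded to the metric. The metric `mean_value` is a hypothetical stand-in for `value_function`, and the toy data is made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer


def mean_value(y_true, y_pred, tn_value=10, fp_value=-10, fn_value=-100, tp_value=100):
    """Average payoff per customer (illustrative stand-in for value_function)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    payoff = np.where(
        y_true == 1,
        np.where(y_pred == 1, tp_value, fn_value),
        np.where(y_pred == 1, fp_value, tn_value),
    )
    return payoff.mean()


# Toy data (made up): three non-subscribers and one subscriber.
X_toy = np.zeros((4, 1))
y_toy = np.array([0, 0, 0, 1])

# A classifier that always predicts "no", so the outcome is 3 TN and 1 FN.
always_no = DummyClassifier(strategy="constant", constant=0).fit(X_toy, y_toy)

default_scorer = make_scorer(mean_value)
custom_scorer = make_scorer(mean_value, fn_value=-50)  # keyword forwarded to mean_value

default_score = default_scorer(always_no, X_toy, y_toy)
custom_score = custom_scorer(always_no, X_toy, y_toy)
print(default_score)  # (3*10 - 100) / 4 = -17.5
print(custom_score)   # (3*10 - 50) / 4 = -5.0
```

Because `greater_is_better` defaults to `True`, a search such as `GridSearchCV` treats higher values of the metric as better, which is the right direction for a payoff.
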
In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# For my model I went with a random forest, since it is essentially many decision trees
# working together, which should hopefully do better than a single tree.

# The data is imbalanced (only about 10% said yes), so I'm using class_weight='balanced'
# so that the model does not just ignore the "yes" group.

rf = RandomForestClassifier(class_weight='balanced', random_state=42)

# trying out a few different values for number of trees and max depth
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None]
}

# value_function_wrapper (built with make_scorer in the previous cell) lets GridSearchCV
# use the value function as its scoring metric

# Run a grid search with 5-fold CV, using our value function as the metric,
# so the model is tuned on what actually matters for this problem.
grid_search = GridSearchCV(rf, param_grid, scoring=value_function_wrapper, cv=5)
grid_search.fit(X_train, Y_train)

print('Best parameters:', grid_search.best_params_)
print('Best CV score:', grid_search.best_score_)

best_rf = grid_search.best_estimator_

# now comparing all 3 models on the test set with the value function
rf_val = value_function(Y_test.values, best_rf.predict(X_test))
ada_val = value_function(Y_test.values, ada_boost.predict(X_test))
tree_val = value_function(Y_test.values, tree.predict(X_test))

print('\nValue Function Results:')
print('AdaBoost:', ada_val)
print('Decision Tree:', tree_val)
print('Random Forest (tuned):', rf_val)

# and now checking the marketing profits for each model
rf_prof = marketing_profits(best_rf, X_test, Y_test)
ada_prof = marketing_profits(ada_boost, X_test, Y_test)
tree_prof = marketing_profits(tree, X_test, Y_test)

print('\nMarketing Profits:')
print('AdaBoost:', ada_prof)
print('Decision Tree:', tree_prof)
print('Random Forest (tuned):', rf_prof)

# percent change vs decision tree
rf_vs_tree = ((rf_prof - tree_prof) / tree_prof) * 100
print('\nRF vs Decision Tree profit change:', round(rf_vs_tree, 2), '%')

# percent change vs adaboost
rf_vs_ada = ((rf_prof - ada_prof) / ada_prof) * 100
print('RF vs AdaBoost profit change:', round(rf_vs_ada, 2), '%')
Best parameters: {'max_depth': 5, 'n_estimators': 200}
Best CV score: 7.150506182578337

Value Function Results:
AdaBoost: -0.7431655312748837
Decision Tree: 0.9926568167743075
Random Forest (tuned): 6.952136600902416

Marketing Profits:
AdaBoost: 13700
Decision Tree: 23510
Random Forest (tuned): 57190

RF vs Decision Tree profit change: 143.26 %
RF vs AdaBoost profit change: 317.45 %
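
The percent changes reported above can be reproduced directly from the profit totals in the output. The sketch below does so with a small helper; the name `percent_change` and the guard against a non-positive baseline are additions for illustration, not part of the original notebook.

```python
def percent_change(new, baseline):
    """Relative change of `new` versus `baseline`, in percent."""
    if baseline <= 0:
        # With a zero or negative baseline the ratio is undefined or misleading.
        raise ValueError("percent change needs a positive baseline")
    return (new - baseline) / baseline * 100


# Profit totals copied from the notebook output above.
profits = {"AdaBoost": 13700, "Decision Tree": 23510, "Random Forest (tuned)": 57190}

rf_vs_tree = percent_change(profits["Random Forest (tuned)"], profits["Decision Tree"])
rf_vs_ada = percent_change(profits["Random Forest (tuned)"], profits["AdaBoost"])
ada_vs_tree = percent_change(profits["AdaBoost"], profits["Decision Tree"])

print(round(rf_vs_tree, 2))   # 143.26
print(round(rf_vs_ada, 2))    # 317.45
print(round(ada_vs_tree, 2))  # -41.73
```

The guard matters because a model whose profit is zero or negative would make the ratio undefined or flip its sign; in that case comparing absolute differences in Euros is the safer choice.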