Introduction

  • The goal of this task is to fit a statistical model to historical credit data and then use the model to estimate the value of current loans.

  • Since this is a binary classification problem where the target variable is either DEFAULT or PAID_UP, I will use Logistic Regression.

  • In addition to model building, I will illustrate how sklearn’s Pipeline is more convenient and less error-prone than performing every transformation step manually, especially for cross-validation and hyperparameter tuning.

Data descriptions

| Column | Description |
|---|---|
| account_no | A unique account number per loan |
| gender | The gender of the account holder, either "M" or "F" |
| age | The age of the account holder at the point of application |
| income | The monthly net income of the account holder at the point of application |
| loan_amount | The amount of money lent |
| term | The number of months that the loan is to be repaid over |
| installment_amount | The monthly installment amount |
| interest_rate | The interest rate on the loan |
| credit_score_at_application | The credit score at the point of application; a positive integer less than 1000. The higher the score, the more creditworthy the applicant is believed to be |
| outstanding_balance | The remaining amount of the loan that still has to be repaid |
| status | Indicates what state the account is in. This field can take one of three values: |

- "LIVE": The loan is still being repaid - the field outstanding_balance will be greater than zero.
- "PAID_UP": The loan has been completely repaid - the field outstanding_balance will be zero.
- "DEFAULT": The loan was not fully repaid and no further payments can be expected - the field outstanding_balance will be greater than zero and the amount will not be recoverable.

0. Imports and read data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_curve
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
df = pd.read_csv('./data/loan_default.csv')
df.head()
| index | account_no | gender | age | income | loan_amount | term | installment_amount | interest_rate | credit_score_at_application | outstanding_balance | status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | acc_00000316 | F | 18 | 12143 | 47000 | 60 | 1045 | 0.12 | 860 | 0 | PAID_UP |
| 1 | acc_00000422 | F | 18 | 6021 | 13000 | 60 | 330 | 0.18 | 640 | 0 | PAID_UP |
| 2 | acc_00001373 | F | 39 | 12832 | 13000 | 60 | 296 | 0.13 | 820 | 0 | PAID_UP |
| 3 | acc_00001686 | F | 33 | 4867 | 5000 | 36 | 191 | 0.22 | 540 | 0 | PAID_UP |
| 4 | acc_00001733 | F | 23 | 5107 | 22000 | 36 | 818 | 0.20 | 580 | 11314 | LIVE |

1. Exploratory Data Analysis

  • account_no is just a unique identifier with no predictive power, so we will drop it.
  • outstanding_balance is zero exactly when status is PAID_UP (and positive otherwise), so we will drop this column as well to prevent data leakage; a quick check is sketched below.
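A quick sanity check (a sketch; it simply tabulates the raw data) shows that outstanding_balance is positive for LIVE and DEFAULT loans and zero for PAID_UP loans, so keeping it would leak the label into the features:

# crosstab of status against whether any balance remains
print(pd.crosstab(df['status'], df['outstanding_balance'] > 0))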

1.1 Remove LIVE rows

  • As the business goal of this task is to estimate the value of current loans, we will drop rows where the status column is LIVE.
  • We will train the model on terminal loan statuses and run predictions on this LIVE dataset later.
df_full_train = df.drop(df.loc[df['status']=='LIVE'].index)
df_live = df.loc[df['status']=='LIVE']

The proportion of DEFAULT is about 7%. This baseline is useful for sanity-checking the final model.

df_full_train['status'].value_counts(normalize=True).round(3)

>>> PAID_UP    0.931
>>> DEFAULT    0.069
>>> Name: status, dtype: float64

Since the target variable is either PAID_UP or DEFAULT, we can simply map the values to 0 and 1. As we are interested in DEFAULT, I will make it the positive class with value 1.

df_full_train['status'] = df_full_train['status'].map({
    'PAID_UP': 0,
    'DEFAULT': 1
})

df_full_train.head()
| index | account_no | gender | age | income | loan_amount | term | installment_amount | interest_rate | credit_score_at_application | outstanding_balance | status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | acc_00000316 | F | 18 | 12143 | 47000 | 60 | 1045 | 0.12 | 860 | 0 | 0 |
| 1 | acc_00000422 | F | 18 | 6021 | 13000 | 60 | 330 | 0.18 | 640 | 0 | 0 |
| 2 | acc_00001373 | F | 39 | 12832 | 13000 | 60 | 296 | 0.13 | 820 | 0 | 0 |
| 3 | acc_00001686 | F | 33 | 4867 | 5000 | 36 | 191 | 0.22 | 540 | 0 | 0 |
| 5 | acc_00002114 | M | 38 | 9328 | 25000 | 36 | 904 | 0.18 | 630 | 0 | 0 |
X = df_full_train.drop(['account_no', 'outstanding_balance', 'status'], axis=1)
y = df_full_train['status']

1.2 Univariate distributions of features

NUMERICAL = list(df.select_dtypes(include='number').columns)
CATEGORICAL = list(df.select_dtypes(include='object').columns)
print(NUMERICAL)
print(CATEGORICAL)

>>> ['age', 'income', 'loan_amount', 'term', 'installment_amount', 'interest_rate', 'credit_score_at_application', 'outstanding_balance']
>>> ['account_no', 'gender', 'status']

To understand the distributions of numerical data, we can plot individual histograms.

  • Since I want to interpret the coefficients, I will leave the features as-is for the base model instead of transforming them.
  • For this dataset, we don’t have issues with outliers except for outstanding_balance, but we will not be using that feature anyway.
fig, axes = plt.subplots(2, 4, figsize=(12, 6))

for idx, col_name in enumerate(NUMERICAL):
    ax = axes[idx//4, idx%4]
    ax.set_title(col_name)
    ax.hist(df[col_name], ec='white')
    ax.axvline(x=df[col_name].mean(), ls='--', color='k')
    sns.despine(ax=ax)

fig.tight_layout()

*Figure: histograms of each numerical feature, with a dashed line at the mean.*

1.3 Bivariate relationship between features and target variable

  • To understand the relationship between each individual feature and the target, we could plot a correlation chart. However, since the target is binary, I will use boxplots instead.
  • In the charts below, income, interest_rate and credit_score_at_application are the strongest predictors of whether the borrower will default.
fig2, axes2 = plt.subplots(1, len(NUMERICAL[:-1]), figsize=(16, 6))

for idx, col_name in enumerate(NUMERICAL[:-1]):
    sns.boxplot(
        df_full_train[[col_name, 'status']],
        x='status',
        y=col_name,
        showfliers=False,
        ax=axes2[idx])
    
    axes2[idx].set_xlabel('is_defaulted')
    sns.despine(ax=axes2[idx])

fig2.tight_layout()

*Figure: boxplots of each numerical feature split by default status (is_defaulted).*

2. Manual fitting

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

2.1 Preprocessing

2.1.1 Check for NaN/Nulls values

X_train.isnull().sum()

>>> gender                         0
    age                            0
    income                         0
    loan_amount                    0
    term                           0
    installment_amount             0
    interest_rate                  0
    credit_score_at_application    0
    dtype: int64

2.1.2 Categorical data

We will use one-hot encoding for gender. With drop='first', the encoder produces a single column (x0_M) where 0 represents Female and 1 represents Male.

gender_encoder = OneHotEncoder(drop='first')
gender_encoder.fit(X_train['gender'].values.reshape(-1, 1))
gender_encoder.categories_

>>> [array(['F', 'M'], dtype=object)]
# encode column
encoded_gender = gender_encoder.transform(X_train['gender'].values.reshape(-1, 1)).toarray()

# convert encoded columns into dataframe
encoded_gender_df = pd.DataFrame(
    encoded_gender,
    columns=gender_encoder.get_feature_names_out()
)

# merge original df with encoded df
X_train = pd.concat(
    [X_train, encoded_gender_df],
    axis=1,
)

# drop original column
X_train = X_train.drop(['gender'], axis=1)
X_train.head()
| index | age | income | loan_amount | term | installment_amount | interest_rate | credit_score_at_application | x0_M |
|---|---|---|---|---|---|---|---|---|
| 0 | 33 | 16685 | 55000 | 60 | 1223 | 0.12 | 850 | 1.0 |
| 1 | 27 | 15050 | 28000 | 24 | 1331 | 0.13 | 840 | 1.0 |
| 2 | 50 | 40203 | 62000 | 48 | 1345 | 0.02 | 1000 | 1.0 |
| 3 | 26 | 13754 | 39000 | 48 | 1066 | 0.14 | 810 | 1.0 |
| 4 | 30 | 11830 | 31000 | 60 | 705 | 0.13 | 810 | 0.0 |

2.1.3 Numerical data

For the base model, we keep numerical data intact.

2.2 Fitting a Logistic Regression model

log_reg = LogisticRegression(max_iter=1_000)
log_reg.fit(X_train, y_train)
>>> LogisticRegression(max_iter=1000)

2.3 Model coefficients

An advantage of Logistic Regression is that it is a white-box model: its coefficients are directly interpretable, making it easier for stakeholders to understand why the model makes a certain prediction.

We can write the full formula based on the fitted model’s intercept and coefficients

log_reg.intercept_
>>> array([1.22179855])
model_coefficients = dict(zip(
    list(log_reg.feature_names_in_),
    list(log_reg.coef_[0]))
    )

model_coefficients

>>> {'age': 0.057494705504614954,
     'income': -0.0032484557471085074,
     'loan_amount': -2.8753044215366256e-05,
     'term': 0.022543857502140222,
     'installment_amount': 0.0009745065842801654,
     'interest_rate': 0.6331692246041164,
     'credit_score_at_application': 0.0165512445820803,
     'x0_M': 7.976349551593769}
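Because the model is linear in the log-odds, exponentiating a coefficient gives the multiplicative change in the odds of default for a one-unit increase in that feature, holding the others fixed. A minimal sketch using the dictionary above:

# odds ratio per one-unit increase in each (unscaled) feature
odds_ratios = {name: np.exp(coef) for name, coef in model_coefficients.items()}
odds_ratios

Keep in mind that the features are unscaled, so a "one-unit increase" means something different for each column (one year of age vs. one unit of income).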

Hence the full formula is:

\[z = 1.222 + 0.057 \cdot age - 0.003 \cdot income - 0.00003 \cdot loan\_amount + 0.023 \cdot term + 0.001 \cdot installment\_amount + 0.633 \cdot interest\_rate + 0.017 \cdot credit\_score\_at\_application + 7.976 \cdot x0\_M\]

\[prob(default) = \frac{1}{1 + e^{-z}}\]

To make a sample prediction, we can apply the above formula as follows:

sample_data = {
    'age': 30,
    'income': 10_000,
    'loan_amount': 20_000,
    'term': 60,
    'installment_amount': 1_000,
    'interest_rate': 0.15,
    'credit_score_at_application': 800,
    'x0_M': 1,
}


def sigmoid(n):
    return 1 / (1 + np.e ** (-n))

z = 1.22179855 + (np.array(list(model_coefficients.values())) * np.array(list(sample_data.values()))).sum()

print('probability of default for sample data')
print(sigmoid(z))
>>> probability of default for sample data
>>> 0.0015414024478882305

To verify that the function works, we can compare this with the default probability returned by the model’s .predict_proba() method

log_reg.predict_proba(np.array(list(sample_data.values())).reshape(-1, 8))[:, 1][0]
>>> 0.0015414024462043266

2.4 Making predictions

Now that we understand how the model works, let’s see how it performs on test data.

In order to run the model on test data, we need to perform the same transformation steps above.

# encode column
encoded_gender = gender_encoder.transform(X_test['gender'].values.reshape(-1, 1)).toarray()

# convert encoded columns into dataframe
encoded_gender_df = pd.DataFrame(
    encoded_gender,
    columns=gender_encoder.get_feature_names_out()
)

# merge original df with encoded df
X_test = pd.concat(
    [X_test, encoded_gender_df],
    axis=1,
)

# drop original column
X_test = X_test.drop(['gender'], axis=1)
X_test.head()
| index | age | income | loan_amount | term | installment_amount | interest_rate | credit_score_at_application | x0_M |
|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 14163 | 28000 | 60 | 652 | 0.14 | 780 | 1.0 |
| 1 | 27 | 10769 | 33000 | 24 | 1584 | 0.14 | 790 | 0.0 |
| 2 | 28 | 10747 | 34000 | 48 | 929 | 0.14 | 790 | 0.0 |
| 3 | 22 | 14439 | 49000 | 24 | 2284 | 0.11 | 910 | 0.0 |
| 4 | 32 | 6400 | 17000 | 36 | 649 | 0.22 | 540 | 1.0 |
y_preds = log_reg.predict(X_test)
y_preds
>>> array([0, 0, 0, ..., 0, 0, 0])

2.5 Evaluate model

Accuracy, Recall, Precision, f1-score

def evaluate_model(y_true, y_pred):    
    result = {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'f1_score': f1_score(y_true, y_pred),
    }

    return result

evaluate_model(y_test, y_preds)
>>> {'accuracy': 0.9705,
>>> 'recall': 0.8035714285714286,
>>> 'precision': 0.8385093167701864,
>>> 'f1_score': 0.8206686930091185}
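To see where the recall and precision numbers come from, we can also look at the confusion matrix (a sketch; the counts depend on the split):

from sklearn.metrics import confusion_matrix

# rows: actual class (0 = paid up, 1 = default); columns: predicted class
confusion_matrix(y_test, y_preds)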

ROC curve

fpr, tpr, thres = roc_curve(y_test, log_reg.predict_proba(X_test)[:, 1])

fig, ax = plt.subplots()
ax.set_aspect(1)
ax.plot(fpr, tpr, label='logistic regression model')
ax.plot(np.linspace(0, 1, 1_000), np.linspace(0, 1, 1_000), ls='--', label='random guess')
ax.legend()
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')

*Figure: ROC curve of the logistic regression model compared with a random-guess baseline.*
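A common single-number summary of this curve is the area under it (AUC); a quick sketch:

from sklearn.metrics import roc_auc_score

# 1.0 = perfect ranking of defaults above non-defaults, 0.5 = no better than random guessing
roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])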

At this point we have a working model.

However, in practice we rarely build just one model. We will need to experiment with different transformations and tune hyperparameters.

As the number of transformations and parameters increases, we will need to keep track of more objects. This is cumbersome and prone to error. That’s why we will use a Pipeline.

3. Using Pipelines

We can build the same model as above using fewer lines of code thanks to Pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

3.1 Column transformers

# define the column transformer to encode gender column
column_transformer = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(drop='first'), ['gender'])
    ],
    remainder='passthrough'
)

3.2 Build and fit pipeline

# define the pipeline with column transformer and logistic regression
pipeline = Pipeline([
    ('preprocessor', column_transformer),
    ('classifier', LogisticRegression(max_iter=1_000))
])

# fit the pipeline to the training data
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

evaluate_model(y_test, preds)
>>> {'accuracy': 0.9705,
>>> 'recall': 0.8035714285714286,
>>> 'precision': 0.8385093167701864,
>>> 'f1_score': 0.8206686930091185}

Here we can verify that the pipeline produces the same results as the manual steps by comparing the predictions and metrics.
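For example, because both splits use random_state=1, the pipeline’s predictions should match the manual ones element for element (a sketch, assuming y_preds from section 2 is still in memory):

# sanity check: pipeline predictions vs. predictions from the manually transformed model
print(np.array_equal(preds, y_preds))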

3.3 Cross validation with Pipeline

  • Cross-validation is a technique used to evaluate the performance of a model. The data is split into k folds, and each fold is used for testing exactly once. For example, with 5-fold cross-validation, the first iteration may use fold 1 for testing and folds 2–5 for training; the second iteration uses fold 2 for testing and the remaining folds for training, and so on until every fold has been used for testing.

  • K-fold cross-validation can be especially useful when the dataset is small or imbalanced, as it allows for the model to be evaluated on all of the data, even if some observations have low representation in the dataset.

# Define the cross-validation strategy
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Perform cross-validation
acc_scores = cross_val_score(pipeline, X_train, y_train, cv=cv)

# Compute the mean and standard deviation of the scores
print("Accuracy: %0.2f (+/- %0.2f)" % (acc_scores.mean(), acc_scores.std() * 2))
>>> Accuracy: 0.97 (+/- 0.02)
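Because defaults make up only about 7% of the data, stratified folds and a recall-oriented metric can give a more informative picture than plain accuracy. A sketch (StratifiedKFold preserves the class ratio within each fold):

from sklearn.model_selection import StratifiedKFold

# keep the ~7% DEFAULT rate in every fold and score on recall instead of accuracy
strat_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
recall_scores = cross_val_score(pipeline, X_train, y_train, cv=strat_cv, scoring='recall')
print("Recall: %0.2f (+/- %0.2f)" % (recall_scores.mean(), recall_scores.std() * 2))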

3.4 Hyperparameter tuning

Hyperparameter tuning refers to the process of selecting the best hyperparameters for a machine learning model in order to achieve optimal performance. In the case of logistic regression classification, hyperparameters are values that are set before the learning process begins and affect the behavior of the model.

Here are some hyperparameters that we can tune for Logistic Regression:

  • Regularization strength: Regularization helps to prevent overfitting of the model by penalizing large coefficients. The regularization strength hyperparameter controls the amount of penalization applied.

  • Solver: The solver is the algorithm used to optimize the logistic regression objective function. There are several solvers available, such as ‘newton-cg’, ‘lbfgs’, and ‘liblinear’, each with different strengths and weaknesses.

  • Maximum number of iterations: This caps the number of iterations the solver is allowed to take while converging.

column_transformer = ColumnTransformer(
    transformers=[
        ('categorical_encoder', OneHotEncoder(drop='first'), ['gender']),
        ('numerical_scaler', StandardScaler(), NUMERICAL[:-1])

    ],
    remainder='passthrough'
)

my_pipeline = Pipeline([
    ('preprocessor', column_transformer),
    ('classifier', LogisticRegression(max_iter=10_000))
])


# Define the hyperparameters to tune
parameters = {
    'preprocessor__numerical_scaler__with_mean': [True, False],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__C': [0.1, 1, 10],
    'classifier__solver': ['liblinear', 'saga']
}


# Create a grid search object
grid_search = GridSearchCV(my_pipeline, parameters, cv=5, verbose=1)


# Fit the grid search object on the data
grid_search.fit(X_train, y_train)
print("Best parameters: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

>>> Best parameters:  {'classifier__C': 1, 'classifier__penalty': 'l1', 'classifier__solver': 'saga', 'preprocessor__numerical_scaler__with_mean': True}
>>> Best cross-validation score:  0.9728653846153847

3.5 Retrain pipeline on the best params

Training the model on the best params gives us improved metrics.

column_transformer = ColumnTransformer(
    transformers=[
        ('categorical_encoder', OneHotEncoder(drop='first'), ['gender']),
        ('numerical_scaler', StandardScaler(with_mean=True), NUMERICAL[:-1])

    ],
    remainder='passthrough'
)

best_pipeline = Pipeline([
    ('preprocessor', column_transformer),
    ('classifier', LogisticRegression(penalty='l1', C=1, max_iter=6_000, solver='saga'))
])


# fit the pipeline to the training data
best_pipeline.fit(X_train, y_train)

preds = best_pipeline.predict(X_test)

evaluate_model(y_test, preds)

>>> {'accuracy': 0.971,
>>> 'recall': 0.8214285714285714,
>>> 'precision': 0.8313253012048193,
>>> 'f1_score': 0.8263473053892216}
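Note that GridSearchCV refits the best configuration on the full training set by default (refit=True), so an essentially equivalent shortcut is to reuse the estimator it already holds:

# grid_search.best_estimator_ is the pipeline refit with the best hyperparameters
preds_gs = grid_search.best_estimator_.predict(X_test)
evaluate_model(y_test, preds_gs)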

4. Predict model on LIVE loans data

df_live = df_live.drop(['account_no', 'status'], axis=1)
df_live = df_live.reset_index(drop=True)
df_live['predicted_default'] = best_pipeline.predict(df_live.drop(['outstanding_balance'], axis=1))
df_live['default_prob'] = best_pipeline.predict_proba(df_live.drop(['outstanding_balance'], axis=1))[:, 1].round(3)
df_live.head()
| index | gender | age | income | loan_amount | term | installment_amount | interest_rate | credit_score_at_application | outstanding_balance | predicted_default | default_prob |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 23 | 5107 | 22000 | 36 | 818 | 0.20 | 580 | 11314 | 0 | 0.043 |
| 1 | F | 40 | 15659 | 33000 | 48 | 853 | 0.11 | 880 | 5637 | 0 | 0.000 |
| 2 | M | 25 | 15660 | 15000 | 48 | 395 | 0.12 | 860 | 6039 | 0 | 0.000 |
| 3 | F | 30 | 4208 | 15000 | 48 | 481 | 0.23 | 530 | 2817 | 0 | 0.297 |
| 4 | F | 18 | 6535 | 12000 | 48 | 346 | 0.17 | 660 | 3120 | 0 | 0.001 |
expected_book_value = (df_live['outstanding_balance'] * (1 - df_live['default_prob'])).sum()
print(f"total outstanding balance =\t{df_live['outstanding_balance'].sum():,.2f}")
print(f"expected recoverable value =\t{expected_book_value:,.2f}")

>>> total outstanding balance =	24,622,111.00
>>> expected recoverable value = 23,809,056.25

5. Summary

What we did

  • We built a Logistic Regression model to classify loan defaults.
  • We chose Logistic Regression because it is a white-box model with clear coefficients, making it easier to explain to business stakeholders.
  • We used sklearn’s Pipeline to make cross-validation and hyperparameter tuning easy.

Improvement ideas

  • For the sake of simplicity I have not factored in the time value of money, but this can be done by discounting each payment according to its expected due date.
  • By default, pipeline.predict() uses a decision threshold of 0.5, meaning that a loan is classified as a default if pipeline.predict_proba() is above 0.5.
    • At a recall score of 0.82 we expect about 8 out of 10 defaults to be identified.
    • We can improve recall by lowering the decision threshold, at the cost of more false positives (non-defaults classified as defaults); see the sketch after this list.
    • Depending on the business objective, we can fine-tune the threshold level. If we want to be conservative with our estimate, then a higher recall might be preferable.
  • We can also explore ensemble methods such as XGBoost, which tend to work well with tabular data.
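As a sketch of the threshold idea (the threshold value here is purely illustrative and would be chosen based on the business cost of missed defaults):

# classify as default whenever the predicted probability exceeds a lower threshold
threshold = 0.3  # hypothetical value, not tuned
probs = best_pipeline.predict_proba(X_test)[:, 1]
preds_low_threshold = (probs >= threshold).astype(int)
evaluate_model(y_test, preds_low_threshold)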