In this series I will be modeling the resale prices of Singapore’s HDB flats. The first part covers building a model, following a standard ML workflow (EDA, feature engineering, train/test split, model fitting, evaluation). The second part (coming soon™) will cover MLOps: I will use mlflow to track and manage experiments and models.

Part 1 (This post)

  • Train different models to predict resale price for Singapore’s HDB

Part 2 (Coming soon™)

  • Run & Log experiments and models using mlflow
  • Save models to model registry
  • Load and serve the best model
What are HDB flats?
  • HDB flats are public housing blocks in Singapore, built and managed by the Housing and Development Board (HDB), a statutory board under the Ministry of National Development.
  • HDB flats range from studio apartments to executive apartments, and are available for purchase or rent. Today 80% of Singapore's population live in HDB flats.
  • HDB flats are typically located in housing estates, which are self-contained communities with amenities such as schools, markets, and parks. The HDB also manages and maintains the estates, ensuring that they remain safe, clean and well-maintained.

0. Import and read data

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import geopandas as gpd
from shapely.geometry import Point
import folium

from math import radians, cos, sin, asin, sqrt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import LinearSVR

plt.style.use('seaborn-v0_8-white')
df = pd.read_csv('./data/intermediate/intermediate-data.csv')
df = df.drop('Unnamed: 0', axis=1)
df['town'] = df['town'].replace({'KALLANG/WHAMPOA': 'KALLANG'})

df.shape
(133473, 16)

1. Exploratory Data Analysis

Preview dataset

df.head().T
|                     | 0 | 1 | 2 | 3 | 4 |
|---------------------|---|---|---|---|---|
| month               | 2017-01 | 2017-01 | 2017-01 | 2017-01 | 2017-01 |
| town                | ANG MO KIO | ANG MO KIO | ANG MO KIO | ANG MO KIO | ANG MO KIO |
| flat_type           | 2 ROOM | 3 ROOM | 3 ROOM | 3 ROOM | 3 ROOM |
| block               | 406 | 108 | 602 | 465 | 601 |
| street_name_x       | ANG MO KIO AVE 10 | ANG MO KIO AVE 4 | ANG MO KIO AVE 5 | ANG MO KIO AVE 10 | ANG MO KIO AVE 5 |
| storey_range        | 10 TO 12 | 01 TO 03 | 01 TO 03 | 04 TO 06 | 01 TO 03 |
| floor_area_sqm      | 44.0 | 67.0 | 67.0 | 68.0 | 67.0 |
| flat_model          | Improved | New Generation | New Generation | New Generation | New Generation |
| lease_commence_date | 1979 | 1978 | 1980 | 1980 | 1980 |
| remaining_lease     | 61 years 04 months | 60 years 07 months | 62 years 05 months | 62 years 01 month | 62 years 05 months |
| resale_price        | 232000.0 | 250000.0 | 262000.0 | 265000.0 | 265000.0 |
| postal              | 560406.0 | 560108.0 | 560602.0 | 560465.0 | 560601.0 |
| latitude            | 1.362005 | 1.370966 | 1.380709 | 1.366201 | 1.381041 |
| longitude           | 103.85388 | 103.838202 | 103.835368 | 103.857201 | 103.835132 |
| building            | HDB-ANG MO KIO | KEBUN BARU HEIGHTS | YIO CHU KANG GREEN | TECK GHEE HORIZON | YIO CHU KANG GREEN |
| address             | 406 ANG MO KIO AVENUE 10 HDB-ANG MO KIO SINGAP... | 108 ANG MO KIO AVENUE 4 KEBUN BARU HEIGHTS SIN... | 602 ANG MO KIO AVENUE 5 YIO CHU KANG GREEN SIN... | 465 ANG MO KIO AVENUE 10 TECK GHEE HORIZON SIN... | 601 ANG MO KIO AVENUE 5 YIO CHU KANG GREEN SIN... |

Checking for null/NaN data. In this case the percentage of rows with missing data is low, so it’s safe to drop those rows.

df.isnull().mean()
month                  0.000000
town                   0.000000
flat_type              0.000000
block                  0.000000
street_name_x          0.000000
storey_range           0.000000
floor_area_sqm         0.000000
flat_model             0.000000
lease_commence_date    0.000000
remaining_lease        0.000000
resale_price           0.000000
postal                 0.018513
latitude               0.018513
longitude              0.018513
building               0.018513
address                0.018513
dtype: float64
df = df.drop(df[df['latitude'].isnull()].index)

Now let’s look at the overall distribution of resale price, henceforth also referred to as the target variable.

The distribution is right-skewed, with a few very high values. The median (SGD 440k) is lower than the mean (SGD 470k).

fig, axes = plt.subplots(2, 1, figsize=(11, 5), sharex=True, height_ratios=[1, 6])

df['resale_price'].hist(
    ec='white',
    linewidth=.5,
    bins=[_ * 50_000 for _ in range(30)],
    grid=False,
    ax=axes[1])

axes[1].axvline(df['resale_price'].mean(), ls='--', c='red', label=f"avg = SGD {df['resale_price'].mean():,.0f}")
axes[1].axvline(df['resale_price'].median(), ls='--', c='k', label=f"median = SGD {df['resale_price'].median():,.0f}")
axes[1].set_xticks([_ * 100_000 for _ in range(14)], labels=[_ * 100 for _ in range(14)])
axes[1].set_xlabel('Unit: thousands SGD')
axes[1].legend(fontsize=14)
axes[0].axis('off')
sns.despine(ax=axes[1])

sns.boxplot(df['resale_price'], orient="h", showfliers=False, ax=axes[0])
sns.despine(ax=axes[0], left=True, bottom=True)
fig.suptitle('HDB Resale Price Distribution')
fig.tight_layout()

[Figure: HDB Resale Price Distribution]

Numerical variables

The correlation between each numeric variable and the target gives us a good sense of its predictive power.

Unsurprisingly, bigger and newer flats are positively correlated with resale price.

fig, ax = plt.subplots()
df.drop('resale_price', axis=1) \
    .corrwith(df['resale_price'], numeric_only=True) \
    .plot(kind='barh', ax=ax)
ax.axvline(x=0, c='k')
ax.set_title('Correlation with resale price')
sns.despine(ax=ax)

[Figure: Correlation with resale price]

We can also study how much feature variables correlate with each other. If we solely care about having a good prediction, then it’s ok to have correlated features! However, if we want to study the impact of each variable, it’s helpful to remove redundant dimensions.

In the heatmap below, latitude and postal are highly correlated, so we will drop at least one of them before fitting the models. For now, however, we will keep both to create a more useful feature.

corr_matrix = df.corr(numeric_only=True).round(3)

# remove the top triangle of the matrix to simplify the heatmap
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True,
    cmap=sns.color_palette("vlag_r", as_cmap=True),
    ax=ax)

[Figure: correlation heatmap of numeric variables]

Categorical variables

block, building, and address contain too many categories each, which could cause overfitting or computational issues, so we will not use those variables.

Perhaps the most useful feature here is flat_type. The possible values are: 2 ROOM, 3 ROOM, 4 ROOM, 5 ROOM, Executive, Multi-generation. There are a few ways to deal with this data. For simplicity, I will convert it to a number of bedrooms, giving us a clean numeric value.

df.select_dtypes('object').nunique()
month                68
town                 26
flat_type             6
block              2580
street_name_x       556
storey_range         17
flat_model           21
remaining_lease     653
building            612
address            8896
dtype: int64
fig, ax = plt.subplots(figsize=(9, 4))

sns.boxplot(
    data=df,
    x='resale_price',
    y='flat_model',
    palette="Reds",
    showfliers=False,
    ax=ax).set_title('`flat_model` has too many categories')

[Figure: `flat_model` has too many categories]

fig, ax = plt.subplots(figsize=(9, 4))

sns.boxplot(
    data=df,
    x='resale_price',
    y='flat_type',
    palette="Reds",
    showfliers=False,
    ax=ax).set_title('`flat_type` has few categories with little overlap, making it a good feature')

[Figure: `flat_type` has few categories with little overlap]

2. Feature engineering

Intuitively, we know that location is one of the most important factors in determining house prices, so we need a way to add this information to our model. To visualize the impact of location, I’ve created a choropleth map.

The map shows that towns further from the center have lower price ranges.

[Interactive choropleth map of HDB resale prices by town]
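
The map code is collapsed in the original notebook; below is a minimal sketch of how such a map can be built with folium. The data/towns.geojson boundary file and its town property key are hypothetical stand-ins for illustration, not the original code.

# minimal folium choropleth sketch; 'data/towns.geojson' and the
# 'feature.properties.town' key are hypothetical placeholders
town_medians = df.groupby('town')['resale_price'].median().reset_index()

m = folium.Map(location=[1.3521, 103.8198], zoom_start=11)
folium.Choropleth(
    geo_data='data/towns.geojson',        # assumed town boundary file
    data=town_medians,
    columns=['town', 'resale_price'],
    key_on='feature.properties.town',
    fill_color='YlOrRd',
    legend_name='Median resale price (SGD)',
).add_to(m)
m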


To determine how central a location is, I will calculate the straight-line distance to a center point. For Singapore, I’ve selected the point CITY_CENTER = (1.28019, 103.85175), picked by eyeballing Google Maps.

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    
    # constant used to convert to km
    r = 6371
    
    return c * r


# city-center coordinates (latitude, longitude) from the text above
CITY_CENTER = (1.28019, 103.85175)

df['distance_from_center'] = df.apply(
    lambda x: haversine(
        x['longitude'],
        x['latitude'],
        CITY_CENTER[1],
        CITY_CENTER[0]), axis=1)

Distance to the nearest MRT station

Besides distance to the city center, close proximity to amenities and public transport can also make flats more attractive and hence more expensive. I will now calculate the distance from each HDB block to its nearest MRT station.

To make the computation faster, I (1) find the nearest station using the Pythagorean theorem on the raw coordinates, then (2) use the haversine formula above to compute the actual distance in kilometers.

# mrt_map is a GeoDataFrame of MRT station locations (Point geometries)

# longitude is the x-axis, latitude is the y-axis
hdb_ll = df[['longitude', 'latitude']].values

# find the nearest MRT station for each HDB block
x_delta = mrt_map.geometry.x.values.reshape(-1, 1) - hdb_ll[:, 0]
y_delta = mrt_map.geometry.y.values.reshape(-1, 1) - hdb_ll[:, 1]
delta = (x_delta ** 2 + y_delta ** 2) ** .5

df['nearest_mrt_id'] = delta.argmin(axis=0)
df['nearest_mrt_location'] = df['nearest_mrt_id'].map(pd.Series(mrt_map.geometry.values, index=mrt_map.index).to_dict())
df['nearest_mrt_longitude'] = df['nearest_mrt_location'].apply(lambda val: val.x)
df['nearest_mrt_latitude'] = df['nearest_mrt_location'].apply(lambda val: val.y)

df['distance_to_mrt'] = df.apply(lambda x:
    haversine(
        x['longitude'],
        x['latitude'],
        x['nearest_mrt_longitude'],
        x['nearest_mrt_latitude']),
        axis=1)

df = df.drop(['nearest_mrt_longitude', 'nearest_mrt_latitude', 'nearest_mrt_id'], axis=1, errors='ignore')

Other features

Here are the remaining features to process:

  • Convert the remaining lease duration from year-month format to months
  • Create a new num_bedrooms feature from flat_type (explained above)
  • Average the storey number, since the original dataset lists a range rather than an exact floor
  • Drop rows with rare flat types
  • Drop highly correlated features
  • Use boxplots to determine appropriate feature ranges, and optionally drop outliers
# convert `remaining_lease` from "NN years MM months" to total months;
# the regex handles singular "month"/"year" and values with no month part
lease_parts = df['remaining_lease'].str.extract(r'(\d+)\s*years?(?:\s*(\d+)\s*months?)?')
df['remaining_lease_months'] = lease_parts[0].astype(int) * 12 + lease_parts[1].fillna(0).astype(int)


# convert flat type to number of rooms
# https://www.hdb.gov.sg/residential/buying-a-flat/finding-a-flat/types-of-flats
df['num_bedrooms'] = df['flat_type'].map({
    '1 ROOM': 0.5,
    '2 ROOM': 1,
    '3 ROOM': 2,
    '4 ROOM': 3,
    '5 ROOM': 4,
    'EXECUTIVE': 5,
    'MULTI-GENERATION': 6,
})

df = df.drop(df[df['num_bedrooms'].isnull()].index)

# convert storey range to mean of range
df['storey_range_feature'] = (df['storey_range'].str[:2].astype(int) + df['storey_range'].str[-2:].astype(int)) / 2

# drop highly correlated features
df = df.drop(['lease_commence_date', 'postal'], axis=1, errors='ignore')


numerical_feature_cols = [
    'num_bedrooms',
    'floor_area_sqm',
    'storey_range_feature',
    'remaining_lease_months',
    'distance_to_mrt',
    'distance_from_center',
    ]


fig, axes = plt.subplots(1, len(numerical_feature_cols), figsize=(16, 6))

for idx, col_name in enumerate(numerical_feature_cols):
    ax = axes[idx]
    sns.boxplot(df[col_name], ax=ax)
    sns.despine(ax=ax)
    ax.set_title(col_name.replace('_', '\n'), loc='left')

fig.tight_layout()

[Figure: boxplots of the numerical features]

# remove outliers
df = df[df['floor_area_sqm'] <= df['floor_area_sqm'].quantile(.95)]
df = df.dropna()
df = df.reset_index(drop=True)


# correlation between numeric variables and target
fig, ax = plt.subplots()
df.drop('resale_price', axis=1) \
    .corrwith(df['resale_price'], numeric_only=True) \
    .sort_values(ascending=True) \
    .plot(kind='barh', ax=ax)
ax.axvline(x=0, c='k')
ax.set_title('Correlation with resale price')
sns.despine(ax=ax)

[Figure: correlation with resale price, including engineered features]

3. Split train-test data

X = df[numerical_feature_cols]
y = df['resale_price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

4. Fit models

4.1 Linear Regression (baseline)

I will first fit a simple estimator to use as a baseline.

lr_model = LinearRegression()
print(f'fitting {lr_model.__class__.__name__}')
lr_model.fit(X_train, y_train)
preds = lr_model.predict(X_test)
rmse = mean_squared_error(y_test, preds, squared=False)
print(f'{rmse=:,.0f}')
fitting LinearRegression
rmse=78,280

4.2 Random Forest, Gradient Boosting, Linear SVR

rf_model = RandomForestRegressor(max_depth=20, n_estimators=100)
print(f'fitting {rf_model.__class__.__name__}')
rf_model.fit(X_train, y_train)
preds = rf_model.predict(X_test)
rmse = mean_squared_error(y_test, preds, squared=False)
print(f'{rmse=:,.0f}')
fitting RandomForestRegressor
rmse=33,531
gb_model = GradientBoostingRegressor()
print(f'fitting {gb_model.__class__.__name__}')
gb_model.fit(X_train, y_train)
preds = gb_model.predict(X_test)
rmse = mean_squared_error(y_test, preds, squared=False)
print(f'{rmse=:,.0f}')
fitting GradientBoostingRegressor
rmse=60,247
svr_model = LinearSVR()
print(f'fitting {svr_model.__class__.__name__}')
svr_model.fit(X_train, y_train)
preds = svr_model.predict(X_test)
rmse = mean_squared_error(y_test, preds, squared=False)
print(f'{rmse=:,.0f}')
fitting LinearSVR
rmse=81,687

Making predictions on sample data

To get a more concrete sense of each model’s performance, we can select a few random test samples and compare each model’s predictions against the ground truth.

# select random indices
idx = np.random.choice(X_test.index, size=3)

sample_predictions = {}

for model in [lr_model, rf_model, gb_model, svr_model]:
    sample_predictions[model.__class__.__name__] = model.predict(X_test.loc[idx])

sample_predictions['Ground Truth'] = y_test.loc[idx]
sample_predictions = pd.DataFrame(sample_predictions)

sample_predictions.plot(
    kind='bar',
    rot=0,
    xlabel='flat id',
    ylabel='SGD') \
    .legend(
        loc='center left',
        bbox_to_anchor=(1.0, 0.5))

[Figure: model predictions vs ground truth on three sample flats]

Out of the models above, RandomForest performs the best, with an RMSE of about SGD 33.5k. In real terms, for a property that costs on average close to SGD 500k, being off by 33k (roughly 7%) is decent.

However, there is a lot more we can do here:

  • Trying different models
  • Optimizing hyperparameters, for example with GridSearch or Hyperopt, as sketched below
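
As an illustration, here is a minimal sketch of a grid search over the RandomForest using scikit-learn’s GridSearchCV; the parameter grid is made up for demonstration, not a tuned search space.

from sklearn.model_selection import GridSearchCV

# illustrative grid, not a tuned search space
param_grid = {
    'max_depth': [10, 20, 30],
    'n_estimators': [100, 200],
}

search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
    n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print(f'cv rmse={-search.best_score_:,.0f}')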

Without mlflow, we would have to manually keep track of model parameters, data sources, metrics, etc. (for example in a Google Sheet), which is error-prone and hard to maintain.
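
As a small preview of part 2, here is a rough sketch of what logging a single run with mlflow could look like (the parameter values mirror the RandomForest above; the actual setup will be covered in part 2):

import mlflow
import mlflow.sklearn

# rough sketch: log parameters, the test metric, and the fitted model
with mlflow.start_run():
    mlflow.log_params({'max_depth': 20, 'n_estimators': 100})
    mlflow.log_metric('rmse', rmse)
    mlflow.sklearn.log_model(rf_model, 'model')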

5. Summary (so far)

I created basic HDB resale price prediction models, adding two features for centrality and proximity to MRT stations on top of existing features such as floor area, flat type, and storey range.

RandomForest outperforms the other models here, thanks to its committee of many relatively uncorrelated trees. Further improvement to predictions can be achieved by adding more relevant features, such as proximity to good schools, shopping malls, and other amenities.
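
As a quick sanity check on which features drive the predictions, we can plot the fitted RandomForest’s impurity-based feature importances (a sketch; impurity-based importances can favor high-cardinality features, so read them with care):

# sketch: impurity-based feature importances of the fitted RandomForest
importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh', title='RandomForest feature importances')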

House prices could also be predicted using a Time Series model like ARIMA, among other potential approaches.