0. Introduction

  • The Vietnamese high school graduation exam, also known as the National High School Exam, is a standardized test taken by high school students in Vietnam. It is used to determine whether students have achieved a sufficient level of knowledge and skills in order to graduate from high school and is considered a crucial factor in their future academic and career prospects. The test covers subjects such as mathematics, literature, physics, chemistry, biology, history, and geography.

  • The scores are public, and is semi-anonymised. From this dataset, I will:

    1. Identify general trends across the subjects
    2. Predict missing scores
    3. Answer whether it’s fair to give bonus scores to students in less developed areas
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

import xgboost as xg
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
pd.set_option('display.max_rows', 100)
sns.set_theme(style='white')

df = pd.read_csv('data/processed/yr2019.csv') # main score dataset
gdp = pd.read_csv('data/processed/grdp_per_cap.csv') # to find correlation between income level and academic performance

df.head()
candidate_ID literature mathematics fl_code foreign_language physics chemistry biology history geography civics_study province
0 1000029 5.00 7.2 N1 6.8 6.0 4.5 4.00 NaN NaN NaN Ha Noi
1 1000030 6.25 6.2 N1 8.0 5.0 4.5 3.75 NaN NaN NaN Ha Noi
2 1000031 5.75 6.8 N1 7.2 NaN NaN NaN 4.25 5.5 7.0 Ha Noi
3 1000032 4.50 5.8 N1 9.0 NaN NaN NaN 7.00 6.5 7.0 Ha Noi
4 1000033 5.50 7.0 N1 3.6 NaN NaN NaN 5.25 7.5 8.5 Ha Noi
subjects = [
 'literature',
 'mathematics',
 'foreign_language',
 
 'physics',
 'chemistry',
 'biology',
 
 'history',
 'geography',
 'civics_study',]

1. Overview of subjects and scores

df_long = pd.melt(
    df,
    id_vars=['candidate_ID', 'province'],
    value_vars=subjects,
    var_name='subject',
    value_name='score'
).dropna()
df_long.head()
candidate_ID province subject score
0 1000029 Ha Noi literature 5.00
1 1000030 Ha Noi literature 6.25
2 1000031 Ha Noi literature 5.75
3 1000032 Ha Noi literature 4.50
4 1000033 Ha Noi literature 5.50
fig1, axes1 = plt.subplots(1, 2, figsize=(12, 6), sharey=True)

# count of students
subject_count = df[subjects].notnull().sum()

sns.barplot(
   x=subject_count.values,
   y=subject_count.index,
   palette="vlag",
   ax=axes1[0],
)

# distribution of scores per subject
sns.boxplot(
    x=df_long['score'],
    y=df_long['subject'],
    showfliers=False,
    palette="vlag",
    ax=axes1[1],
    showmeans=True,
    meanprops={"marker":"o",
                       "markerfacecolor":"white", 
                       "markeredgecolor":"black",
                      "markersize":"8"})
                      
for ax in axes1:
    sns.despine(ax=ax, left=True, bottom=True)

    for val in range(len(subjects)):
        ax.axhline(y=val + .5, color='grey', ls='--', alpha=.5)

axes1[0].set_title('Number of students signing up for each subject', loc='left', fontweight="bold")
axes1[0].set_xlabel('Number of students')
axes1[1].set_title('Distribution of scores obtained\nCircles show averages', loc='left', fontweight="bold")
axes1[1].set_ylabel(None)
axes1[1].set_xlabel('Score out of 10')

fig1.tight_layout()

png

Subject choices:

  • The compulsory subjects are Maths and Literature, and for most students, a Foreign Language. Among these subjects, students struggle the most with Foreign Language where the median student scored less than half of total points.
  • The Natural Science subjects (Physics, Chemistry, Biology) are less popular than Social Sciences (History, Geography, Civics Study)

Scores:

  • Maths scores spread out the most. This does a good job of ranking students. Furthermore, Maths’ means and median are very close. Nice!
  • Civics study has the highest average score. On one hand, it’s hard to ask difficult questions on this subject as it mostly concern with how one can be a ‘good’ citizen. On the other hand, it might not be a good subject to evaluate students. I would recommend a simple Pass/Fail score for this subject.

2. Finding correlation between subjects and Predicting missing scores

  • Generally, we can assume that a student who do well in a subject (such as Maths) might likely also do well in other (eg. Physics or Chemistry). As the skills required (logical thinking, being hardworking) are similar.
  • We can use this knowledge to make predictions. For example: predict Maths scores based on other subjects.

  • Beyond simple predictions, we can find more subtle patterns. In extreme cases, we might even identify cheating students if their scores deviate too much from expected. However, this is a serious topic that require additional data (eg. students’ and their schools’ performance across time, etc) so we will not try to do so here based solely on a snapshot of limited data.

Correlation matrix

scores = df[subjects]
scores.head()
literature mathematics foreign_language physics chemistry biology history geography civics_study
0 5.00 7.2 6.8 6.0 4.5 4.00 NaN NaN NaN
1 6.25 6.2 8.0 5.0 4.5 3.75 NaN NaN NaN
2 5.75 6.8 7.2 NaN NaN NaN 4.25 5.5 7.0
3 4.50 5.8 9.0 NaN NaN NaN 7.00 6.5 7.0
4 5.50 7.0 3.6 NaN NaN NaN 5.25 7.5 8.5
corr_matrix = scores.corr()
corr_matrix.round(2)
literature mathematics foreign_language physics chemistry biology history geography civics_study
literature 1.00 0.48 0.41 0.24 0.25 0.27 0.45 0.49 0.46
mathematics 0.48 1.00 0.57 0.67 0.62 0.38 0.41 0.48 0.44
foreign_language 0.41 0.57 1.00 0.39 0.19 0.25 0.33 0.34 0.32
physics 0.24 0.67 0.39 1.00 0.52 0.20 0.18 0.22 0.12
chemistry 0.25 0.62 0.19 0.52 1.00 0.45 0.17 0.19 0.10
biology 0.27 0.38 0.25 0.20 0.45 1.00 0.26 0.30 0.23
history 0.45 0.41 0.33 0.18 0.17 0.26 1.00 0.60 0.50
geography 0.49 0.48 0.34 0.22 0.19 0.30 0.60 1.00 0.58
civics_study 0.46 0.44 0.32 0.12 0.10 0.23 0.50 0.58 1.00
# remove the top triangle of the matrix to simplify the heatmap
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# plot
fig2, axes2 = plt.subplots(figsize=(8, 6))

sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True,
    linewidth=.5,
    cmap = sns.color_palette("vlag_r", as_cmap=True),
    ax=axes2
    )

axes2.set_title('Score correlation between subjects:\nHigh correlations among Maths-Physics-Chemistry scores', loc='left', fontweight="bold")
Text(0.0, 1.0, 'Score correlation between subjects:\nHigh correlations among Maths-Physics-Chemistry scores')

png

Predicting Math score based on other subjects

We will use xgboost regression as it works well with tabular data

scores_nonnull = scores.dropna(subset='mathematics')

X = scores_nonnull.drop('mathematics', axis=1)
y = scores_nonnull['mathematics']


# Splitting
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 123)
  

# Instantiation
xgb_r = xg.XGBRegressor(objective ='reg:squarederror', n_estimators = 10, seed = 123)
  

# Fitting the model
xgb_r.fit(train_X, train_y)

 
# Predict the model
pred = xgb_r.predict(test_X)

# RMSE Computation
rmse = np.sqrt(MSE(test_y, pred))
print(f'RMSE : {rmse:.3f}')

RMSE : 0.998

Using the fitted model, we can predict Math scores for those who didn’t take the test.

# predicting scores of students who dropped maths
missed_maths = df.loc[df['mathematics'].isnull()][subjects].drop(['mathematics'], axis=1)

missed_maths_pred = pd.DataFrame({'predicted': xgb_r.predict(missed_maths)})

missed_maths_pred.head()
predicted
0 6.494553
1 5.275107
2 3.323812
3 5.175710
4 2.715799
actual_mean = scores_nonnull['mathematics'].mean()
dropped_mean = missed_maths_pred['predicted'].mean()
fig3, axes3 = plt.subplots(figsize=(8,6))

sns.kdeplot(
   data=missed_maths_pred, x="predicted",
   fill=True,
   alpha=.5, linewidth=0,
   label='predicted scores for\nstudents who dropped maths',
   ax=axes3
)

sns.kdeplot(
   x=scores_nonnull['mathematics'],
   fill=True,
   alpha=.5, linewidth=0,
   label='actual scores',
   ax=axes3
)

sns.despine(ax=axes3, left=True, bottom=True)
axes3.set_title('Students who dropped Maths are predicted to score lower\n than those who took the test. Most notably some having scores <=1', loc='left', fontweight="bold")
axes3.set_xlabel('Score')

axes3.axvline(x=dropped_mean, ls='--', color=sns.color_palette().as_hex()[0])
axes3.text(x=dropped_mean, y=0.4, ha='right', s=f'dropped out avg: {dropped_mean:.2f} ', color=sns.color_palette().as_hex()[0])
axes3.axvline(x=actual_mean, ls='--', color=sns.color_palette().as_hex()[1])
axes3.text(x=actual_mean, y=0.4, ha='left', s=f' actual avg: {actual_mean:.2f} ', color=sns.color_palette().as_hex()[1])
axes3.set_ylim(top=.45)

axes3.legend(loc='center left', bbox_to_anchor=(1, 0.5))
<matplotlib.legend.Legend at 0x17fb30ee0>

png

3. Is it fair to give bonus points for less developed regions?

  • Students from less developed regions are given bonus scores. Reason is that they can devote less time to study compare to their peers in wealthier regions. A bonus score gives them with better opportunity for a good univerity education and improve their prospects.
  • This practice is only fair if students from poorer areas indeed perform worse.
  • We can validate this by comparing score distributions between 2 regions (similar to an A/B test). In this example I compare between Ha Noi (~5,000 USD per cap) and Nghe An (~1,500 USD per cap)
# avg score per province
prov = df.pivot_table(
    index='province',
    values=subjects,
    aggfunc=np.mean
).reset_index()

prov = prov.merge(gdp, how='left', left_on='province', right_on='provinces')
prov.head().round(2)
province biology chemistry civics_study foreign_language geography history literature mathematics physics provinces grdp (million usd) population grdp/cap
0 An Giang 5.08 5.49 7.92 4.71 6.44 4.84 5.92 5.89 5.63 An Giang 3226.8 1908352 1690.88
1 Ba Ria Vung Tau 4.60 5.34 7.66 5.11 6.05 4.30 5.34 6.11 5.66 Ba Ria Vung Tau 6496.1 1148313 5657.08
2 Bac Giang 4.84 5.62 7.57 4.07 6.17 4.43 5.53 5.45 5.88 Bac Giang 3772.7 1803950 2091.36
3 Bac Kan 4.92 5.22 7.52 3.70 6.10 4.54 5.24 4.39 5.17 Bac Kan 427.2 313905 1360.92
4 Bac Lieu 5.22 5.48 7.86 4.33 6.33 4.77 6.02 5.86 5.63 Bac Lieu 1638.2 907236 1805.70
fig4, axes4 = plt.subplots(figsize=(8, 6))

sns.scatterplot(
    data=prov,
    x='grdp/cap',
    y='mathematics',
    size='population',
    legend=False,
    color='#3c76aa',
    alpha=.75,
    ax=axes4
)

axes4.axvspan(1000, 2500, color='#fcba03', alpha=.25, zorder=-1, ec='None')


axes4.set_title('Strong positive correlation between GDP per cap and Maths score\nup until 2,500 USD/person level', loc='left', fontweight="bold")

sns.despine(ax=axes4, left=True, bottom=True)

png

  • From 0 to around 2,500 USD/cap, we see a very strong positive relation between GRDP and Maths score. But above 2,500 USD/cap the trend is flattende.
  • This implies that after a certain standard of living is achived, score is less impacted by wealth. At this point, basic education costs (books/tuitions) are covered. And other factors impact more.
print(f"Ha Noi avg Maths: {df[(df['province']=='Ha Noi')]['mathematics'].mean():.2f}")
print(f"Nghe An avg Maths: {df[(df['province']=='Nghe An')]['mathematics'].mean():.2f}")
Ha Noi avg Maths: 6.03
Nghe An avg Maths: 5.41

Hypothesis testing of 2 population means

  • Null hypothesis: 2 provinces have identical averages
  • Alternative hypothesis: 2 provinces have different averages
hanoi_math = df[(df['province']=='Ha Noi')]['mathematics'].dropna()
nghean_math = df[(df['province']=='Nghe An')]['mathematics'].dropna()

stats.ttest_ind(
    a=hanoi_math,
    b=nghean_math,
    alternative='greater',
    )
Ttest_indResult(statistic=50.43617963341244, pvalue=0.0)

Since p-value < 0.05, we reject the null hypothesis. Our sample data favor that students in Ha Noi indeed perform better than Nghe An.

At the moment, it still make sense to support students from less developed regions. However, as shown in the chart, upon reaching certain GRDP level, students’ scores stop improving. Therefore, in order to ensure fairness, it will require policymaker to monitor the pace of development and make adjustments accordingly.

References

  1. Bonus scores based on region. https://huongnghiep.hocmai.vn/diem-uu-tien-la-gi-tat-tan-tat-ve-cach-tinh-diem-uu-tien-xet-tuyen-dai-hoc/