In the first two parts of this series, we scraped Toyota Corolla adverts from Gumtree and converted categorical features into numeric types. Before we fit our data to a linear model, we'll look at whether any of the features need transforming to improve their linear relationship with the target (price).
As an overview, we will split the data into train/validation/test sets, fit a baseline linear model on the vehicles' age and mileage, use residual plots to decide on transformations for those two features, refit the model with the transformed features plus the remaining features, and finally evaluate the model on the test set.
First, read in the csv file we exported in Part 2 of this series. Using the train_test_split() function from the scikit-learn library, we can split the dataset into train/validation/test sets with a 60/20/20 split. As the function only performs a single train/test split, we'll do a 60/40 split first, followed by a 50/50 split of the remainder.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Import the csv file from Part II of this tutorial
corolla = pd.read_csv('corolla_data_cleaned.csv')
# X contains the features, y is the target
X = corolla.drop('Price', axis=1)
y = corolla['Price']
# Train/validation/test split of 60/20/20%
X_train, X_rem, y_train, y_rem = train_test_split(
X, y, train_size=0.6, random_state=10
)
X_valid, X_test, y_valid, y_test = train_test_split(
X_rem, y_rem, train_size=0.5, random_state=10
)
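As a quick sanity check (an optional snippet), we can confirm that the three sets are roughly 60/20/20 of the full dataset:
# Number of rows in each split
print(len(X_train), len(X_valid), len(X_test))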
We'll first create a simple baseline model using the age and mileage of the vehicles. Let's have a look at the two features.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(1, 2, sharey=False, figsize=(14,4))
sns.scatterplot(ax=axs[0], x='Age', y='Price', data=corolla)
sns.scatterplot(ax=axs[1], x='Kilometres', y='Price', data=corolla)
plt.show()
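As a quick numeric companion to the plots, we can also check the pairwise correlations of the two features with the price (just a sanity check):
# Correlation of Age and Kilometres with Price
print(corolla[['Age', 'Kilometres', 'Price']].corr().round(2))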
Both relationships with price are fairly linear, though it would probably make more sense to fit an exponential decay curve, as the price will never go negative. Nevertheless, we'll fit these two features to a linear model to get a baseline result.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
mean_squared_error,
mean_absolute_error,
r2_score,
)
# Fit training set using Age and Mileage
reg = LinearRegression()
reg.fit(X_train[['Age','Kilometres']], y_train)
# Check RMSE and r2 fit on the validation set
pred = reg.predict(X_valid[['Age','Kilometres']])
rmse = mean_squared_error(y_valid, pred) ** (1/2)
r2 = r2_score(y_valid, pred)
print(f'RMSE: {rmse:.2f} \nr2: {r2:.3f}')
The r2 value on the validation set is quite good. We can look at the coefficients and intercept of the linear regression model via the coef_ and intercept_ attributes.
print(f'Coef (Age): {reg.coef_[0]:.2f} \
\nCoef (km): {reg.coef_[1]:.4f} \
\nIntercept: {reg.intercept_:.2f}')
This simple model says that a used Corolla starts at \$29,585 and loses \$809.46 per year of age and \$0.048 per kilometre travelled. So a 10 year old Corolla with 100,000 km on the odometer would be listed at roughly

$$\$29585 - 10 \times \$809.46 - 100000 \times \$0.048 \approx \$16,690$$

It's not a bad ballpark calculation, but it will begin to break down for older or high-mileage cars, as the predicted price can become negative.
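We can sanity-check this against the fitted model itself and, at the same time, see the breakdown described above. In this quick sketch, the second row is a hypothetical 25 year old car with 300,000 km, which with the coefficients above comes out at a negative price:
# Predict the example car (10 years, 100,000 km) and a hypothetical
# older, high-mileage car (25 years, 300,000 km)
example = pd.DataFrame({'Age': [10, 25], 'Kilometres': [100000, 300000]})
print(reg.predict(example))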
Taking a step back, we'll again plot the age of the vehicle against the price.
We'll also look at a residual plot - the residuals are the differences between the data points and the regression line. On the residual plot we set lowess=True, which overlays a LOWESS curve (Locally Weighted Scatterplot Smoothing) to make any trend in the residuals easier to see.
import seaborn as sns
fig, axs = plt.subplots(1, 2, sharey=False, figsize=(14,4))
sns.regplot(ax=axs[0], x=X_train['Age'], y=y_train,
line_kws={'color': 'red'})
sns.residplot(ax=axs[1], x=X_train['Age'], y=y_train,
lowess=True, line_kws={'color': 'red'})
axs[1].axhline(0, lw=0.5, color='black')  # zero-residual reference line
axs[0].set_title('Regression Plot')
axs[1].set_title('Residual Plot')
plt.show()
We can see from the residual plot that newer vehicles (<5 years) and older vehicles (>17 years) are being underestimated in value. We want the curve on the residual plot to be relatively flat around zero, which would indicate that the errors show no systematic pattern with age.
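If you'd like a numeric view of the same pattern, one rough check (with arbitrary bin edges) is to average the residuals of a simple price-on-age fit within a few age bands - positive averages mean those cars are being under-predicted:
# Residuals of a simple Price ~ Age fit, averaged within age bands
age_only = LinearRegression().fit(X_train[['Age']], y_train)
resid = y_train - age_only.predict(X_train[['Age']])
bands = pd.cut(X_train['Age'], bins=[0, 5, 10, 17, 30])
print(resid.groupby(bands).mean())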
Let's repeat the plots, this time transforming the age by taking its log.
fig, axs = plt.subplots(1, 2, sharey=False, figsize=(14,4))
sns.regplot(ax=axs[0], x=np.log(X_train['Age']), y=y_train,
line_kws={'color': 'red'})
sns.residplot(ax=axs[1], x=np.log(X_train['Age']), y=y_train,
lowess=True, line_kws={'color': 'red'})
axs[1].axhline(0, lw=0.5, color='black')  # zero-residual reference line
axs[0].set_title('Regression Plot')
axs[0].set_xlabel('Log of Age')
axs[1].set_title('Residual Plot')
axs[1].set_xlabel('Log of Age')
plt.show()
Great, this is a much better fit.
We can repeat the same for the kilometres travelled.
fig, axs = plt.subplots(1, 2, sharey=False, figsize=(14,4))
sns.regplot(ax=axs[0], x=X_train['Kilometres'], y=y_train,
line_kws={'color': 'red'})
sns.residplot(ax=axs[1], x=X_train['Kilometres'], y=y_train,
lowess=True, line_kws={'color': 'red'})
axs[1].axhline(0, lw=0.5, color='black')  # zero-residual reference line
axs[0].set_title('Regression Plot')
axs[1].set_title('Residual Plot')
plt.show()
We see a similar curve in the residual plot as before; let's again try taking the log.
fig, axs = plt.subplots(1, 2, sharey=False, figsize=(14,4))
sns.regplot(ax=axs[0], x=np.log(X_train['Kilometres']),
y=y_train, line_kws={'color': 'red'})
sns.residplot(ax=axs[1], x=np.log(X_train['Kilometres']),
y=y_train, lowess=True, line_kws={'color': 'red'})
axs[0].set(title='Regression Plot', xlabel='Log of Kilometres')
axs[1].set(title='Residual Plot', xlabel='Log of Kilometres')
plt.show()
Uh oh, we are perhaps worse off. How about the square root?
fig, axs = plt.subplots(1, 2, sharey=False, figsize=(14,4))
sns.regplot(ax=axs[0], x=np.sqrt(X_train['Kilometres']),
y=y_train, line_kws={'color': 'red'})
sns.residplot(ax=axs[1], x=np.sqrt(X_train['Kilometres']),
y=y_train, lowess=True, line_kws={'color': 'red'})
axs[1].axhline(0, lw=0.5, color='black')  # zero-residual reference line
axs[0].set(title='Regression Plot', xlabel='Sqrt of Kilometres')
axs[1].set(title='Residual Plot', xlabel='Sqrt of Kilometres')
plt.show()
This looks to be a much better fit.
We can now repeat fitting the training set to a linear regression model, this time using the transformed data. We'll be taking the log of the age and the square root of the kilometres travelled.
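As an aside, these transformations could also be wrapped into a scikit-learn pipeline so they're applied automatically at fit and predict time. Below is a minimal sketch of that alternative (it assumes X_train and X_valid still hold the untransformed columns); in this post, though, we'll simply transform the columns in place.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
# Log-transform Age and square-root-transform Kilometres inside the pipeline,
# so new data is transformed the same way at predict time
transform = ColumnTransformer([
    ('log_age', FunctionTransformer(np.log), ['Age']),
    ('sqrt_km', FunctionTransformer(np.sqrt), ['Kilometres']),
])
pipe = Pipeline([('transform', transform), ('reg', LinearRegression())])
pipe.fit(X_train[['Age', 'Kilometres']], y_train)
print(f'Validation r2: {pipe.score(X_valid[["Age", "Kilometres"]], y_valid):.3f}')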
# Currently, X_train is a slice so we need to create
# a copy before applying transformations
X_train = X_train.copy()
X_valid = X_valid.copy()
X_train['Age'] = np.log(X_train['Age'])
X_valid['Age'] = np.log(X_valid['Age'])
X_train['Kilometres'] = np.sqrt(X_train['Kilometres'])
X_valid['Kilometres'] = np.sqrt(X_valid['Kilometres'])
# Fit a linear regression model to the newly transformed data
reg = LinearRegression()
reg.fit(X_train[['Age','Kilometres']], y_train)
pred = reg.predict(X_valid[['Age','Kilometres']])
rmse = mean_squared_error(y_valid, pred) ** (1/2)
r2 = r2_score(y_valid, pred)
print(f'RMSE: {rmse:.2f}')
print(f'r2: {r2:.3f}')
Great, by transforming the data we have improved the r2 value on the validation set from 0.862 to 0.896. The RMSE has been reduced from 2912.75 to 2527.04.
Let's have a look at the coefficients and the intercept of this model.
print(f'Coef (Log of Age): {reg.coef_[0]:.2f} \
\nCoef (Sqrt of Km): {reg.coef_[1]:.2f} \
\nIntercept: {reg.intercept_:.2f}')
Again, we can look at a 10 year old Corolla with 100,000 km on the odometer; this time we predict it will be worth roughly

$$\$42499.90 - \ln(10) \times \$8389.19 - \sqrt{100000} \times \$27.84 \approx \$14,379$$

We'll now use all the features available to us, such as the trim level, transmission, body type etc.
reg = LinearRegression()
reg.fit(X_train, y_train)
pred = reg.predict(X_valid)
rmse = mean_squared_error(y_valid, pred) ** (1/2)
r2 = r2_score(y_valid, pred)
print(f'RMSE: {rmse:.2f}')
print(f'r2: {r2:.3f}')
By adding all the features, we've further improved the accuracy of the model. The r2 has risen from 0.896 to 0.944 and the RMSE has fallen from 2527.04 to 1858.40.
We can also use the model to make a prediction on the test set. Again, we'll need to do a transformation by taking the log of the age and the square root of the kilometres.
X_test = X_test.copy()
X_test['Age'] = np.log(X_test['Age'])
X_test['Kilometres'] = np.sqrt(X_test['Kilometres'])
pred_test = reg.predict(X_test)
rmse = mean_squared_error(y_test, pred_test) ** (1/2)
r2 = r2_score(y_test, pred_test)
print(f'RMSE: {rmse:.2f}')
print(f'r2: {r2:.3f}')
We end up with a similar result on the validation and test sets, a good sign that we haven't overfit the data. We can also do a quick plot of the predicted price against the actual price.
fig, ax = plt.subplots(figsize=(9,5))
sns.regplot(ax=ax, x=pred_test, y=y_test, line_kws={'color': 'red'})
ax.set(xlabel='Predicted Price', ylabel='Actual Price')
plt.show()
Let's take a look at the coefficients and the intercept.
coef_dict = dict(zip(X_train.columns, reg.coef_))
print('Coefficients:')
for k,v in coef_dict.items():
print(f'{k}: {v:.2f}')
print(f'\nIntercept: {reg.intercept_:.2f}')
We can interpret this as follows: a base model is valued at \$35,995; add \$4,938 if you want it as a hybrid, and an extra \$4,237 for the top-of-the-line trim. Buying it from a dealer adds another \$3,170. This seems an overestimate considering a top-of-the-line hybrid has an RRP of \$34,695, but at the time of writing, prices are inflated due to Covid.
It should also be noted that the prices are not what the cars sold for, but what the sellers are hoping to get. There are a lot of dreamers on Gumtree - sellers who think that the car they've modified with new rims, coil-overs and a body kit is now worth 50% more.
Another limitation becomes apparent when considering hybrid models. While a hybrid may attract a premium when new, its battery will eventually need to be replaced. A new battery can be quite costly, so older hybrids are eventually sold at a discount. We could perhaps engineer more features, or turn to decision trees, to account for this.
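As a rough illustration of that last idea (an untuned sketch on the transformed training data from above, not a recommendation):
from sklearn.tree import DecisionTreeRegressor
# An untuned decision tree on the same features, just to see how it compares
tree = DecisionTreeRegressor(max_depth=6, random_state=10)
tree.fit(X_train, y_train)
print(f'Validation r2: {tree.score(X_valid, y_valid):.3f}')
A single tree won't fix the listing-price noise described above, but it can capture non-linear effects such as the hybrid battery discount without manual transformations.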