Used Car Price Model (Part 2) - Cleaning and Feature Extraction

The satisfaction of cleaning a car is the same as cleaning a dataset. And who knows, one day in the future the guy at the traffic lights washing windscreens will be replaced by an out-of-work coder that will clean your dataset for 75 crypto-bucks.

In Part 1 of this series, we scraped Toyota Corolla adverts from Gumtree and saved them to a csv file. We'll now clean up the dataset by removing outliers and fixing any incorrectly entered information, followed by extracting additional features within the data.

Load the Dataset

First, we'll import Numpy, Pandas and the data from Part 1.

import numpy as np
import pandas as pd

# Import the csv file from Part I of this tutorial series
corolla = pd.read_csv('corolla_data.csv')

Let's have a look at the first entry in the dataset, which is the first advert that was scraped.

corolla.iloc[0,:]
Ad ID                                                          1293184000
Ad Title                                         Toyota Corolla 2011 Auto
Price                                                              $7,500
Description             Up for sale Toyota Corolla 2011 in perfect wor...
State                                                                  WA
Area                                                         Bunbury Area
Suburb                                                            Bunbury
Date Listed:                                                   05/04/2022
Last Edited:                                                   05/04/2022
Seller Type:                                               Private seller
Make:                                                              Toyota
Model:                                                            Corolla
Variant:                                                           ASCENT
Body Type:                                                          Sedan
Year:                                                                2011
Kilometres:                                                        181550
Transmission:                                                   Automatic
Drive Train:                                            Front Wheel Drive
Fuel Type:                                              Petrol - Unleaded
Colour:                                                              Blue
Air Conditioning:                                                     Yes
VIN:                                                    JTNBU56E40J061094
Registered:                                                           Yes
Registration Number:                                              1CRB103
Registration Expiry:                                1 months - 31/05/2022
Stock Number:                                                         NaN
Name: 0, dtype: object

We'll do a few quick fixes, such as cleaning up the column names and converting the price to a numeric type we can work with.

# Remove the semi-colons in the column names
corolla.rename(lambda x: x.replace(':',''), axis=1, inplace=True)

# Change the price column to integer type
corolla['Price'].replace(r'[\$,]', '', regex=True, inplace=True)
corolla['Price'] = corolla['Price'].astype(int)

# Convert the year of manufacture to the age of the vehicle
corolla['Age'] = 2022 - corolla['Year']
corolla.drop('Year', axis=1, inplace=True)

Removing Outliers

When searching for a used car, the focus is usually on the age of the vehicle and the mileage... or kilometrage here in metric land. Let's plot both against the price.

import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(1, 2, sharey=False, figsize=(14,4))
sns.scatterplot(ax=axs[0], x='Age', y='Price', data=corolla)
sns.scatterplot(ax=axs[1], x='Kilometres', y='Price', data=corolla)
plt.show()

There are a few suspicious data points here - very low kilometres travelled with a low sell price. We can slice the dataframe and have a closer look.

c = corolla[(corolla['Kilometres'] < 50000) & (corolla['Price'] < 10000)]
c[['Kilometres','Age', 'Description']]
Kilometres Age Description
2 26700 21 2001 Toyota Corolla conquest, only 26700kms, A...
34 192 29 Paint is a little faded, but other than that t...
71 138 11 Low kilometres 138 new aircond motor and gas a...
247 13000 12 Looking for car, must be in excellent conditio...

The first car appears to have only travelled 26,700km in 21 years, which is extremely low. Upon checking the advert, it is indeed legitimate. The next two entries are cases where the seller has entered 192 instead of 192,000 and 138 instead of 138,000. The final entry is a want-to-buy advert, which we will drop. While we're at it, will also drop unregistered vehicles as these are generally write-offs.

# Fix up where 192 = 192000 and 138 = 138000
corolla.loc[34,'Kilometres'] = 192000
corolla.loc[71,'Kilometres'] = 138000

# Fill in some other missing data
corolla.loc[29,'Body Type'] = 'Hatchback'
corolla.loc[29,'Transmission'] = 'Automatic'

# Drop WTB adverts and unregistered cars
corolla.drop(247, inplace=True)
corolla = corolla[corolla['Registered']=='Yes'].copy()
corolla.reset_index(inplace=True)
corolla.drop('index', axis=1, inplace=True)

Feature Extraction

There are a few extra features we can pull out from this dataset. First, we'll see if the car is a Hybrid.

corolla['Hybrid'] = (
    corolla['Variant']
    .apply(lambda x: 1 if 'HYBRID' in x.upper() else 0)
)

We can also check if the car has been well-maintained, the discription generally contains the phrase 'full service history'.

corolla['Service History'] = (
    corolla['Description']
    .apply(lambda x: 1 if 'FULL SERVICE HISTORY' in x.upper() else 0)
)

We'll categorise the different Corolla variants into trim levels - base, mid, top of the range and the sport options.

trim = {
    'Mid': ['CONQUEST', 'ASCENT SPORT'],
    'Base': ['CSI', 'ASCENT'],
    'Top': ['ULTIMA', 'ZR', 'CSX', 'CS-X'],
    'Sport': ['SPORTIVO', 'SX', 'LEVIN', 'RZ'],
}

def get_trim(variant):
    for k,v in trim.items():
        for model in v:
            if model in variant.upper():
                return k
    return 'Other'

corolla['Trim'] = corolla['Variant'].apply(get_trim)

Let's plot the newly created trim levels along with the age and price of the vehicles.

fig, axs = plt.subplots(figsize=(9,5))
sns.scatterplot(ax=axs, x='Age', y='Price', hue='Trim', data=corolla)
plt.show()

As expected, the higher level trim vehicles generally command a higher price than the lower/base models. To be able to use this categorical data in a linear model, we'll need to convert these to dummy variables.

features = ['Price', 'Seller Type', 'Body Type', 'Kilometres', 
         'Transmission', 'Age', 'Hybrid', 'Service History', 
         'Trim']
corolla = corolla[features]

# Convert to dummy variables
dummy_list = ['Seller Type', 'Body Type', 'Transmission', 'Trim']
dummies = pd.get_dummies(corolla[dummy_list], drop_first=True)
corolla = pd.concat([corolla.drop(dummy_list, axis=1), dummies], axis=1)
corolla.head()
Price Kilometres Age Hybrid Service History Seller Type_Dealer used Seller Type_Private seller Body Type_Sedan Body Type_Wagon Transmission_Manual Trim_Mid Trim_Other Trim_Sport Trim_Top
0 7500 181550 11 0 1 0 1 1 0 0 0 0 0 0
1 10490 136560 11 0 0 1 0 1 0 1 1 0 0 0
2 12000 207000 10 0 0 0 1 0 0 0 0 0 0 0
3 21990 76159 7 0 0 1 0 1 0 1 0 0 1 0
4 19888 69205 6 0 0 1 0 0 0 0 0 0 0 0

We're now ready to begin constructing our predictive model. We'll save the dataframe as a csv file and continue this in Part 3.

corolla.to_csv('corolla_data_cleaned.csv', index=False)