The satisfaction of cleaning a car is the same as cleaning a dataset. And who knows, one day in the future the guy at the traffic lights washing windscreens will be replaced by an out-of-work coder that will clean your dataset for 75 crypto-bucks.
In Part 1 of this series, we scraped Toyota Corolla adverts from Gumtree and saved them to a csv file. We'll now clean up the dataset by removing outliers and fixing any incorrectly entered information, followed by extracting additional features within the data.
First, we'll import Numpy, Pandas and the data from Part 1.
import numpy as np
import pandas as pd
# Import the csv file from Part I of this tutorial series
corolla = pd.read_csv('corolla_data.csv')
Let's have a look at the first entry in the dataset, which is the first advert that was scraped.
corolla.iloc[0,:]
We'll do a few quick fixes, such as cleaning up the column names and converting the price to a numeric type we can work with.
# Remove the semi-colons in the column names
corolla.rename(lambda x: x.replace(':',''), axis=1, inplace=True)
# Change the price column to integer type
corolla['Price'].replace(r'[\$,]', '', regex=True, inplace=True)
corolla['Price'] = corolla['Price'].astype(int)
# Convert the year of manufacture to the age of the vehicle
corolla['Age'] = 2022 - corolla['Year']
corolla.drop('Year', axis=1, inplace=True)
When searching for a used car, the focus is usually on the age of the vehicle and the mileage... or kilometrage here in metric land. Let's plot both against the price.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(1, 2, sharey=False, figsize=(14,4))
sns.scatterplot(ax=axs[0], x='Age', y='Price', data=corolla)
sns.scatterplot(ax=axs[1], x='Kilometres', y='Price', data=corolla)
plt.show()
There are a few suspicious data points here - very low kilometres travelled with a low sell price. We can slice the dataframe and have a closer look.
c = corolla[(corolla['Kilometres'] < 50000) & (corolla['Price'] < 10000)]
c[['Kilometres','Age', 'Description']]
The first car appears to have only travelled 26,700km in 21 years, which is extremely low. Upon checking the advert, it is indeed legitimate. The next two entries are cases where the seller has entered 192 instead of 192,000 and 138 instead of 138,000. The final entry is a want-to-buy advert, which we will drop. While we're at it, will also drop unregistered vehicles as these are generally write-offs.
# Fix up where 192 = 192000 and 138 = 138000
corolla.loc[34,'Kilometres'] = 192000
corolla.loc[71,'Kilometres'] = 138000
# Fill in some other missing data
corolla.loc[29,'Body Type'] = 'Hatchback'
corolla.loc[29,'Transmission'] = 'Automatic'
# Drop WTB adverts and unregistered cars
corolla.drop(247, inplace=True)
corolla = corolla[corolla['Registered']=='Yes'].copy()
corolla.reset_index(inplace=True)
corolla.drop('index', axis=1, inplace=True)
There are a few extra features we can pull out from this dataset. First, we'll see if the car is a Hybrid.
corolla['Hybrid'] = (
corolla['Variant']
.apply(lambda x: 1 if 'HYBRID' in x.upper() else 0)
)
We can also check if the car has been well-maintained, the discription generally contains the phrase 'full service history'.
corolla['Service History'] = (
corolla['Description']
.apply(lambda x: 1 if 'FULL SERVICE HISTORY' in x.upper() else 0)
)
We'll categorise the different Corolla variants into trim levels - base, mid, top of the range and the sport options.
trim = {
'Mid': ['CONQUEST', 'ASCENT SPORT'],
'Base': ['CSI', 'ASCENT'],
'Top': ['ULTIMA', 'ZR', 'CSX', 'CS-X'],
'Sport': ['SPORTIVO', 'SX', 'LEVIN', 'RZ'],
}
def get_trim(variant):
for k,v in trim.items():
for model in v:
if model in variant.upper():
return k
return 'Other'
corolla['Trim'] = corolla['Variant'].apply(get_trim)
Let's plot the newly created trim levels along with the age and price of the vehicles.
fig, axs = plt.subplots(figsize=(9,5))
sns.scatterplot(ax=axs, x='Age', y='Price', hue='Trim', data=corolla)
plt.show()
As expected, the higher level trim vehicles generally command a higher price than the lower/base models. To be able to use this categorical data in a linear model, we'll need to convert these to dummy variables.
features = ['Price', 'Seller Type', 'Body Type', 'Kilometres',
'Transmission', 'Age', 'Hybrid', 'Service History',
'Trim']
corolla = corolla[features]
# Convert to dummy variables
dummy_list = ['Seller Type', 'Body Type', 'Transmission', 'Trim']
dummies = pd.get_dummies(corolla[dummy_list], drop_first=True)
corolla = pd.concat([corolla.drop(dummy_list, axis=1), dummies], axis=1)
corolla.head()
We're now ready to begin constructing our predictive model. We'll save the dataframe as a csv file and continue this in Part 3.
corolla.to_csv('corolla_data_cleaned.csv', index=False)