Naive Forecasting

The Naive model is extremely simple - take the last observed value and use this as the prediction. For such a basic model it can prove to be quite powerful. For example, here in Perth, we can forecast whether it is going to rain or not by using a naive model. If it rained today, then we forecast that it will rain tomorrow, and vice versa. Using this model we end up being correct about 80% of the time.

MON	TUE	WED	THU	FRI	SAT	SUN	FORECAST (MON)
🌧️	🌧️	☀️	☀️	☀️	🌧️	🌧️	🌧️
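
The rain example can be sketched in a few lines. This is only an illustration using the seven days in the table above (1 = rain, 0 = no rain). The function name `naive_hit_rate` is made up for this sketch, and one week is far too short to reproduce the 80% figure.

```python
import numpy as np

def naive_hit_rate(observed):
    """Fraction of days where 'same as yesterday' matched what happened."""
    observed = np.asarray(observed)
    forecasts = observed[:-1]  # forecast for day t is the value on day t-1
    actuals = observed[1:]     # what actually happened on day t
    return float(np.mean(forecasts == actuals))

# MON..SUN from the table: rain, rain, sun, sun, sun, rain, rain
week = [1, 1, 0, 0, 0, 1, 1]
hit_rate = naive_hit_rate(week)
monday_forecast = week[-1]  # naive forecast for next Monday is Sunday's value
print(round(hit_rate, 3), monday_forecast)
```

The forecast is only wrong on the days the weather changes, which is why the naive model does well when conditions tend to persist.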

First we will explore a naive forecast and then see if we can improve our predictive powers using a seasonal naive forecast.

M5 Dataset

We'll use the M5 dataset hosted on Kaggle to run through a naive forecast. The M5 dataset contains Walmart unit sales for 1913 days, and the goal of the Kaggle competition is to accurately forecast the following 28 days of sales. We'll first read in the sales figures using Pandas.

import numpy as np
import pandas as pd

sales = pd.read_csv('data/sales_train_validation.csv')
print(sales.shape)
sales.head()

(30490, 1919)

The first six columns are identifiers for each product - a unique store-product id, item id, and then which department, category, store and state it belongs to. Following that are the unit sales recorded for 1913 days for the 30490 products.
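
If you want to separate the identifier columns from the daily sales columns, one possible approach is to select on the `d_` prefix. The sketch below uses a tiny made-up dataframe (`toy_sales`) with the same column layout rather than the real file, so the values are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the sales dataframe: 6 identifier columns + 3 day columns
toy_sales = pd.DataFrame({
    'id': ['A_validation', 'B_validation'],
    'item_id': ['A', 'B'],
    'dept_id': ['D1', 'D1'],
    'cat_id': ['C1', 'C1'],
    'store_id': ['S1', 'S1'],
    'state_id': ['CA', 'CA'],
    'd_1': [0, 2], 'd_2': [1, 0], 'd_3': [3, 1],
})

# Day columns are named d_1, d_2, ...; everything else is an identifier
day_cols = [c for c in toy_sales.columns if c.startswith('d_')]
id_cols = [c for c in toy_sales.columns if c not in day_cols]
print(len(id_cols), len(day_cols))
```

On the real dataframe the same split should give 6 identifier columns and 1913 day columns, matching the shape printed above.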

Implementing a Naive Model

A Naive model takes the last observed value and uses it as the forecast. So to start off with, we'll grab the last column of the sales dataframe.

naive = sales.iloc[:,-1]

Easy.

Measuring the Error

Now that we've got the values we're going to use for the forecast, we need a metric to gauge how accurate we are. Kaggle uses the WRMSSE for the error measurement, which is a weighted and scaled RMSE. You can find out more about it in the WRMSSE tutorial.
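
To give a feel for the "scaled" part, here is a rough sketch of a scaled RMSE for a single series, where the forecast error is divided by the average squared day-to-day change in the training data. It follows the RMSSE definition as I understand it from the M5 competition guide, but it is a simplification: the full WRMSSE also weights each series and aggregates over several levels, which is not attempted here. The function name `rmsse` and the numbers are made up for illustration.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Scaled RMSE for one series (simplified sketch, not the full WRMSSE)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Scale: mean squared one-step change in the training data
    scale = np.mean(np.diff(train) ** 2)
    # Mean squared error of the forecast over the horizon
    mse = np.mean((actual - forecast) ** 2)
    return float(np.sqrt(mse / scale))

score = rmsse(train=[2, 4, 3, 5], actual=[4, 6], forecast=[5, 5])
print(round(score, 4))
```

Because the scale is the in-sample error of a one-step naive forecast, a score below 1 roughly means the forecast beats that one-step naive benchmark on this series.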

For now, we'll go along with submitting it to Kaggle. But later we will use the m5-wrmsse package as it will be a lot quicker.

Let's have a peek at the sample submission file that Kaggle has supplied.

submission_file = pd.read_csv('data/sample_submission.csv')
submission_file

The competition requires that we forecast 28 days of sales. You may notice that the submission file has 60,980 rows - twice as many as the sales dataframe. This can be confusing if you're starting after the competition has closed. The competition was conducted in two stages: in the first stage only the validation set was available, and the evaluation set was released later.

The top 30,490 rows of the submission file are for the 28-day forecast using the sales_train_validation.csv dataset. The validation set contains days 1 to 1913, so the forecast will be for days 1914 to 1941.

The bottom 30,490 rows of the submission file are for the 28-day forecast using the sales_train_evaluation.csv dataset. The evaluation set contains days 1 to 1941, so the forecast will be for days 1942 to 1969.
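
The day ranges for the two halves follow directly from the last training day and the 28-day horizon. A small sketch of that arithmetic, using a hypothetical helper called `forecast_days`:

```python
def forecast_days(last_train_day, h=28):
    """Labels of the h days that follow the last training day."""
    return [f'd_{d}' for d in range(last_train_day + 1, last_train_day + h + 1)]

validation_days = forecast_days(1913)  # top half of the submission file
evaluation_days = forecast_days(1941)  # bottom half of the submission file
print(validation_days[0], validation_days[-1])
print(evaluation_days[0], evaluation_days[-1])
```

Note that the submission file itself labels these columns F1 to F28 in both halves, so the mapping to actual days depends on which half a row sits in.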

To get a score on the Public Leaderboard, you only need to fill in the top half of the submission file. The bottom half can be left as zeros for now. So let's make a copy of the submission_file dataframe and fill in the top half with our prediction from the naive model.

submission_naive = submission_file.copy()
submission_naive.iloc[:30490,1:] = np.array([naive]*28).T
submission_naive.to_csv('submission_naive.csv', index=False)
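
The expression `np.array([naive]*28).T` repeats the last-day values across all 28 forecast columns. If that line looks cryptic, the toy example below shows what it produces and compares it with an equivalent `np.tile` call. It uses four made-up products and a horizon of 3 so the result is easy to read.

```python
import numpy as np
import pandas as pd

h = 3  # short horizon for the toy example (the competition uses 28)
last_day = pd.Series([1, 0, 4, 2])  # made-up last observed sales, 4 products

# Approach used above: stack h copies as rows, then transpose
stacked = np.array([last_day] * h).T

# Equivalent: turn the values into a column and tile it h times across
tiled = np.tile(last_day.to_numpy()[:, None], (1, h))

print(stacked.shape)
print(stacked)
```

Each row is one product and each column is one forecast day, so every row holds the same value repeated across the horizon.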

After you submit this to Kaggle, you should get a WRMSSE score of 1.46378 on the Public Leaderboard.

We'll see if we can improve on this using a seasonal naive approach.

Seasonal Naive

The naive forecast we created takes the very last day of sales to use as the forecast. We can try to improve on this by taking into account seasonal fluctuations in the data. Now, seasonal doesn't necessarily mean summer, autumn, winter and spring. It can mean a weekly pattern, or a daily pattern. Maybe a store is always busy on Saturday, but quiet on Monday.

Let's see if there's a trend throughout the week. We'll find the total unit sales for each day of our data and then combine it with information in calendar.csv so we know which weekday it corresponds to.

calendar = pd.read_csv('data/calendar.csv')
calendar.head()

total_sales = sales.filter(like='d_', axis=1).sum()
total_sales = pd.DataFrame(total_sales, columns=['total_sales'])
total_sales = total_sales.reset_index().rename(columns={'index': 'd'})
total_sales = calendar[['weekday','month','d']].merge(total_sales, how='left', on='d')
total_sales = total_sales.dropna()
total_sales.head()

Let's use Seaborn to do a boxplot, using the weekday as a categorical variable.

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='weekday', y='total_sales', data=total_sales)
plt.show()

We can see there is a definite trend - Saturday and Sunday have significantly higher unit sales on average, which then fall to a low during mid-week.
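
If you'd like numbers to go with the plot, a groupby on the weekday gives the average for each day. The sketch below runs on a small made-up dataframe with the same two columns as total_sales; the figures are invented and only there to show the mechanics.

```python
import pandas as pd

# Toy stand-in for total_sales with invented values
toy = pd.DataFrame({
    'weekday': ['Saturday', 'Sunday', 'Monday',
                'Saturday', 'Sunday', 'Monday'],
    'total_sales': [300.0, 280.0, 200.0, 320.0, 300.0, 180.0],
})

# Mean total sales per weekday, busiest day first
weekday_mean = (toy.groupby('weekday')['total_sales']
                   .mean()
                   .sort_values(ascending=False))
print(weekday_mean)
```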

Let's also take a look at the unit sales by month.

sns.boxplot(x='month', y='total_sales', data=total_sales)
plt.show()

The pattern isn't as clear here. Some industries will have big swings throughout the year, for example an air conditioning company will normally see an uptick during the summer and winter months.

Testing Different Seasonal Naive Models

We will push through with using a 7-day seasonal cycle. There are a few options. The first involves taking the last 7 days of sales and repeating them 4 times to obtain the 28-day forecast. We'll also try using the last month of sales, as well as the sales from the same month in the previous year.

# Forecasting Horizon
h = 28

# Last Week
naive_lw = pd.concat([sales.iloc[:,-7:]]*4, axis=1, ignore_index=True)

# Last Month
naive_lm = sales.iloc[:,-h:]

# Last Year
naive_ly = sales.iloc[:,-13*h:-12*h]
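
The "last week" and "last month" forecasts are both cases of the same idea: take the last full season and repeat it until the horizon is filled. A generic sketch of that idea is below. The function name `seasonal_naive` and its arguments are hypothetical, and it works on a plain NumPy array of made-up numbers rather than the sales dataframe.

```python
import numpy as np

def seasonal_naive(history, season_length, h):
    """Repeat the last `season_length` columns until `h` columns are filled."""
    history = np.asarray(history)
    last_season = history[:, -season_length:]
    reps = -(-h // season_length)  # ceiling division
    return np.tile(last_season, (1, reps))[:, :h]

# Two toy series with 10 observations each
history = np.arange(20).reshape(2, 10)
forecast = seasonal_naive(history, season_length=3, h=7)
print(forecast)
```

With `season_length=7` and `h=28` this should give the same values as the last-week forecast above, and with `season_length=28` the last-month forecast.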

We have created three different 28-day forecasts, and we can now compare their errors. You can either submit these to Kaggle, or use the m5-wrmsse package outlined in the WRMSSE tutorial.

from m5_wrmsse import wrmsse

wrmsse_lw = wrmsse(naive_lw.values)
wrmsse_lm = wrmsse(naive_lm.values)
wrmsse_ly = wrmsse(naive_ly.values)

print(f'WRMSSE (Last Week): {wrmsse_lw:.3f}')
print(f'WRMSSE (Last Month): {wrmsse_lm:.3f}')
print(f'WRMSSE (Last Year): {wrmsse_ly:.3f}')

WRMSSE (Last Week): 0.870
WRMSSE (Last Month): 0.838
WRMSSE (Last Year): 1.294

Using the previous 4-week period as the forecast produced the lowest WRMSSE, followed closely by repeating just the previous week. The error increased dramatically when forecasting with the same month from the previous year - most likely due to product lines changing in that time. A store like Walmart continually changes the products it sells, so it is difficult to make an accurate forecast based on last year's sales figures. However, all three seasonal naive models were an improvement over using just the final day as the forecast (WRMSSE=1.46).
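
To put the comparison in relative terms, the scores reported above can be expressed as a percentage reduction against the plain naive baseline. This is just arithmetic on the numbers already shown; the variable names are made up for the sketch.

```python
baseline = 1.46378  # plain naive score from the Public Leaderboard
scores = {'last week': 0.870, 'last month': 0.838, 'last year': 1.294}

# Fractional reduction in WRMSSE relative to the plain naive forecast
improvement = {name: 1 - score / baseline for name, score in scores.items()}

for name, gain in improvement.items():
    print(f'{name}: {gain:.1%} lower WRMSSE than the naive baseline')
```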

The naive or seasonal naive forecast is often used as a baseline: it's a very simple technique and often produces quite a good result. More complex forecasting techniques can be compared to the baseline to see how much of an improvement they make, though this generally comes with increased computation time. We'll dig deeper into the M5 dataset in future tutorials on the Adaptations of Croston's Method, ADIDA and Intermittent Demand and Multiple Temporal Aggregation.

Output of sales.head():
	id	item_id	dept_id	cat_id	store_id	state_id	...	d_1904	d_1905	d_1906	d_1907	d_1908	d_1909	d_1910	d_1911	d_1912	d_1913
0	HOBBIES_1_001_CA_1_validation	HOBBIES_1_001	HOBBIES_1	HOBBIES	CA_1	CA	...	1	3	0	1	1	1	3	0	1	1
1	HOBBIES_1_002_CA_1_validation	HOBBIES_1_002	HOBBIES_1	HOBBIES	CA_1	CA	...	0	0	0	0	0	1	0	0	0	0
2	HOBBIES_1_003_CA_1_validation	HOBBIES_1_003	HOBBIES_1	HOBBIES	CA_1	CA	...	2	1	2	1	1	1	0	1	1	1
3	HOBBIES_1_004_CA_1_validation	HOBBIES_1_004	HOBBIES_1	HOBBIES	CA_1	CA	...	1	0	5	4	1	0	1	3	7	2
4	HOBBIES_1_005_CA_1_validation	HOBBIES_1_005	HOBBIES_1	HOBBIES	CA_1	CA	...	2	1	1	0	1	1	2	2	2	4

Contents of submission_file:
	id	F1	F2	F3	F4	F5	F6	F7	F8	F9	...	F19	F20	F21	F22	F23	F24	F25	F26	F27	F28
0	HOBBIES_1_001_CA_1_validation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	HOBBIES_1_002_CA_1_validation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	HOBBIES_1_003_CA_1_validation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	HOBBIES_1_004_CA_1_validation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	HOBBIES_1_005_CA_1_validation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
60975	FOODS_3_823_WI_3_evaluation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
60976	FOODS_3_824_WI_3_evaluation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
60977	FOODS_3_825_WI_3_evaluation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
60978	FOODS_3_826_WI_3_evaluation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
60979	FOODS_3_827_WI_3_evaluation	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

Output of calendar.head():
	date	wm_yr_wk	weekday	wday	month	year	d	event_name_1	event_type_1	event_name_2	event_type_2	snap_CA	snap_TX	snap_WI
0	2011-01-29	11101	Saturday	1	1	2011	d_1	NaN	NaN	NaN	NaN	0	0	0
1	2011-01-30	11101	Sunday	2	1	2011	d_2	NaN	NaN	NaN	NaN	0	0	0
2	2011-01-31	11101	Monday	3	1	2011	d_3	NaN	NaN	NaN	NaN	0	0	0
3	2011-02-01	11101	Tuesday	4	2	2011	d_4	NaN	NaN	NaN	NaN	1	1	0
4	2011-02-02	11101	Wednesday	5	2	2011	d_5	NaN	NaN	NaN	NaN	1	0	1

Output of total_sales.head():
	weekday	month	d	total_sales
0	Saturday	1	d_1	32631.0
1	Sunday	1	d_2	31749.0
2	Monday	1	d_3	23783.0
3	Tuesday	2	d_4	25412.0
4	Wednesday	2	d_5	19146.0