Naive Forecasting

The Naive model is extremely simple - take the last observed value and use this as the prediction. For such a basic model it can prove to be quite powerful. For example, here in Perth, we can forecast whether it is going to rain or not by using a naive model. If it rained today, then we forecast that it will rain tomorrow, and vice versa. Using this model we end up being correct about 80% of the time.

MON TUE WED THU FRI SAT SUN FORECAST (MON)
🌧️ 🌧️ ☀️ ☀️ ☀️ 🌧️ 🌧️ 🌧️
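
As a rough illustration, the sketch below scores a naive forecast on the seven days shown above. The `observed` list and the other variable names are made up for this example, and with only one week of toy data the hit rate is not expected to match the 80% figure.

```python
# Toy data taken from the week shown above: True = rain, False = sun (MON..SUN).
observed = [True, True, False, False, False, True, True]

# The naive forecast for each day is simply the previous day's observation.
forecasts = observed[:-1]   # MON..SAT, used to predict TUE..SUN
actuals = observed[1:]      # TUE..SUN

naive_hits = sum(f == a for f, a in zip(forecasts, actuals))
accuracy = naive_hits / len(actuals)
print(f"Correct on {naive_hits} of {len(actuals)} days ({accuracy:.0%})")

# The forecast for next Monday is the last observed value (Sunday).
next_forecast = observed[-1]
```

The naive forecast only misses on the days where the weather changes, which is why it does well whenever conditions tend to persist.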

First we will explore a naive forecast and then see if we can improve our predictive powers using a seasonal naive forecast.

M5 Dataset

We'll use the M5 dataset hosted by Kaggle to run through a naive forecast. The M5 dataset contains Walmart's unit sales for 1913 days, and the goal of the Kaggle competition is to accurately forecast the following 28 days of sales. We'll first read in the sales figures using Pandas.

import numpy as np
import pandas as pd

sales = pd.read_csv('data/sales_train_validation.csv')
print(sales.shape)
sales.head()
(30490, 1919)
id item_id dept_id cat_id store_id state_id d_1 d_2 d_3 d_4 ... d_1904 d_1905 d_1906 d_1907 d_1908 d_1909 d_1910 d_1911 d_1912 d_1913
0 HOBBIES_1_001_CA_1_validation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 1 3 0 1 1 1 3 0 1 1
1 HOBBIES_1_002_CA_1_validation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
2 HOBBIES_1_003_CA_1_validation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 2 1 2 1 1 1 0 1 1 1
3 HOBBIES_1_004_CA_1_validation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 1 0 5 4 1 0 1 3 7 2
4 HOBBIES_1_005_CA_1_validation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0 ... 2 1 1 0 1 1 2 2 2 4

5 rows × 1919 columns

The first six columns are identifiers for each product - a unique store-product id, item id, and then which department, category, store and state it belongs to. Following that are the unit sales recorded for 1913 days for the 30490 products.

Implementing a Naive Model

A Naive model takes the last observed value and uses it as the forecast. So to start off with, we'll grab the last column of the sales dataframe.

naive = sales.iloc[:,-1]

Easy.
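
To make the indexing concrete, here is a minimal sketch on a made-up three-day frame (`toy_sales` and its values are hypothetical, not M5 data). It shows that `iloc[:, -1]` returns one value per product, taken from the most recent day.

```python
import pandas as pd

# Hypothetical miniature version of the sales frame: one id column, three days.
toy_sales = pd.DataFrame({
    "id": ["A", "B"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 5],
})

# The last column holds the most recent day, so it becomes the naive forecast.
toy_naive = toy_sales.iloc[:, -1]
print(toy_naive)
```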

Measuring the Error

Now that we have the values for our forecast, we need a metric to gauge how accurate they are. Kaggle uses the WRMSSE for the error measurement, which is a weighted and scaled RMSE. You can find out more about it in the WRMSSE tutorial.

For now, we'll go along with submitting it to Kaggle. But later we will use the m5-wrmsse package as it will be a lot quicker.
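
To give a feel for the "scaled" part of the metric, the sketch below computes an RMSSE for a single series: the forecast's mean squared error divided by the mean squared error of a one-step naive forecast on the training data. This is a simplified assumption-laden version. It leaves out the weighting across series and the other details of the competition metric, and the `rmsse` function and toy numbers are made up for illustration.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Simplified single-series RMSSE (no weights, no M5-specific handling)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)

    # Scale: mean squared error of a one-step naive forecast on the training data.
    # Note: this is zero for a constant series, which this sketch does not guard against.
    scale = np.mean(np.diff(train) ** 2)

    # Mean squared error of the forecast over the horizon.
    mse = np.mean((actual - forecast) ** 2)
    return np.sqrt(mse / scale)

# Toy numbers, purely illustrative.
score = rmsse(train=[1, 3, 2, 4], actual=[5, 3], forecast=[4, 4])
print(round(score, 3))
```

A value below 1 means the forecast has a smaller squared error than the one-step naive forecast had on the training data.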

Let's have a peek at the sample submission file that Kaggle has supplied.

submission_file = pd.read_csv('data/sample_submission.csv')
submission_file
id F1 F2 F3 F4 F5 F6 F7 F8 F9 ... F19 F20 F21 F22 F23 F24 F25 F26 F27 F28
0 HOBBIES_1_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 HOBBIES_1_002_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 HOBBIES_1_003_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 HOBBIES_1_004_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 HOBBIES_1_005_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
60975 FOODS_3_823_WI_3_evaluation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
60976 FOODS_3_824_WI_3_evaluation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
60977 FOODS_3_825_WI_3_evaluation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
60978 FOODS_3_826_WI_3_evaluation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
60979 FOODS_3_827_WI_3_evaluation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

60980 rows × 29 columns

The competition requires that we forecast 28 days of sales. You may notice that the submission file has 60,980 rows - this is twice as many rows as the sales dataframe. This may be confusing if you're starting after the competition has already closed. The competition was conducted in two stages - in the first stage only the validation set was available. Later on, the evaluation set also became available.

The top 30,490 rows of the submission file are for the 28-day forecast using the sales_train_validation.csv dataset. The validation set contains days 1 to 1913, so the forecast will be for days 1914 to 1941.

The bottom 30,490 rows of the submission file are for the 28-day forecast using the sales_train_evaluation.csv dataset. The evaluation set contains days 1 to 1941, so the forecast will be for days 1942 to 1969.

To get a score on the Public Leaderboard, you only need to fill in the top half of the submission file. The bottom half can be left as zeros for now. So let's make a copy of the submission_file dataframe and fill in the top half with our prediction from the naive model.

submission_naive = submission_file.copy()
submission_naive.iloc[:30490,1:] = np.array([naive]*28).T
submission_naive.to_csv('submission_naive.csv', index=False)
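
The fill step above can be checked on a small scale. The sketch below uses a made-up horizon of 3 days and two products (all names and values are hypothetical) and repeats each product's naive value across the forecast columns with `np.tile`, leaving the bottom "evaluation" half as zeros.

```python
import numpy as np
import pandas as pd

toy_h = 3                          # toy horizon instead of 28
toy_naive = pd.Series([3, 5])      # last observed value for two products

# Hypothetical miniature submission file: validation rows on top, evaluation below.
toy_submission = pd.DataFrame(
    0,
    index=range(4),
    columns=[f"F{i}" for i in range(1, toy_h + 1)],
)
toy_submission.insert(
    0, "id", ["A_validation", "B_validation", "A_evaluation", "B_evaluation"]
)

# Repeat each product's naive value across all forecast columns.
block = np.tile(toy_naive.to_numpy()[:, None], (1, toy_h))
toy_submission.iloc[:2, 1:] = block
print(toy_submission)
```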

After you submit this to Kaggle, you should get a WRMSSE score of 1.46378 on the Public Leaderboard.

We'll see if we can improve on this using a seasonal naive approach.

Seasonal Naive

The naive forecast we created takes the very last day of sales to use as the forecast. We can try to improve on this by taking into account seasonal fluctuations in the data. Now, seasonal doesn't necessarily mean summer, autumn, winter and spring. It can mean a weekly pattern, or a daily pattern. Maybe a store is always busy on Saturday, but quiet on Monday.
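
As a minimal sketch of the idea, a seasonal naive forecast for a single series repeats the last full season for as long as the horizon requires. The `seasonal_naive` helper and the toy series are assumptions made for this example, not part of any library.

```python
import numpy as np

def seasonal_naive(history, season_length, horizon):
    """Repeat the last `season_length` observations to cover `horizon` steps."""
    history = np.asarray(history)
    if len(history) < season_length:
        raise ValueError("history must contain at least one full season")

    last_season = history[-season_length:]
    repeats = -(-horizon // season_length)   # ceiling division
    return np.tile(last_season, repeats)[:horizon]

# Two weeks of made-up daily sales with a weekend peak.
weekly = [5, 1, 1, 2, 3, 8, 9, 6, 2, 1, 2, 4, 9, 10]
forecast = seasonal_naive(weekly, season_length=7, horizon=10)
print(forecast)
```

With a season length of 1 this reduces to the plain naive forecast, since only the final observation is repeated.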

Let's see if there's a pattern throughout the week. We'll find the total unit sales for each day of our data and then combine it with information in calendar.csv so we know which weekday it corresponds to.

calendar = pd.read_csv('data/calendar.csv')
calendar.head()
date wm_yr_wk weekday wday month year d event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
0 2011-01-29 11101 Saturday 1 1 2011 d_1 NaN NaN NaN NaN 0 0 0
1 2011-01-30 11101 Sunday 2 1 2011 d_2 NaN NaN NaN NaN 0 0 0
2 2011-01-31 11101 Monday 3 1 2011 d_3 NaN NaN NaN NaN 0 0 0
3 2011-02-01 11101 Tuesday 4 2 2011 d_4 NaN NaN NaN NaN 1 1 0
4 2011-02-02 11101 Wednesday 5 2 2011 d_5 NaN NaN NaN NaN 1 0 1
total_sales = sales.filter(like='d_', axis=1).sum()
total_sales = pd.DataFrame(total_sales, columns=['total_sales'])
total_sales = total_sales.reset_index().rename(columns={'index': 'd'})
total_sales = calendar[['weekday','month','d']].merge(total_sales, how='left', on='d')
total_sales = total_sales.dropna()
total_sales.head()
weekday month d total_sales
0 Saturday 1 d_1 32631.0
1 Sunday 1 d_2 31749.0
2 Monday 1 d_3 23783.0
3 Tuesday 2 d_4 25412.0
4 Wednesday 2 d_5 19146.0

Let's use Seaborn to do a boxplot, using the weekday as a categorical variable.

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='weekday', y='total_sales', data=total_sales)
plt.show()

We can see there is a definite weekly pattern - Saturday and Sunday have noticeably higher unit sales on average, which then fall to a low during mid-week.
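
If you'd like numbers to go with the plot, a groupby gives the same comparison in table form. The sketch below runs on a small made-up frame (`toy_totals` is hypothetical) with the same `weekday` and `total_sales` columns, so the same two lines should carry over to the real dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the total_sales dataframe built above.
toy_totals = pd.DataFrame({
    "weekday": ["Saturday", "Sunday", "Monday", "Saturday", "Sunday", "Monday"],
    "total_sales": [30.0, 28.0, 20.0, 34.0, 32.0, 22.0],
})

# Median unit sales per weekday, highest first.
weekday_median = (
    toy_totals.groupby("weekday")["total_sales"]
    .median()
    .sort_values(ascending=False)
)
print(weekday_median)
```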

Let's also take a look at the unit sales by month.

sns.boxplot(x='month', y='total_sales', data=total_sales)
plt.show()

There isn't as clear a pattern here. Some industries will have big swings throughout the year - for example, an air conditioning company will normally see an uptick during the summer and winter months.

Testing Different Seasonal Naive Models

We will push ahead with a 7-day seasonal cycle. There are a few options. The first takes the last 7 days of sales and repeats them 4 times to obtain the 28-day forecast. We'll also try using the last 28 days of sales (roughly the last month), as well as the same 28-day window from 52 weeks earlier (the same period in the previous year).

# Forecasting Horizon
h = 28

# Last Week
naive_lw = pd.concat([sales.iloc[:,-7:]]*4, axis=1, ignore_index=True)

# Last Month
naive_lm = sales.iloc[:,-h:]

# Last Year
naive_ly = sales.iloc[:,-13*h:-12*h]
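
# The "Last Year" slice is worth a quick sanity check. The sketch below rebuilds
# just the day column names (an assumption standing in for the real dataframe)
# and confirms which days the slice -13*h:-12*h picks out.

```python
h = 28

# Stand-in for the 1913 day columns of the sales dataframe.
day_cols = [f"d_{i}" for i in range(1, 1914)]

# Same slice as naive_ly, applied to the column names only.
last_year_cols = day_cols[-13 * h:-12 * h]
print(last_year_cols[0], last_year_cols[-1], len(last_year_cols))

# 13 * 28 = 364 days = 52 weeks, so the window is shifted by whole weeks
# and each forecast day lines up with the same weekday one year earlier.
shift_in_days = 13 * h
```

# The first forecast day is d_1914, and 1914 - 364 = 1550, so the window starts
# on the matching weekday 52 weeks before the forecast period.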

We have created three different 28-day forecasts and can now compare their errors. You can either submit them to Kaggle, or use the m5-wrmsse package outlined in the WRMSSE tutorial.

from m5_wrmsse import wrmsse

wrmsse_lw = wrmsse(naive_lw.values)
wrmsse_lm = wrmsse(naive_lm.values)
wrmsse_ly = wrmsse(naive_ly.values)

print('WRMSSE (Last Week): %.3f\
      \nWRMSSE (Last Month): %.3f\
      \nWRMSSE (Last Year): %.3f' % 
      (wrmsse_lw, wrmsse_lm, wrmsse_ly))
WRMSSE (Last Week): 0.870      
WRMSSE (Last Month): 0.838      
WRMSSE (Last Year): 1.294

Using the previous 4-week period as the forecast produced the lowest WRMSSE, followed closely by using just the previous week. The error increased dramatically when forecasting with the same period from the previous year - most likely due to product lines changing in that time. A store like Walmart continually changes the products it sells, so it is difficult to make an accurate forecast based on last year's sales figures. However, all three seasonal naive models were an improvement over using just the final day as the forecast (WRMSSE=1.46).
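
To put the comparison on a common footing, the sketch below expresses each score as a relative reduction in WRMSSE against the plain naive baseline. The scores are the ones reported above; the variable names are made up for this example.

```python
# Scores reported above.
baseline = 1.46378   # plain naive forecast (Public Leaderboard)
scores = {
    "Last Week": 0.870,
    "Last Month": 0.838,
    "Last Year": 1.294,
}

# Relative reduction in WRMSSE compared with the plain naive forecast.
improvement = {name: (baseline - s) / baseline for name, s in scores.items()}

for name, gain in improvement.items():
    print(f"{name}: {gain:.1%} lower WRMSSE than the naive baseline")

best = min(scores, key=scores.get)
```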

The naive or seasonal naive forecast is often used as a baseline: it's a very simple technique and often produces quite a good result. More complex forecasting techniques can be compared to the baseline to see how much of an improvement they make, though generally this comes with increased computation time. We'll dig deeper into the M5 dataset in future tutorials on the Adaptations of Croston's Method, ADIDA and Intermittent Demand and Multiple Temporal Aggregation.