Kaggle's M5 forecasting competition uses a weighted adaptation of the RMSSE as its measure of accuracy. This guide explains why each series is weighted, how the WRMSSE is calculated, and introduces a Python package for calculating it locally - instead of having to upload a submission file to Kaggle.
Recall that the RMSSE allows us to compare the accuracy of series at different scales. The WRMSSE builds further on this by assigning a weight to each RMSSE value - but why?
Consider a car dealership that also sells hanging air fresheners over the counter. Now, let's say we have a forecasting model that can accurately predict the sale of air fresheners, but doesn't have much predictive power for the sale of cars. This model isn't going to be of much use to the business, as the majority of its revenue comes from the sale of cars. To help tune the forecasting model, we can assign a weight to each series that reflects its importance. High-value items can have large weights so that their error is amplified - a big red warning that we're on the wrong track.
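To make that concrete, here's a toy sketch with made-up numbers: two series, one carrying 95% of the revenue, and a model that forecasts the low-revenue series well but the high-revenue one poorly.
import numpy as np
# Made-up RMSSE values: [cars, air fresheners]
rmsse_toy = np.array([1.8, 0.3])
# Made-up revenue shares used as weights
weights_toy = np.array([0.95, 0.05])
print(np.mean(rmsse_toy))               # unweighted mean: 1.05
print(np.sum(weights_toy * rmsse_toy))  # weighted score: 1.725
The unweighted mean looks respectable, but the weighted score surfaces the poor car forecasts.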
As a brief overview, the M5 WRMSSE uses a weighted score that considers the cumulative dollar amount for each product - the higher the sales revenue, the more important it becomes. Additionally, there are 12 levels over which the data is aggregated. For example, one level looks at the total revenue for each store and assigns a weight accordingly (a business will be more interested in accurately predicting sales for a high-revenue mega-store than for one of its tiny corner shops).
It will become clearer as we work through this, but at any point you can refer to the M5 competitors guide if you need a more detailed explanation.
The M5 dataset contains the sales of 30,490 products tracked over 1,941 days. We'll need to use the sales_train_evaluation.csv file and split it into a train/test set. Let's read the csv file to see what we're working with.
import pandas as pd
import numpy as np
sales = pd.read_csv('data/sales_train_evaluation.csv')
sales.head()
The first columns are categorical information: the item identifier and the state, store, department, etc. it belongs to. The remaining columns are sales for each day, beginning on day 1 (d_1) and finishing on day 1,941 (d_1941).
We will split the daily sales into train and test sets. The forecast will be simple: repeat the final 28 days of sales from the training set (a naive forecast).
h = 28 # forecasting horizon
train = sales.iloc[:,:-h]
test = sales.iloc[:,-h:]
forecast = train.iloc[:,-h:] # 28-day naive forecast
As a quick sanity check, we can have a look at the dimensions of the train, test and forecast sets.
print('Training set: %s\nTest set: %s\nForecast: %s' %
(train.filter(like='d_').shape, test.shape, forecast.shape))
As expected, we have a training set of 30,490 products with sales for 1,913 days. The test and forecast sets both span 28 days.
We can now calculate the RMSSE. The rmsse() function below is taken from the RMSSE tutorial and returns the RMSSE for each row.
def rmsse(train, test, forecast):
    # Numerator: MSE of the forecast over the 28-day horizon, per series
    forecast_mse = np.mean((test - forecast)**2, axis=1)
    # Denominator: MSE of the one-step naive forecast over the training
    # history, after trimming the leading/trailing zeros of each series
    train_mse = [np.mean(np.diff(np.trim_zeros(row))**2) for row in train]
    return np.sqrt(forecast_mse / train_mse)
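As a quick sanity check, here's a toy example (made-up numbers) we can verify by hand: the training series [0, 2, 4, 4, 6] trims to [2, 4, 4, 6], whose one-step squared differences average to 8/3; the forecast MSE is 4, so the RMSSE is sqrt(4 / (8/3)) ≈ 1.22.
toy_train = np.array([[0, 2, 4, 4, 6]])
toy_test = np.array([[8, 8]])
toy_forecast = np.array([[6, 6]])
print(rmsse(toy_train, toy_test, toy_forecast))  # [1.2247...]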
rmsse_vals = rmsse(train.filter(like="d_").values, test.values, forecast.values)
print(rmsse_vals)
print('Number of values: %d' % len(rmsse_vals))
print('RMSSE mean: %.3f' % np.mean(rmsse_vals))
We have 30,490 RMSSE values, one for each product/row in the dataset. Out of interest, if we take the mean of these values we end up with a value of 1.032.
However, to calculate the WRMSSE as per the M5 competitors guide, we perform
$$WRMSSE = \sum_{i=1}^{42{,}840} w_i \cdot RMSSE_i$$
From the formula, we need to apply a weight to each RMSSE value and sum over the 42,840 series... but we only have 30,490! So before we can calculate the WRMSSE, we first need to aggregate the dataset over 12 different levels.
The below table is taken from the M5 guide, which shows the 12 levels of aggregation we need to perform.
Level | Aggregation | Number of series
---|---|---
1 | Unit sales of all products, aggregated for all stores/states | 1
2 | Unit sales of all products, aggregated for each state | 3
3 | Unit sales of all products, aggregated for each store | 10
4 | Unit sales of all products, aggregated for each category | 3
5 | Unit sales of all products, aggregated for each department | 7
6 | Unit sales of all products, aggregated for each state and category | 9
7 | Unit sales of all products, aggregated for each state and department | 21
8 | Unit sales of all products, aggregated for each store and category | 30
9 | Unit sales of all products, aggregated for each store and department | 70
10 | Unit sales of product x, aggregated for all stores/states | 3,049
11 | Unit sales of product x, aggregated for each state | 9,147
12 | Unit sales of product x, aggregated for each store | 30,490
Total | | 42,840
The 30,490 series need to be aggregated in 12 different ways, based on the categorical information, to get a total of 42,840 series. For example, level 1 is the summation of all product sales across all stores, so we end up with a single series representing total sales. For level 2, we group the data by state first; since there are three states, we end up with three more series. And so on.
Let's look at the training set again.
train.head()
We have the categorical info required for the aggregation stored in the first few columns (item_id, dept_id, cat_id, etc.). We'll set this up in a dictionary where the keys are the level numbers and the values are the columns to group by.
agg_levels = {
'level_1': None,
'level_2': ['state_id'],
'level_3': ['store_id'],
'level_4': ['cat_id'],
'level_5': ['dept_id'],
'level_6': ['state_id', 'cat_id'],
'level_7': ['state_id', 'dept_id'],
'level_8': ['store_id', 'cat_id'],
'level_9': ['store_id', 'dept_id'],
'level_10': ['item_id'],
'level_11': ['item_id', 'state_id'],
'level_12': ['id']
}
We need to iterate over each level, group by the given categories and then perform a summation. The first level's value is None, meaning we won't group at all - it's a total aggregation.
def aggregation(df):
    df_agg = pd.DataFrame([])
    for v in agg_levels.values():
        if v is None:
            # Level 1: sum every day column into a single 'Total' series
            agg = df.sum(numeric_only=True)
            agg = pd.DataFrame([agg], index=['Total'])
        else:
            # numeric_only keeps just the day columns (drops the id columns)
            agg = df.groupby(by=v).sum(numeric_only=True)
        df_agg = pd.concat([df_agg, agg])
    return df_agg
The training set can simply be passed to the aggregation() function.
train_agg = aggregation(train)
train_agg.head()
The first row is the total summation (level 1), followed by three rows where the data has been grouped by state (level 2). We can check that we have the 42,840 required rows.
train_agg.shape
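If you'd like to check the per-level counts against the table above, here's a quick sketch that reuses the agg_levels dictionary:
for k, v in agg_levels.items():
    # level 1 has no grouping, so it is always a single series
    n = 1 if v is None else train.groupby(by=v).ngroups
    print('%s: %d series' % (k, n))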
We can repeat this aggregation process on the test and forecast sets. First we need to insert the categorical columns and then pass them to the aggregation() function.
# Concatenate with categorical columns
test = pd.concat((sales.loc[:,:'state_id'],test), axis=1)
forecast = pd.concat((sales.loc[:,:'state_id'],forecast), axis=1)
# Perform aggregation over the 12 levels
test_agg = aggregation(test)
forecast_agg = aggregation(forecast)
# Calculate RMSSE with aggregated datasets
rmsse_vals_agg = rmsse(train_agg.values,
test_agg.values,
forecast_agg.values)
print(rmsse_vals_agg)
print('Number of values: %d' % len(rmsse_vals_agg))
print('RMSSE mean: %.3f' % np.mean(rmsse_vals_agg))
Great, we have 42,840 RMSSE values now. Let's have a look at the WRMSSE formula again.
$$WRMSSE = \sum_{i=1}^{42{,}840} w_i \cdot RMSSE_i$$
All we need to do is multiply each RMSSE by an appropriate weight, $w_i$.
The weighting for each row is going to be based on its proportion of total revenue. Large weights will be assigned to high-revenue products/series, as a business will be more concerned with accurately predicting those. Going back to the car dealership example, a large weight would be applied to the error in car sales, and a low weight to the sales of air fresheners.
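Concretely (we'll build this up step by step below), the weight of series $i$ is its share of dollar revenue over the final 28 days of the training data, divided by 12 so that the weights across all 12 aggregation levels sum to 1:
$$w_i = \frac{1}{12} \cdot \frac{revenue_i}{\sum_{j} revenue_j}$$
where the sum runs over the series at the same aggregation level.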
To be able to calculate the weights, we need to know the revenue. At the moment, we only have unit sales. Each unit sale will need to be multiplied by the corresponding sale price. We need the data stored in calendar.csv and sell_prices.csv. We'll also go back to the original 30,490 series in sales_train_evaluation.csv.
calendar = pd.read_csv('data/calendar.csv')
sales = pd.read_csv('data/sales_train_evaluation.csv')
prices = pd.read_csv('data/sell_prices.csv')
First, let's grab the unit sales for the last month of the training period and assign them to sales_lm. The final 28 days in sales_train_evaluation.csv are the test set, so we need the 28 days prior to that period.
h = 28
sales_lm = sales.iloc[:,-2*h:-h].set_index(sales['id'])
sales_lm.head()
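As a quick check that we've sliced the right window, the first and last columns should be d_1886 and d_1913:
print(sales_lm.columns[0], sales_lm.columns[-1])  # d_1886 d_1913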
We have the sales figures for each day; now we need to find their corresponding prices. We can get these by combining the information stored in the prices and calendar files.
prices.head()
calendar.head()
We have to do a bit of matching up here. The sell price for each product is given by week, wm_yr_wk. We can find which days correspond to which weeks in the calendar file. For example, day d_1 belongs to wm_yr_wk 11101.
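If you want to verify that mapping yourself, a direct lookup does the trick:
calendar.loc[calendar['d'] == 'd_1', ['d', 'wm_yr_wk']]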
We're only concerned with the days d_1886 to d_1913, which are the column names in sales_lm. So we'll take a slice of the calendar dataframe and assign it to wm_d.
wm_d = calendar[calendar['d'].isin(sales_lm.columns)]
wm_d[['wm_yr_wk','d']].head()
Likewise, we can take a slice of the prices dataframe so we're only working with the weeks that correspond to d_1886 to d_1913.
prices_lm = prices[prices['wm_yr_wk'].isin(wm_d['wm_yr_wk'].unique())]
prices_lm.head()
We can take the prices_lm dataframe, merge it with wm_d to get the days, pivot the table so the days become columns, and finally merge with the sales dataframe to recover the id index.
prices_lm = (prices_lm
             # attach the day labels to each weekly price
             .merge(wm_d, how='left', on='wm_yr_wk')
             # one row per product, one column per day
             .pivot_table(index=['store_id', 'item_id'],
                          columns='d', values='sell_price')
             # recover the 'id' column from the sales dataframe
             .merge(sales[['id', 'item_id', 'store_id']],
                    how='left', on=['item_id', 'store_id'])
             .set_index('id')
             .drop(['item_id', 'store_id'], axis=1)
             # match the row order of sales_lm
             .reindex(index=sales_lm.index)
             )
prices_lm.head()
Great, we now have the sell prices for each product for each day.
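Before multiplying, it's worth confirming that prices_lm lines up with sales_lm row for row and day for day (a defensive check, not strictly required):
assert prices_lm.shape == sales_lm.shape
assert (prices_lm.index == sales_lm.index).all()
assert (prices_lm.columns == sales_lm.columns).all()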
The next step is to do an element-wise multiplication of sales_lm and prices_lm to get the revenue on each day, followed by a row-wise summation to get the total revenue for each id.
revenue_lm = (sales_lm * prices_lm).sum(axis=1)
revenue_lm.head()
The weights are going to be the percentage of total revenue, so we take each row and divide it by the sum of all rows.
revenue_pct = revenue_lm / revenue_lm.sum()
revenue_pct.head()
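Since these are proportions of the total, they should sum to 1:
print('%.5f' % revenue_pct.sum())  # 1.00000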
The last step will be to aggregate the weights over the 12 levels. To do this, we need to merge the revenue_pct series with the categorical data (item_id, state_id, dept_id, etc.).
revenue_pct = (revenue_pct
.to_frame(name='revenue_pct')
.reset_index()
.merge(sales.loc[:,:'state_id'], how='left', on='id')
)
revenue_pct.head()
Now we can perform the aggregation. Each level is going to sum to 1, so we set the weight of the first level to be equal to 1. For the remaining levels, we group by the categorical data and do a summation.
weights = pd.Series([], dtype='float64')
for v in agg_levels.values():
    if v is None:
        # Level 1 is the single 'Total' series, so its weight is 1
        w = pd.Series([1])
    else:
        # select the column first so only revenue_pct gets summed
        w = revenue_pct.groupby(by=v)['revenue_pct'].sum()
    weights = pd.concat([weights, w])
weights.head()
We've done 12 levels of aggregation, with each level summing to a weight of 1. For the full set of weights to sum to 1, we need to divide them by 12.
weights = weights / 12
print('Number of weights: %s' % len(weights))
print('Sum of all weights: %.5f' % weights.sum())
We now have 42,840 weights, and we previously calculated 42,840 RMSSE values in the same order (both loops walk agg_levels identically, and groupby sorts the group keys the same way), so we can multiply them element-wise. Almost there!
$$WRMSSE = \sum_{i=1}^{42{,}840} w_i \cdot RMSSE_i$$
All that's left to do is multiply the weights and RMSSE arrays element-wise and take a final sum.
wrmsse = (weights * rmsse_vals_agg).sum()
print('WRMSSE: %.3f' % wrmsse)
There it is, a final WRMSSE of 0.838 for the 28-day naive forecast.
I've put together a package to make life easier - the m5-wrmsse project on GitHub. You can also install it using pip:
pip install m5-wrmsse
Once installed, you can import the wrmsse function.
from m5_wrmsse import wrmsse
The wrmsse function takes a 28-day forecast as a numpy array of shape (30490, 28). We can test this on a naive forecast using the sales_train_validation.csv file.
sales = pd.read_csv('data/sales_train_validation.csv')
naive_forecast = sales.iloc[:,-h:].values
print(naive_forecast.shape)
We can now pass the naive_forecast array to the wrmsse function to evaluate the score.
wrmsse(naive_forecast)
Easy!