Create a Demand Sales Forecast Model and Minimize the Error

Federico Riveroll
Published in Level Up Coding · 9 min read · Oct 8, 2020

A step-by-step tutorial in Python.

Sales and demand forecasts are a priority for the Data Science/Analytics departments of a huge number of companies, from startups to global corporations. To say the least, experts in the subject are in short supply. Reducing the error even by a small amount can make a huge difference in revenue or savings.

In this article, we’ll do a simple sales forecast model with real data and then improve it by finding relevant features using Python.

What we’ll do

  • Step 1: Define and understand Data and Target
  • Step 2: Make a Simple Forecast Model
  • Step 3: Improve it by adding News and Financial Indicators
  • Step 4: Analyze Results

Step 1. Define and understand Data and Target

For this article, we’ll use real weekly sales data provided by Walmart.

Walmart released data containing weekly sales for 99 departments (clothing, electronics, food…) in every physical store along with some other added features.

Walmart dataset screenshot

For this, we will create an ML model with ‘Weekly_Sales’ as the target, train on the earliest 70% of the observations, and test on the most recent 30%.

The objective is to minimize the prediction error on future weekly sales.

We’ll add external variables that impact or have a relationship with sales, such as the dollar index, the oil price, and news about Walmart.

Step 2. Make a Simple Forecast Model

First, you need to have Python 3 installed and the following libraries:

$ pip install pandas OpenBlender scikit-learn

Then, open a Python script (preferably a Jupyter notebook) and let’s import the needed libraries.

from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import OpenBlender
import json

Now, let’s define the methodology and model to be used on all experiments.

First, the date range of the data is from Jan 2010 to Dec 2012. Let’s use the earliest 70% of the data for training and the most recent 30% for testing (because we don’t want data leakage in our predictions).

Next, let’s define our standard model as a RandomForestRegressor with 50 estimators, which is a reasonably good option.

Finally, to keep things as simple as possible, let’s define the error as the sum of absolute errors.

Now, let’s put it in a Python class.

class StandardModel:

    model = RandomForestRegressor(n_estimators=50)

    def train(self, df, target):
        # Drop non-numeric columns
        df = df.dropna(axis=1).select_dtypes(['number'])

        # Separate features and target
        X = df.loc[:, df.columns != target].values
        y = df.loc[:, [target]].values

        # We take the first chunk of the data as test and the
        # last as train because the data is ordered descending by date.
        div = int(round(len(X) * 0.29))
        X_train = X[div:]
        y_train = y[div:]
        print('Train Shape:')
        print(X_train.shape)
        print(y_train.shape)

        # Random forest model specification
        self.model = RandomForestRegressor(n_estimators=50)

        # Train on data
        self.model.fit(X_train, y_train.ravel())

    def getMetrics(self, df, target):
        # Function to get the error sum from the trained model
        # Drop non-numeric columns
        df = df.dropna(axis=1).select_dtypes(['number'])

        # Separate features and target
        X = df.loc[:, df.columns != target].values
        y = df.loc[:, [target]].values
        div = int(round(len(X) * 0.29))
        X_test = X[:div]
        y_test = y[:div]
        print('Test Shape:')
        print(X_test.shape)
        print(y_test.shape)

        # Predict on test
        y_pred_random = self.model.predict(X_test)

        # Gather the sum of absolute errors
        error_sum = sum(abs(y_test.ravel() - y_pred_random))
        return error_sum

Above we have a class with 3 elements:

  • model (RandomForestRegressor)
  • train: A function to train that model with a dataframe and a target
  • getMetrics: A function to evaluate the trained model on the test data and retrieve the error

We will use this configuration for all of the experiments, although you can modify it as you like to test different models, optimize parameters, or do feature engineering. The value added by the external features should carry over and could potentially improve.

Now, let’s get the Walmart data. You can download ‘walmartData.csv’ here.

df_walmart = pd.read_csv('walmartData.csv')
print(df_walmart.shape)
df_walmart.head()

There are 421,570 observations. As we said before, each observation is a record of the weekly sales of one department in one store.
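
To get a feel for that structure, you can count the weekly records per store/department pair (a quick sanity check, using the ‘Store’, ‘Dept’ and ‘Date’ columns from the dataset):

# Number of weekly records for each store/department pair
df_walmart.groupby(['Store', 'Dept'])['Date'].count().head()

Each store/department pair contributes one row per week, which is where the 421,570 records come from.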

Let’s plug the data into the model without tampering with it at all.

our_model = StandardModel()
our_model.train(df_walmart, 'Weekly_Sales')
total_error_sum = our_model.getMetrics(df_walmart, 'Weekly_Sales')
print("Error sum: " + str(total_error_sum))
> Error sum: 967705992.5034052

The sum of absolute errors for the complete model, across all predictions vs. the real sales, is $967,705,992.50 USD.

This doesn’t mean much by itself; the only reference point is that the sum of all sales in that period is $6,737,218,987.11 USD.
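
You can verify that reference figure directly from the data (a quick check, not part of the modeling flow):

# Total sales over the whole period
print(df_walmart['Weekly_Sales'].sum())
# > 6737218987.11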

Since there is a whole lot of data, we will only focus on Store #1 for this tutorial, but the methodology is absolutely replicable for all stores.

Let’s take a look at the error generated by Store 1 alone.

# Select store 1
df_walmart_st1 = df_walmart[df_walmart['Store'] == 1]
# Error of store 1
error_sum_st1 = our_model.getMetrics(df_walmart_st1, 'Weekly_Sales')
print("Error sum: " + str(error_sum_st1))
# > Error sum: 24009404.060399983

So, Store 1 alone is responsible for $24,009,404.06 USD of the error, and this will be our threshold for comparison.

Now let’s break down the error by department to have more visibility later.

error_summary = []
for i in range(1, 100):
    try:
        df_dept = df_walmart_st1[df_walmart_st1['Dept'] == i]
        error_sum = our_model.getMetrics(df_dept, 'Weekly_Sales')
        print("Error dept : " + str(i) + ' is: ' + str(error_sum))
        error_summary.append({'dept': i, 'error_sum_normal_model': error_sum})
    except Exception:
        error_sum = 0
        print('No obs for Dept: ' + str(i))
error_summary_df = pd.DataFrame(error_summary)
error_summary_df.head()

Now we have a dataframe with the errors corresponding to each department in Store 1 under our threshold model.
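
If you want to see which departments contribute the most error under the baseline (a quick look, not required for what follows):

# Departments with the largest baseline error first
error_summary_df.sort_values('error_sum_normal_model', ascending=False).head()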

Let’s improve these numbers.

Step 3. Improve by adding News and Financial Indicators

Let’s select department 1 to make a simple example.

df_st1dept1 = df_walmart_st1[df_walmart_st1['Dept'] == 1]

Now, let’s search for intersected datasets.

# First we need to add the UNIX timestamp, which is the number
# of seconds since 1970 in UTC. It is a very convenient
# format because it is the same in every time zone in the world!
df_st1dept1['timestamp'] = OpenBlender.dateToUnix(df_st1dept1['Date'],
                                                  date_format='%Y-%m-%d',
                                                  timezone='GMT')
df_st1dept1 = df_st1dept1.sort_values('timestamp').reset_index(drop=True)
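
If you want to sanity-check that conversion without OpenBlender, the same timestamps can be computed with plain pandas (an equivalence check, not part of the original flow):

# Parse the date strings as UTC and count seconds since 1970-01-01
ts_check = (pd.to_datetime(df_st1dept1['Date'], format='%Y-%m-%d', utc=True)
              .astype('int64') // 10**9)
# Should print True if both conversions agree
print((ts_check.values == df_st1dept1['timestamp'].values).all())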

Now, let’s search in OpenBlender for time intersected (overlapped) datasets about ‘business’ or ‘walmart’.

Note: To get a token you need to create an account on openblender.io (free); you’ll find it in the ‘Account’ tab under your profile icon.

token = 'YOUR_TOKEN_HERE'

print('From : ' + OpenBlender.unixToDate(min(df_st1dept1.timestamp)))
print('Until: ' + OpenBlender.unixToDate(max(df_st1dept1.timestamp)))

# Now, let's search on OpenBlender
search_keyword = 'business walmart'

# We need to pass our timestamp column and
# search keywords as parameters.
OpenBlender.searchTimeBlends(token,
                             df_st1dept1.timestamp,
                             search_keyword)

The search found several datasets. We can see the name, description, url, features and, most importantly, the time intersection with ours, so we can blend them into our dataset.

Let’s start by blending this Walmart Tweets dataset and look for promos.

  • Note: I picked this one because it makes sense, but you can search for hundreds of other ones.

We can blend new columns into our dataset by searching terms in texts or news aggregated by time. For instance, we could create a ‘promo’ feature with the number of mentions that match our self-made n-grams:

text_filter = {'name': 'promo',
               'match_ngrams': ['promo', 'discount', 'cut', 'markdown', 'deduction']}

# blend_source needs the id_dataset and the name of the feature.
blend_source = {
    'id_dataset': '5e1deeda9516290a00c5f8f6',
    'feature': 'text',
    'filter_text': text_filter
}
df_blend = OpenBlender.timeBlend(token=token,
                                 anchor_ts=df_st1dept1.timestamp,
                                 blend_source=blend_source,
                                 blend_type='agg_in_intervals',
                                 interval_size=60 * 60 * 24 * 7,
                                 direction='time_prior',
                                 interval_output='list')

df_anchor = pd.concat([df_st1dept1, df_blend.loc[:, df_blend.columns != 'timestamp']], axis=1)

The parameters for the timeBlend function (you can find the documentation here):

  • anchor_ts: We only need to send our timestamp column so that it can be used as an anchor to blend the external data.
  • blend_source: The information about the feature we want.
  • blend_type: ‘agg_in_intervals’ because we want a 1-week interval aggregation for each of our observations.
  • interval_size: The size of the interval in seconds (60 * 60 * 24 * 7, i.e. 7 days, in this case).
  • direction: ‘time_prior’ because we want the interval to gather observations from the prior 7 days and not forward, to avoid data leakage.
  • interval_output: ‘list’ because we want the aggregated texts returned as a list per interval.

We now have our original dataset with 2 new columns: the ‘COUNT’ of our ‘promo’ feature per interval, and a list of the actual texts in case one wants to iterate through each one.

df_anchor.tail()

So now we have a numerical feature about how many times our n-grams were mentioned. We could probably craft better n-grams if we knew which store or department corresponds to ‘1’ (Walmart didn’t share that).
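
To peek at just the blended columns (the exact column names depend on what timeBlend returns, so the filter below is an assumption based on the ‘promo’ filter name):

# Select only the columns produced by the 'promo' blend
promo_cols = [col for col in df_anchor.columns if 'promo' in col.lower()]
df_anchor[promo_cols].tail()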

Let’s apply the Standard Model and compare the error vs. the original.

our_model.train(df_anchor, 'Weekly_Sales')
error_sum = our_model.getMetrics(df_anchor, 'Weekly_Sales')
error_sum
#> 253875.30

The current model had a $253,875 error while the previous one had $290,037. That’s a 12.5% improvement.

But this doesn’t prove much, it could be that the RandomForest got lucky. After all, the original model trained with over 299K observations. The current one only training with 102!!

We can also blend numerical features. Let’s try blending the Dollar Index, the Oil Price, and Monthly Consumer Sentiment.

# OIL PRICE
blend_source = {
    'id_dataset': '5e91045a9516297827b8f5b1',
    'feature': 'price'
}
df_blend = OpenBlender.timeBlend(token=token,
                                 anchor_ts=df_anchor.timestamp,
                                 blend_source=blend_source,
                                 blend_type='agg_in_intervals',
                                 interval_size=60 * 60 * 24 * 7,
                                 direction='time_prior',
                                 interval_output='avg')
df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis=1)

# DOLLAR INDEX
blend_source = {
    'id_dataset': '5e91029d9516297827b8f08c',
    'feature': 'price'
}
df_blend = OpenBlender.timeBlend(token=token,
                                 anchor_ts=df_anchor.timestamp,
                                 blend_source=blend_source,
                                 blend_type='agg_in_intervals',
                                 interval_size=60 * 60 * 24 * 7,
                                 direction='time_prior',
                                 interval_output='avg')
df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis=1)

# CONSUMER SENTIMENT
blend_source = {
    'id_dataset': '5e979cf195162963e9c9853f',
    'feature': 'umcsent'
}
df_blend = OpenBlender.timeBlend(token=token,
                                 anchor_ts=df_anchor.timestamp,
                                 blend_source=blend_source,
                                 blend_type='agg_in_intervals',
                                 interval_size=60 * 60 * 24 * 7,
                                 direction='time_prior',
                                 interval_output='avg')
df_anchor = pd.concat([df_anchor, df_blend.loc[:, df_blend.columns != 'timestamp']], axis=1)

df_anchor

Now we have 6 more features: the average of the oil price, the dollar index, and consumer sentiment over the 7-day intervals, plus the count of observations for each (which in this case is irrelevant).

Let’s run that model again.

our_model.train(df_anchor, 'Weekly_Sales')
error_sum = our_model.getMetrics(df_anchor, 'Weekly_Sales')
error_sum
#> 223831.9414

Now, we’re down to a $223,831 error. That’s a 24.1% improvement with repect to the original $290,037 !!

Step 4. Analyze Results

To see if the results are consistent, the above was run for the first 10 departments of every store (I’m not including the code because the script is too long, but it literally performs the steps above on each of those departments in every store).

I separated the experiment by each added feature to measure the error reduction from each one (a condensed sketch of the loop follows the list below):

  • 0: Original Error
  • 1: Walmart promo
  • 2: Oil Price
  • 3: Dollar Index
  • 4: Consumer Sentiment
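
Here is roughly what that per-department experiment looks like (a condensed sketch, not the exact script; add_blended_features is a hypothetical helper that wraps the timeBlend calls from Step 3 and returns the dataframe with the first n external features blended in):

results = []
for store in df_walmart['Store'].unique():
    for dept in range(1, 11):  # first 10 departments
        df_sd = df_walmart[(df_walmart['Store'] == store) & (df_walmart['Dept'] == dept)]
        if df_sd.empty:
            continue
        row = {'store': store, 'dept': dept}
        for n_feats in range(5):  # 0 = original, 1 = + promo, ..., 4 = + sentiment
            df_exp = add_blended_features(df_sd, n_feats)  # hypothetical helper
            our_model.train(df_exp, 'Weekly_Sales')
            row['error_' + str(n_feats)] = our_model.getMetrics(df_exp, 'Weekly_Sales')
        results.append(row)
results_df = pd.DataFrame(results)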

The errors of Departments 4 and 6 are an order of magnitude higher than the rest. Let’s remove them to take a closer look at the others.

We can see that on almost all departments the error decreased as we added new features. We can also see that the Oil Price (feature 2) was not only unhelpful but even harmful for some departments.
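
With the results_df from the sketch above, one way to quantify that per feature (again, an illustration on top of the hypothetical results table):

# Error change attributable to the Oil Price blend, per department
results_df['oil_delta'] = results_df['error_2'] - results_df['error_1']
results_df[results_df['oil_delta'] > 0]  # departments where oil made things worse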

In sales forecasting, reducing the error by even a fraction of a percent can mean huge revenue. The 3 main drivers of improvement are feature engineering, model and parameter optimization, and blending useful external features.


M.S. D.S. & Mathematics | Co-founder @OpenBlender | Master's Professor of Data Science