Level Up Coding

Coding tutorials and news. The developer homepage gitconnected.com && skilled.dev && levelup.dev

Follow publication

Candy Demand Forecasting: A Step-by-Step Guide

Chinonso Nnaji
Level Up Coding
Published in
14 min readDec 20, 2024
Photo by Jason Leung on Unsplash

Introduction

Data Understanding

Hello there!

Data Preparation

# Let's import the required libraries
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical operations
import matplotlib.pyplot as plt # For creating visualizations
import seaborn as sns # For advanced visualizations

import warnings # To manage warnings during analysis
warnings.filterwarnings("ignore") # Ignore warnings to keep the notebook clean

# Additional setup to ensure our plots are visible
plt.style.use("ggplot") # Styling plots to be easier on the eyes
sns.set_palette("deep") # Setting a default color palette for better visualization aesthetics

# Let's import the Candy sales data
candy_sales_data = pd.read_csv("Candy_Sales.csv")
# Display the first few rows of the candy sales dataset
candy_sales_data.head()
Image by the author.
# Checking the shape of the dataset to understand the number of rows and columns
candy_sales_data.shape
Image by the author.
# Checking the information about the dataset to understand data types, non-null counts, and memory usage
candy_sales_data.info()
Image by the author.
# Check for duplicate rows in the dataset
duplicate_rows = candy_sales_data.duplicated().sum()

duplicate_rows
Image by the author.
# Check for inconsistent or unusual values in key columns
# Verify data types and ranges for numerical columns
numerical_summary = candy_sales_data[["Sales", "Units", "Gross Profit", "Cost"]].describe()

# Check for unusual shipping modes or inconsistent categorical values
unique_shipping_modes = candy_sales_data["Ship Mode"].unique()
unique_countries = candy_sales_data["Country/Region"].unique()
unique_divisions = candy_sales_data["Division"].unique()

# Summarize findings
numerical_summary, unique_shipping_modes, unique_countries, unique_divisions
Image by the author.
# Convert 'Order Date' to datetime format and ensure the data is ready for aggregation
candy_sales_data["Order Date"] = pd.to_datetime(candy_sales_data["Order Date"])

# Aggregate the sales data by month to prepare for demand forecasting
candy_sales_data["Month"] = candy_sales_data["Order Date"].dt.to_period("M")
monthly_sales_data = candy_sales_data.groupby('Month').agg({
"Sales": "sum", # Summing up sales for each month
"Units": "sum" # Summing up units sold for each month
}).reset_index()

# Convert 'Month' back to a datetime format for easier analysis
monthly_sales_data["Month"] = monthly_sales_data["Month"].dt.to_timestamp()

# Display the prepared data
monthly_sales_data
Image by the author.
Image by the author.

Exploratory Data Analysis (EDA)

Monthly Sales Trend:

# Plot sales trends over time to explore patterns and seasonality
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales_data["Month"], monthly_sales_data["Sales"], marker="o", linestyle="-", label="Sales ($)")
plt.title("Monthly Sales Trend", fontsize=16)
plt.xlabel("Month", fontsize=14)
plt.ylabel("Total Sales ($)", fontsize=14)
plt.grid(True)
plt.legend()
plt.show()
Image by the author.

Monthly Units Sold Trend

# Plot units sold trends over time
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales_data["Month"], monthly_sales_data["Units"], marker="o", linestyle="-", label="Units Sold")
plt.title("Monthly Units Sold Trend", fontsize=16)
plt.xlabel("Month", fontsize=14)
plt.ylabel("Units Sold", fontsize=14)
plt.grid(True)
plt.legend()
plt.show()
Image by the author.

Distribution of Monthly Sales

# Visualization of Sales Distribution
plt.figure(figsize=(10, 6))
plt.hist(monthly_sales_data["Sales"], bins=15, color="skyblue", edgecolor="black")
plt.title("Distribution of Monthly Sales", fontsize=16)
plt.xlabel("Sales ($)", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()
Image by the author.

Units Sold vs. Sales

# Scatter Plot: Units Sold vs Sales to check correlation
plt.figure(figsize=(10, 6))
plt.scatter(monthly_sales_data["Units"], monthly_sales_data["Sales"], color="skyblue", alpha=0.7)
plt.title("Units Sold vs Sales", fontsize=16)
plt.xlabel("Units Sold", fontsize=14)
plt.ylabel("Sales ($)", fontsize=14)
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()
Image by the author.

Feature Engineering

# Extracting time-based features
monthly_sales_data["Year"] = monthly_sales_data["Month"].dt.year
monthly_sales_data["Quarter"] = monthly_sales_data["Month"].dt.quarter
monthly_sales_data["Month_Num"] = monthly_sales_data["Month"].dt.month
monthly_sales_data["Season"] = monthly_sales_data["Month"].dt.month % 12 // 3 + 1 # 1=Winter, 2=Spring, etc.

# Rolling statistics (3-month rolling average for Sales and Units)
monthly_sales_data["Rolling_Sales"] = monthly_sales_data["Sales"].rolling(window=3).mean()
monthly_sales_data["Rolling_Units"] = monthly_sales_data["Units"].rolling(window=3).mean()

# Step 3: Lagged Features (Sales and Units from the previous month)
monthly_sales_data["Lagged_Sales"] = monthly_sales_data["Sales"].shift(1)
monthly_sales_data["Lagged_Units"] = monthly_sales_data["Units"].shift(1)

# Step 4: Normalized Metrics
monthly_sales_data["Sales_per_Unit"] = monthly_sales_data["Sales"] / monthly_sales_data["Units"]

# Display the updated dataset
monthly_sales_data.head()
Image by the author.
# Drop rows with missing values as they occur due to rolling and lagging calculations
# These rows are not useful for model training
cleaned_monthly_sales_data = monthly_sales_data.dropna().reset_index(drop=True)

# Display the cleaned dataset for verification
cleaned_monthly_sales_data.head()
Image by the author.

Model Building

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Define the target variable (Sales) and features
target = "Sales"
features = ["Rolling_Sales", "Lagged_Sales", "Rolling_Units", "Lagged_Units", "Sales_per_Unit",
"Quarter", "Month_Num", "Season"]
# Drop rows with any missing data in the selected columns (as a safety measure)
model_data = cleaned_monthly_sales_data.dropna(subset=[target] + features)
# Split into training and testing sets (80% train, 20% test, keeping time-series order)
train_size = int(len(model_data) * 0.8)
train_data = model_data.iloc[:train_size]
test_data = model_data.iloc[train_size:]

X_train, y_train = train_data[features], train_data[target]
X_test, y_test = test_data[features], test_data[target]
# Train a Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

Model Evaluation

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Display evaluation metrics
mae, rmse
Image by the author.
# Visualize Predicted vs Actual Sales
plt.figure(figsize=(12, 6))
plt.plot(test_data["Month"], y_test, marker="o", label="Actual Sales")
plt.plot(test_data["Month"], y_pred, marker="x", linestyle="--", label="Predicted Sales")
plt.title("Predicted vs Actual Sales", fontsize=16)
plt.xlabel("Month", fontsize=14)
plt.ylabel("Sales ($)", fontsize=14)
plt.legend()
plt.grid(True)
plt.show()
Image by the author.
# Extract feature importance from the trained Random Forest model
feature_importance = model.feature_importances_

# Create a DataFrame for visualization
importance_df = pd.DataFrame({
"Feature": features,
"Importance": feature_importance
}).sort_values(by="Importance", ascending=False)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(importance_df["Feature"], importance_df["Importance"], color="skyblue")
plt.title("Feature Importance in Random Forest Model", fontsize=16)
plt.xlabel("Importance", fontsize=14)
plt.ylabel("Feature", fontsize=14)
plt.gca().invert_yaxis() # Invert y-axis for better readability
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.show()
Image by the author.
# Prepare data for future sales forecasting
# Using the last row of cleaned data as the base for new predictions
latest_data = cleaned_monthly_sales_data.iloc[-1]

# Generate future periods (e.g., next 6 months)
future_months = pd.date_range(start=latest_data["Month"], periods=7, freq="MS")[1:]

# Create a DataFrame for future predictions
future_data = pd.DataFrame({
"Month": future_months,
"Quarter": future_months.quarter,
"Month_Num": future_months.month,
"Season": future_months.month % 12 // 3 + 1, # Calculate season
"Rolling_Sales": [latest_data["Rolling_Sales"]] * len(future_months),
"Rolling_Units": [latest_data["Rolling_Units"]] * len(future_months),
"Lagged_Sales": [latest_data["Sales"]] * len(future_months),
"Lagged_Units": [latest_data["Units"]] * len(future_months),
"Sales_per_Unit": [latest_data["Sales_per_Unit"]] * len(future_months)
})
# Predict future sales
future_data["Predicted_Sales"] = model.predict(future_data[features])

# Display the future predictions
future_data[["Month", "Predicted_Sales"]].head()
Image by the author.

Insights

Recommendations:

Conclusion

What’s Next?

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Written by Chinonso Nnaji

Data Science Advocate || Data Scientist, I write Articles on Data Science and Machine Learning ||

No responses yet

Write a response