Finding seasonality in sales data
An intermediate topic in SQL with the cosmos
“Mars is there, waiting to be reached.” — BUZZ ALDRIN
Last time in LAG functionality on SQL Data, we imagined traveling to Mars and visiting the base of (one of many mountains), Olympus Mons where a food stand was selling astronaut ice cream. In the example from the post, I used imaginary data to showcase the LAG
window function with the sales
SQL data table.
This post is going to use the LAG
as well as the LEAD
window functions to solve an algorithmic problem I came across on LeetCode. I will introduce LEAD
functionality, as well as solve the problem: display the records in the sales
table which have three or more consecutive sales values equal to or greater than a given number. Let’s begin.
Updating the data
Recall the data we used last time had five records of net_sales
; each net sale for a month that the food stand was open. The table, called sales
, looked like this:
For the purpose of this blog, I am going to add more data — specifically, ten more records. This represents ten additional months where the net_sales
of the food stand were recorded. To do that, I use INSERT
:
INSERT INTO sales
VALUES (6, 9343),
(7, 9123),
(8, 8770),
(9, 6780),
(10, 4002),
(11, 4999),
(12, 5673),
(13, 6002),
(14, 7109),
(15, 7994);
A quick view of the table (with SELECT
) shows fifteen records, each representing a month, with values for net_sales
:
Perfect. Now we have more records to query.
LEAD window function
The LEAD
window function is similar to the LAG
function. Recall that with the LAG
function we can access a row at a specific offset that comes before the current row. Let’s review an example:
Suppose I want to look at the net sales from the previous month while at the same time viewing the current month. If the current row is the 4th
month
and thenet_sales
value is 7319, then with an offset of1
, I expect to see 5618 in the next column for the 4th row.
Similarly, with the LEAD
window function, we can access a row at a specific offset that comes after the current row. Similar to the previous example:
Suppose I want to look at the net sales from the following month while at the same time viewing the current month. If the current row is the 4th
month
and thenet_sales
value is 7319, then with an offset of1
, I expect to see 9282 in the next column for the 4th row.
Applying this to the first five rows of sales
:
The Problem
Now we are equipped with some handy tricks to solve the problem where the owner of the food stand wants to know some information about the seasonality of her sales. Specifically, she wants to know if there are three or more consecutive months that have sale values equal to or greater than 7500.
There are a handful of methods to solve this problem, but here is a solution using LAG
and LEAD
window functions:
SELECT
s.months
, s.net_sales
FROM
(SELECT months
, net_sales
, LAG(net_sales, 2) OVER(ORDER BY months) before_month
, LAG(net_sales, 1) OVER(ORDER BY months) last_month
-- current month sales is net_sales
, LEAD(net_sales, 1) OVER(ORDER BY months) second_month
, LEAD(net_sales, 2) OVER(ORDER BY months) third_month
FROM sales) s
WHERE (s.net_sales>=7500 AND s.second_month>=7500 AND s.third_month>=7500)
OR (s.before_month>=7500 AND s.last_month>=7500 AND s.net_sales>=7500)
OR (s.last_month>=7500 AND s.net_sales>=7500 AND s.second_month>=7500)
Wow — that’s a complex query. Let’s break it down.
First, notice the subquery:
(SELECT months
, net_sales
, LAG(net_sales, 2) OVER(ORDER BY months) before_month
, LAG(net_sales, 1) OVER(ORDER BY months) last_month
-- current month sales is net_sales
, LEAD(net_sales, 1) OVER(ORDER BY months) second_month
, LEAD(net_sales, 2) OVER(ORDER BY months) third_month
FROM sales) s
- In it, we are creating a new view that has new columns defined by
LAG
andLEAD
. - The first five rows of this temporary table are:
Second, notice the WHERE
clause:
WHERE (s.net_sales>=7500 AND s.second_month>=7500 AND s.third_month>=7500)
OR (s.before_month>=7500 AND s.last_month>=7500 AND s.net_sales>=7500)
OR (s.last_month>=7500 AND s.net_sales>=7500 AND s.second_month>=7500)
- This recognizes that there are three situations that could have “three consecutive months.”
- First situation: Where the current record for
net_sales
is the 1st month and two months of high sales follow. - Second situation: Where the current record for
net_sales
is the 3rd month and two months of high sales occur before it. - Third and final situation: Where the current record for
net_sales
is the 2nd month and a month of high sales occur before it and another month of high sales occurs after.
Putting it all together, we are querying the table created by the the subquery, s
, and looking for three or more consecutive rows that have three values greater than 7500.
Result:
So, it seems that months 5 through (and including) 8 have peak sales.
That’s it for this fake data, folks! Another addition to your SQL toolbox.
For your reference, I have a series of SQL tutorials that cover basic and intermediate queries. Take a look!
- Black holes, planets, and SQL
- Touring the Solar System with SQL
- Searching for moons with SQL
- Classifying black holes with SQL
- Joining Constellations with SQL
- Solar System classifications with SQL
- Many (to many) relations among the stars
- Creating tables in SQL
- A window to the solar system
Like these SQL posts? Have a question? Let’s chat in the comments!