How to solve the Apriori algorithm in a simple way from scratch?

Melanee Group · Published in Level Up Coding · 8 min read · Jun 19, 2022

Note: All the images, tables, calculations, and code in this article were produced by me, so no external references are needed for them.

Introduction

There are several approaches to machine learning, such as association, correlation, classification, and clustering; this tutorial focuses on learning with association rules. Association rules identify sets of items or attributes that occur together in a table[1].

Association Rule Learning

Association rule learning is one of the most important concepts in machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it with the example of a supermarket, where products that are frequently purchased together are placed together[2].

Association rule learning can be divided into three types of algorithms[2]:

  1. Apriori
  2. Eclat
  3. F-P Growth Algorithm

Introduction to APRIORI

Apriori is an algorithm used for association rule learning. It searches for frequent itemsets in a dataset and builds on the associations and correlations between them. It is the algorithm behind the “You may also like” suggestions you commonly see on recommendation platforms[3].

Figure 1. Apriori [3]

What is an Apriori algorithm?

Apriori algorithm assumes that any subset of a frequent itemset must be frequent. Say, a transaction containing {milk, eggs, bread} also contains {eggs, bread}. So, according to the principle of Apriori, if {milk, eggs, bread} is frequent, then {eggs, bread} must also be frequent [4].
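This principle (often called downward closure) can be sketched in a few lines of Python. The itemset below is the toy example from the text, not a real dataset:

```python
from itertools import combinations

# Toy illustration of the Apriori principle: if an itemset is frequent,
# then every one of its non-empty proper subsets must be frequent too.
frequent_itemset = {"milk", "eggs", "bread"}

# Enumerate every non-empty proper subset of the frequent itemset.
subsets = [
    frozenset(combo)
    for size in range(1, len(frequent_itemset))
    for combo in combinations(frequent_itemset, size)
]

# By the Apriori principle, each subset, e.g. {eggs, bread}, must also be frequent.
print(frozenset({"eggs", "bread"}) in subsets)  # True
```

Apriori exploits this in reverse: any candidate itemset with an infrequent subset can be pruned without counting it.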

How Does the Apriori Algorithm Work?

In order to select the interesting rules from the many possible ones in this small business scenario, we will use the following measures[4]:

  • Support
  • Confidence
  • Lift
  • Conviction
Figure 2. How the Apriori algorithm works [7]

Support

The support of item x is the ratio of the number of transactions in which item x appears to the total number of transactions.
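As a minimal sketch, support can be computed directly from this definition. The four baskets below are hypothetical, not the article's dataset:

```python
# Hypothetical transaction list for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(support({"milk"}, transactions))           # 0.75 (3 of 4 baskets)
print(support({"milk", "bread"}, transactions))  # 0.5  (2 of 4 baskets)
```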

Confidence

Confidence (x => y) signifies the likelihood of item y being purchased when item x is purchased. This measure takes into account the popularity of item x.

Lift

Lift (x => y) measures the ‘interestingness’ of a rule: the likelihood of item y being purchased when item x is sold. Unlike confidence (x => y), this measure takes into account the popularity of item y.

  • Lift (x => y) = 1 means that there is no correlation within the itemset.
  • Lift (x => y) > 1 means that there is a positive correlation within the itemset, i.e., products in the itemset, x and y, are more likely to be bought together.
  • Lift (x => y) < 1 means that there is a negative correlation within the itemset, i.e., products in itemset, x and y, are unlikely to be bought together.

Conviction

Conviction of a rule can be defined as follows:

Figure 3. The formula of Conviction [4]

Its values range over [0, +∞).

  • Conv(x => y) = 1 means that x has no relation with y.
  • The greater the conviction, the higher the interest in the rule.
Figure 4. Formulae for support, confidence and lift for the association rule X ⟹ Y [5]
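The three rule measures above can be sketched on top of the support definition. The baskets are again hypothetical, and the conviction formula used is the standard one, (1 − supp(y)) / (1 − conf(x => y)):

```python
# Hypothetical transaction list for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    # Likelihood of buying y given that x was bought.
    return support(x | y) / support(x)

def lift(x, y):
    # Confidence normalised by the popularity of y.
    return confidence(x, y) / support(y)

def conviction(x, y):
    # (1 - supp(y)) / (1 - conf(x => y)); infinite when confidence is 1.
    c = confidence(x, y)
    return float("inf") if c == 1 else (1 - support(y)) / (1 - c)

x, y = {"milk"}, {"bread"}
print(round(confidence(x, y), 3))  # 0.667
print(round(lift(x, y), 3))        # 0.889 (< 1: slight negative correlation)
print(round(conviction(x, y), 3))  # 0.75
```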

Now, we want to solve a problem of the Apriori algorithm in a simple way:

Part(a): Apply the Apriori algorithm to the following data set:

Figure 5. The set of items including milk, bread, egg, cookie, coffee and juice

Step-1:

In the first step, we index the data and then calculate the support for each item; if the support is less than the minimum value, we eliminate that item from the table.
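This first pass can be sketched in Python. The transactions and the minimum support threshold below are hypothetical stand-ins for the table in Figure 6:

```python
from collections import Counter

# Hypothetical transactions and threshold, illustrating Step-1:
# count each 1-itemset and prune those below the minimum support.
transactions = [
    {"milk", "bread", "egg"},
    {"bread", "cookie", "coffee"},
    {"milk", "bread", "cookie"},
    {"bread", "egg", "juice"},
]
min_support = 0.5  # hypothetical minimum support

counts = Counter(item for basket in transactions for item in basket)
n = len(transactions)

# Keep only the items whose support reaches the threshold.
frequent_1 = {item: c / n for item, c in counts.items() if c / n >= min_support}
print(frequent_1)  # coffee and juice fall below 0.5 and are eliminated
```

The surviving 1-itemsets are then combined into candidate 2-itemsets, and the same support check repeats in the next step.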

Figure 6. Index the data

Step-2:

Calculate the support for each one

Figure 7. Calculate the support for each one

Step-3:

Continue to calculate the support and select the best answer

Figure 8. Continue to calculate the support and select the best answer

Part(b): Show two rules that have a confidence of 70% or greater for an itemset containing three items from part a.

Step-1:

Calculate the confidence and apply the condition required in part (b)

Figure 9. Calculate the confidence

Step-2:

In addition to the rules above, the following can also be considered, but the question only requires two rules.

Figure 10. Rules that have a confidence of 70% or greater

Hands-on: Apriori Algorithm in Python- Market Basket Analysis

Problem Statement:

For the implementation of the Apriori algorithm, we are using data collected from a SuperMarket, where each row indicates all the items purchased in a particular transaction.

The manager of a retail store is trying to find out an association rule between items, to figure out which items are more often bought together so that he can keep the items together in order to increase sales.
The dataset has 7,500 entries. Drive link to download dataset[4][6].

Environment Setup:

Before we move forward, we need to install the ‘apyori’ package from the command prompt.

Figure 11. Environment Setup
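The package is published on PyPI, so a plain pip install (shown in Figure 11) is enough:

```shell
pip install apyori
```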

Market Basket Analysis Implementation within Python

With the help of the apyori package, we will be implementing the Apriori algorithm in order to help the manager in market basket analysis [4].

Figure 12. Which items to keep together? [4]

Step-1: We import the necessary libraries required for the implementation

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step-2: Load the dataset

Now we have to proceed by reading the dataset we have, that is in a csv format. We do that using pandas module’s read_csv function [6].

dataset = pd.read_csv("Market_Basket_Optimisation.csv", header = None)  # the csv has no header row

Step-3: Take a glance at the records

dataset
Figure 13. Take a glance at the records

Step-4: Look at the shape

dataset.shape
Figure 14. Dataset shape

Step-5: Convert Pandas DataFrame into a list of lists

transactions = []
for i in range(0, len(dataset)):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

Step-6: Build the Apriori model

We import the apriori function from the apyori module. We store the resulting output from apriori function in the ‘rules’ variable.
To the apriori function, we pass 6 parameters:

  1. The transactions list as the main input
  2. Minimum support, which we set to 0.003. We get this value by considering that a product should appear in at least 3 transactions a day; since our data covers a week of 7,500 transactions, the threshold is 3*7/7500 ≈ 0.0028, rounded to 0.003
  3. Minimum confidence, which we choose to be 0.2 (obtained by analyzing various results)
  4. Minimum lift, which we set to 3
  5. Minimum length, set to 2: we calculate lift values for buying an item B given that another item A is bought, so we take 2 items into consideration
  6. Maximum length, likewise set to 2[6].
from apyori import apriori
rules = apriori(transactions = transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2, max_length = 2)

Step-7: Store the rules as a list

results = list(rules)

Step-8: Have a glance at the rules

results
Figure 15. We print out the results as a List
Figure 16. Market basket analysis [4]

Step-9: Visualizing the results

In the lhs variable, we store the first item of each rule; the second item, the one bought given that the first is already bought, goes into the rhs variable.
The supports, confidences and lifts lists collect the corresponding support, confidence and lift values from the results [6].

def inspect(results):
    # Each apyori result holds (itemset, support, ordered_statistics);
    # ordered_statistics[0] contains the antecedent, consequent, confidence and lift.
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

resultsinDataFrame = pd.DataFrame(inspect(results), columns = ["Left hand side", "Right hand side", "Support", "Confidence", "Lift"])

Finally, we store these variables into one dataframe, so that they are easier to visualize.

resultsinDataFrame
Figure 17. Variables into one dataframe

Now, we sort these final outputs in the descending order of lifts.

resultsinDataFrame.nlargest(n = 10, columns = "Lift")
Figure 18. Sort these final outputs

This is the final result of our Apriori implementation in Python. The supermarket can use this output to boost its sales, prioritizing offers on the pairs of items with the greatest lift values [6].

Why Apriori?

  1. It is an easy-to-implement and easy-to-understand algorithm.
  2. It can be easily implemented on large datasets.

Limitations of Apriori Algorithm

Despite being simple, the Apriori algorithm has some limitations, including:

  • It can be time-consuming when handling a large number of candidate frequent itemsets.
  • Its efficiency drops when a large number of transactions must be processed with limited memory capacity.
  • It requires high computational power and needs to scan the entire database[4].

Summary

Figure 19. Flowchart of Apriori algorithm[8]

Association rule learning is an unsupervised learning technique that checks for dependencies between data items and maps them so they can be exploited profitably. It tries to find interesting relations or associations among the variables of the dataset, using different rule measures to discover those relations in the database. The flowchart above summarizes the entire working of the algorithm[2].

The complete code is available in my GitHub repository.

References:

[1] https://www.softwaretestinghelp.com/apriori-algorithm/

[2] https://www.javatpoint.com/association-rule-learning

[3] https://towardsdatascience.com/underrated-machine-learning-algorithms-apriori-1b1d7a8b7bc

[4] https://intellipaat.com/blog/data-science-apriori-algorithm/

[5] S. Yaman, F. Fagerholm, M. Munezero, T. Männistö, “Patterns of user involvement in experiment-driven software development,” Information and Software Technology, December 2019. https://www.journals.elsevier.com/information-and-software-technology

[6] https://djinit-ai.github.io/2020/09/22/apriori-algorithm.html#understanding-our-used-case

[7] https://www.datacamp.com/tutorial/market-basket-analysis-r

[8] https://www.researchgate.net/figure/Flowchart-of-Apriori-algorithm_fig2_351361530

Writer: Parisan Ahmadi

Contact: Medium, Github, Kaggle
