Introduction
This article will introduce you to the concept of Market Basket Analysis, a data mining technique used to identify the relationships between products that customers tend to purchase together. Specifically, I’ll be presenting an implementation of the Apriori algorithm using pure Qlik Sense. By doing so, you’ll be able to explore the possibilities and potential limitations of the Qlik data processing script. You will also learn about many use cases where Market Basket Analysis can be applied to make data-driven decisions and benefit your business.
I will be sharing two examples to illustrate the approach.
- The first example is based on transactional data from receipts.
- The second example involves B2B sales from a building materials manufacturer. In this case, the “transaction” refers to the customer rather than a one-time receipt.
What is Apriori?
The Apriori algorithm produces a set of rules that describe the relationships between different items in your dataset. This output can then be leveraged to identify which items are commonly sold or consumed together, and gain insights into purchasing patterns and customer behavior.
If you’re purchasing instant pudding, chances are you’ll also buy milk.
The algorithm not only identifies the pairs of items that are commonly sold together, but also provides insights into the direction of the association between them.
Example:
If you’re buying instant pudding, there’s a high probability (e.g. 90%) that you’ll also purchase milk.
But:
If you’re buying milk, the likelihood of you also buying instant pudding is relatively low (e.g. 10%).
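In Apriori terms, these percentages are the confidence of each rule, which will be defined formally later on:
Confidence({pudding} -> {milk}) = 90%
Confidence({milk} -> {pudding}) = 10%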
Applications
It’s worth noting that this analysis can be applied beyond just the retail industry. In fact, it can be effective in many other fields where the “transaction” need not be a receipt, but can be something else entirely – a customer id, a geographical location etc.
This information can be useful for businesses to:
- Recommending products “you might be interested in…”.
- Discovering upsell opportunities.
- Optimizing shelf stocking (placing associated products in close proximity).
- Creating promotions and discounts (promoting product A to increase sales of product B).
- Developing pricing strategies (such as bundling products together at a discount).
- Analyzing website navigation (which page is visited after which).
- Optimizing marketing spend (promoting product B to customers who bought product A, while saving money by not promoting in the reverse direction when that association is weak).
- Detecting fraud (analyzing patterns of transactions or actions to spot anomalous behavior).
- Optimizing the supply chain (identifying frequently occurring product combinations).
Is Qlik appropriate?
Although the Apriori algorithm is often considered an advanced modeling technique, and some even classify it as machine learning, it is essentially just statistics and data manipulation. This article will showcase the versatility of the Qlik Sense scripting language by implementing the algorithm entirely in Qlik Sense script.
Side note: There are two main stages when using Qlik Sense for analytics.
The first step involves processing and loading the data, while the second step is opening the application and utilizing its in-memory engine.
When the data is already loaded into a Qlik Sense application, the resulting performance is extremely fast and efficient. Qlik’s in-memory query capabilities are unbeatable and can handle millions or even billions of rows of data with ease.
When it comes to data manipulation, Qlik has its own scripting language – an extra capability compared to most other BI tools on the market. It is quite powerful and can handle common data processing tasks such as data cleansing, joining, splitting, recalculation, and data generation. However, for more complex algorithms like Apriori, especially on larger datasets, Qlik script may not always be the most effective solution; in that case you might consider connecting to an external instance of Python or R. Nonetheless, for smaller datasets, Qlik script is more than capable of handling the job.
Apriori in Qlik Sense Script
Step 1
The script first creates a new table called “Apriori” which includes the columns “TransactionId” and “ItemToCheck”, and loads data from an existing table called “Transactions”. This step essentially prepares the transactional data for the Apriori analysis.
The source dataset must contain a “TransactionId” and the “ItemId”(s) associated with it. The “TransactionId” can be a receipt id in the retail industry, a user id with actions such as movies watched or books read, or a B2B customer id with the items sold to them over a longer period, such as a full year.
Apriori:
Load
TransactionId,
ItemId as ItemToCheck
Resident Transactions
;
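For reference, a minimal “Transactions” source table could be loaded like this before Step 1 (the values below are purely hypothetical sample data):
// Hypothetical sample data – a real source will of course be larger
Transactions:
Load * Inline [
TransactionId, ItemId
1, milk
1, instant pudding
2, milk
2, bread
3, instant pudding
3, milk
];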
Step 2
Next, the script calculates the support (“ItemSupport“) of each individual item. Support is defined as the proportion of transactions that contain a given item.
The support can also be used to filter out infrequent items, especially when dealing with large datasets and encountering memory limits. It has a business significance as well. For example, if an item was only sold in one transaction out of ten thousand, any conclusions drawn from it may not be very meaningful.
Let TotalTransactions = FieldValueCount('TransactionId');
Left Join (Apriori)
Load
ItemToCheck,
Count(DISTINCT TransactionId) / $(TotalTransactions) as ItemSupport
Resident Apriori
Group By
ItemToCheck
;
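If you hit memory limits on a large dataset, the infrequent items can be dropped right after this step. A minimal sketch, assuming a 1% threshold (the threshold value is an assumption and should be tuned to your data):
// Keep only items that appear in at least 1% of all transactions
Inner Join (Apriori)
Load Distinct
ItemToCheck
Resident Apriori
Where ItemSupport >= 0.01
;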
Step 3
The script creates a new table called “CartesianOfItems“, which lists all distinct items in the dataset. This table will be used in the following steps to generate all possible pairs of items.
CartesianOfItems:
Load distinct
ItemToCheck as ItemOneInPair
Resident Apriori
;
Step 4
The script performs a self-join on the “CartesianOfItems” table to generate all possible pairs of items.
The result will simply be all possible pairs of two items. While this solution looks into pairs only, more advanced techniques can analyze baskets of three, four, or more items.
Join (CartesianOfItems)
Load
ItemOneInPair as ItemTwoInPair
Resident CartesianOfItems
;
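For example, with three distinct items A, B and C, the self-join yields nine combinations, from (A, A) to (C, C); the Where clause in the next step reduces these to the three unordered pairs (A, B), (A, C) and (B, C).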
Step 5
The script creates a new table called “CandidatePairs“, which includes the columns “PairId“, “Association“, “AssociationDirection“, and “ItemToCheck“. The “PairId” column is a unique identifier for each pair of items, and the “Association” and “AssociationDirection” columns are used to describe the relationship between the two items in each pair. The “ItemToCheck” column simply lists the first item in each pair.
Whether the id is written as {A, B} or {B, A} is not important – it is just an id – but the chosen order must be used consistently.
CandidatePairs:
Load
'(' & ItemOneInPair & ', ' & ItemTwoInPair & ')' as PairId,
'(' & ItemOneInPair & '→' & ItemTwoInPair & ')' as Association,
'>' as AssociationDirection,
ItemOneInPair as ItemToCheck
Resident CartesianOfItems
Where AutoNumber(ItemOneInPair) < AutoNumber(ItemTwoInPair)
;
Step 6
The script concatenates the “CandidatePairs” table with a modified version of itself, in which the “Association” and “AssociationDirection” columns are reversed (“→” changes to “←” and “>” to “<”), and the “ItemToCheck” column now lists the second item in each pair.
This step essentially generates a table with two rows for each pair. The pair id is the same in both rows, while the “ItemToCheck” column contains each of the two ItemIds in a separate row. “ItemToCheck” will later serve as the key for joining to the transactions.
Concatenate (CandidatePairs)
Load
'(' & ItemOneInPair & ', ' & ItemTwoInPair & ')' as PairId,
'(' & ItemTwoInPair & '←' & ItemOneInPair & ')' as Association,
'<' as AssociationDirection,
ItemTwoInPair as ItemToCheck
Resident CartesianOfItems
Where AutoNumber(ItemOneInPair) < AutoNumber(ItemTwoInPair)
;
Drop Table CartesianOfItems;
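To illustrate, for the items “milk” and “pudding” (assuming AutoNumber sorts “milk” before “pudding”), the table now contains two rows:
PairId = '(milk, pudding)', Association = '(milk→pudding)', AssociationDirection = '>', ItemToCheck = 'milk'
PairId = '(milk, pudding)', Association = '(pudding←milk)', AssociationDirection = '<', ItemToCheck = 'pudding'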
Step 7
The script performs a join between the “Apriori” table and the “CandidatePairs” table, using the “ItemToCheck” column as the key. This step adds the “PairId“, “Association“, and “AssociationDirection” columns to the “Apriori” table.
Left Join (Apriori)
Load
ItemToCheck,
PairId,
Association,
AssociationDirection
Resident CandidatePairs;
Drop Table CandidatePairs;
Step 8
The script calculates the number of times each pair has been joined to each transaction, and adds a new column called “PairOccuresInTransaction” to the “Apriori” table.
This step is crucial because a transaction may contain only one item of a pair, in which case the pair is joined to it only once. A true association can only be identified when both items of a pair are present in the same transaction, i.e. when the pair has been joined twice.
0 is used for False and 1 is used for True.
Left Join (Apriori)
Load
TransactionId,
PairId,
If(Count(TransactionId)=2,1,0) as PairOccuresInTransaction
Resident Apriori
Group By
TransactionId,
PairId
;
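For example, if a transaction contains milk but not pudding, the pair {milk, pudding} is joined to it only once and “PairOccuresInTransaction” is 0; if the transaction contains both items, the pair is joined twice and the flag is 1.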
Step 9
Now the script calculates the support of each pair and joins it back to the “Apriori” table as a new column, “PairSupport”. This involves counting the number of transactions in which each pair occurs and dividing by the total number of transactions in the dataset. Because it is an inner join, pairs that never occur together are dropped at the same time.
The script uses the FieldValueCount() function to count the transactions in the dataset. This function is lightning fast, as it simply counts the distinct values of a field.
The support of a pair is simply the number of transactions in which the pair appears divided by the total number of transactions.
For example, if a pair {A, B} occurs in 5 out of 100 transactions, then the “Support” of this pair is 5%.
Let TotalTransactions = FieldValueCount('TransactionId');
Inner Join (Apriori)
Load
PairId,
Count(DISTINCT TransactionId) / $(TotalTransactions) as PairSupport
Resident Apriori
Where PairOccuresInTransaction = 1
Group By
PairId
;
Step 10
Finally, the script creates a new table called “Confidence”, which includes the columns “Association” and “Confidence”. The “Confidence” column is calculated as the ratio of the pair’s support to the support of the association’s base item (its “ItemToCheck”). This step essentially ranks the pairs of items by their level of association or dependence, and can be used to make recommendations or create product bundles based on frequent itemsets.
Confidence({A} -> {B}) = support({A, B}) / support({A})
Confidence({B} -> {A}) = support({A, B}) / support({B})
Confidence:
Load Distinct
Association,
Num(PairSupport / ItemSupport, '#.##0,0%') as Confidence
Resident Apriori
;
Confidence for both {A} -> {B} and {B} -> {A} is calculated in a single Load pass, thanks to the “Association” differentiator.
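For example, with the hypothetical values support({ham, whole milk}) = 2.9% and support({ham}) = 5%:
Confidence({ham} -> {whole milk}) = 0.029 / 0.05 = 58%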
Results
The script generates a list of all pairs (“PairId”) that have appeared in one or more transactions, along with their frequency (“PairSupport”). However, not all pairs are equally useful for prediction, as many have low support and therefore low predictive power. Some items may have a “Confidence” value of 100%, but this can be misleading if the pair appears in only a single transaction. Therefore, when analyzing the results, it’s important to consider a balance of multiple parameters simultaneously.
At this stage, it makes sense to filter on support and exclude pairs with low frequency, as they may have little predictive power. The support threshold to use depends on the number of distinct products in the dataset. In this example, we will consider only pairs that appeared in at least 2% of the transactions (PairSupport > 0.02).
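The filter can be applied in the front end, or directly in the script. A minimal sketch using the “PairSupport” column created in Step 9:
// Keep only pairs that occur in at least 2% of all transactions
Inner Join (Apriori)
Load Distinct
PairId
Resident Apriori
Where PairSupport > 0.02
;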
The results look more reasonable now.
Let us now filter on the first pair. The resulting output clearly illustrates the difference between a pair and an association: one pair always has two associations, most often with different “Confidence” levels.
This example illustrates the direction of an association.
Whole milk is purchased with a confidence of 58% when ham is bought. The reverse does not hold, however: the purchase of whole milk does not imply the purchase of ham, as the confidence is only 8%.
The results can be visualized in a scatter plot where each dot represents one “Pair”. The x-axis represents the strength of the {A} -> {B} association and the y-axis represents the strength of the {B} -> {A} association.
Section 1 groups items that have a strong one-directional association. Product A is often purchased when Product B is in the basket, but not the other way around.
Section 2 would represent items that are very often bought together, with no clear dominance of one item over the other.
Section 3 is a “mirror” reflection of Section 1. Product B is often purchased when Product A is in the basket, but not the other way around.
Section 4 groups pairs of items with low confidence, meaning that the association between them is weak. The closer to the bottom left corner of the scatter plot (i.e., 0-0), the weaker the association between the items in the pair.
Let’s hover over one dot, which turns out to be “other vegetables+frozen chicken”.
When a customer buys frozen chicken, there is a high probability that they will also purchase vegetables (B→A: 80%).
But
When a customer buys vegetables, the likelihood of them also buying frozen chicken is low (A→B: 0.3%).
B2B sales of a building materials manufacturer
In the dataset examined above, there are no pairs in Section 2, where items that frequently go together (without dominance of one item over the other) would be located.
This will change if we examine another data set.
Rather than analyzing individual transactions, you can aggregate the data to the customer level and treat each customer as a “TransactionId”. This approach will reveal associations between products that customers tend to buy together. By combining the apriori results with actual sales, you can identify customers who have only purchased one product from a popular pair. This way, you can uncover opportunities for upselling and increasing revenue.
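In script terms, the only change is which field plays the role of the transaction id. A minimal sketch, assuming a source table “Sales” with “CustomerId” and “ItemId” columns (the table and field names are assumptions):
// Treat each customer as one "basket" – table and field names are assumptions
Transactions:
Load Distinct
CustomerId as TransactionId,
ItemId
Resident Sales
;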
I conducted this analysis on data from a building materials manufacturer, which contains fewer “transactions” compared to the receipt data. However, the “baskets” in this data contain more items, which results in the identification of very strong association rules.
In this particular case, there are many pairs of items sold together without any specific directional relation, which are located in the upper right corner (Section 2) of the scatter plot.
There are fewer pairs toward the left and bottom of the chart because pairs with low support were filtered out.
The color represents Lift, an additional measure that can be taken into account.
Lift is an additional measure of the strength of the association between two items. It compares how often the two items are purchased together with how often they would be expected to be purchased together if they were independent. Lift values greater than 1 indicate a positive association (the items are purchased together more often than expected), while values less than 1 indicate a negative association (they are purchased together less often than expected). A lift value of 1 indicates no association between the two items.
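Lift({A} -> {B}) = support({A, B}) / (support({A}) × support({B}))
Note that, unlike confidence, lift has the same value for both directions of a pair.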
If you have any interesting comments or additional use cases, please let me know!