- Apriori Algorithm: Introduction
- Who Introduced it?
- Assumptions
- How to decide on the frequency?
- How to improve efficiency?
Contributed by: Shreya Shetty
LinkedIn Profile: https://www.linkedin.com/in/shreya-shetty-9a070792/
Before we dive deep into the Apriori algorithm, we must understand the background of its application.
Even in the era of online shopping, we still make time to visit supermarkets for a quick pick-up. Have you ever wondered why certain items are placed together, and whether there is a reason behind their placement? Give it a thought: conditioners are placed near shampoo, and onions and potatoes are kept close together. The list goes on: bread and jam, sugar and tea bags, etc. Discounts follow the same logic; on buying baby soap, for example, we get a discount on lotion. The shopkeeper understands the customers’ buying habits and makes a profit out of it. This is nothing but market basket analysis. Isn’t that a great way of optimizing sales?
All of these are perfect examples of association rules. Association rules can also be applied in the medical field to determine which symptoms tend to co-occur, helping with diagnosis and speedy recovery. Association rule learning is one of the methods of data mining. Others are as below:
- Correlation
- Classification
- Clustering
Definition
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. Here variables are Items. Databases are places where historic transactions are stored (buying patterns of customers).
I will quickly highlight a few concepts that need to be understood before going further into the Apriori algorithm.
Itemset: A set of items taken together is called an itemset. An itemset contains one or more items; an itemset with k items is called a k-itemset.
Frequent Itemset: An itemset that occurs frequently is called a frequent itemset. An itemset is frequent if its support meets a minimum threshold value. A confidence threshold applies later, when rules are generated from the frequent itemsets.
Support: Tells us how popular an itemset is, measured as the proportion of transactions in which the itemset appears. For an itemset with several items, only transactions in which all of those items are purchased together are counted. Suppose 5 transactions are under study and Milk is purchased in 3 of them.
Support for Milk = 3/5
Confidence: How likely item Y is to be purchased when item X is purchased, expressed as {X -> Y}. It is measured as the proportion of transactions containing X that also contain Y. Say Milk and Bread are analysed together, and Bread is purchased along with Milk in 2 of the 5 transactions.
Confidence (Milk->Bread) = Support for (Milk, Bread) / Support for Milk = (2/5) / (3/5) = 2/3
The drawback of confidence is that it accounts only for how popular Milk is, not Bread, which might misrepresent the importance of an association.
Lift: How likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. Say Bread was purchased in 2 out of the 5 transactions:
Support for Bread = 2/5
Lift (Milk->Bread) = Support for (Milk, Bread) / (Support for Milk * Support for Bread) = (2/5) / ((3/5) * (2/5)) = 5/3
A lift greater than 1 means Bread is more likely to be bought when Milk is bought; a lift of 1 means the two purchases are independent.
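The three measures above can be checked with a few lines of Python. This is only a minimal sketch: the five transactions are made up to match the counts in the example (Milk in 3 of 5, Bread in 2 of 5, both together in 2 of 5), and the function names `support`, `confidence`, and `lift` are illustrative, not taken from any library.

```python
from fractions import Fraction

# Hypothetical transactions chosen to match the counts in the text:
# Milk in 3 of 5, Bread in 2 of 5, Milk and Bread together in 2 of 5.
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Sugar"},
    {"Butter", "Sugar"},
    {"Potato"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(x, y, transactions):
    """Support of X and Y together, divided by the support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X -> Y, divided by the support of Y."""
    return confidence(x, y, transactions) / support(y, transactions)

print(support({"Milk"}, transactions))                # 3/5
print(confidence({"Milk"}, {"Bread"}, transactions))  # 2/3
print(lift({"Milk"}, {"Bread"}, transactions))        # 5/3
```

Exact fractions are used here so the results line up with the hand calculation; floating-point division would work just as well in practice.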
Let’s relate all these to the Apriori Algorithm.
Association rule mining has to:
- Find all the frequent itemsets.
- Generate association rules from the above frequent itemsets.
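The second task can be sketched as follows, assuming the frequent itemset has already been found. The helper name `rules_from_itemset`, the sample transactions, and the 0.8 confidence threshold are all hypothetical choices for illustration.

```python
from itertools import combinations

# Hypothetical transactions, used only to illustrate rule generation.
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Sugar"},
    {"Butter", "Sugar"},
    {"Potato"},
]

def rules_from_itemset(itemset, transactions, min_confidence):
    """Split a frequent itemset into antecedent -> consequent rules.

    Assumes `itemset` occurs in at least one transaction, so every
    antecedent has a non-zero count.
    """
    def count(s):
        return sum(1 for t in transactions if s <= t)

    itemset = frozenset(itemset)
    rules = []
    # Every non-empty proper subset can act as the antecedent.
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            conf = count(itemset) / count(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

for antecedent, consequent, conf in rules_from_itemset({"Milk", "Bread"}, transactions, 0.8):
    print(antecedent, "->", consequent, conf)
```

On this toy data only Bread -> Milk passes the threshold: every transaction containing Bread also contains Milk, whereas Milk -> Bread holds in only 2 of the 3 Milk transactions.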
Frequent itemset (or pattern) mining serves as a building block for:
- Frequent pattern mining
- Sequential pattern mining
- Many other data mining tasks
The Apriori algorithm was one of the first algorithms proposed for frequent itemset mining.
Why the name?
It uses prior (a priori) knowledge of frequent itemset properties.
Who introduced it?
Rakesh Agrawal and Ramakrishnan Srikant in 1994.
Assumptions
- All subsets of a frequent itemset must be frequent (Apriori property)
- If an itemset is infrequent, all its supersets will be infrequent and thus can be ignored (Antimonotone property)
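Both properties follow from one observation: an itemset can never occur more often than any of its subsets. A minimal sketch that checks this on a small, made-up set of transactions (the helper names are illustrative):

```python
from itertools import combinations

# Hypothetical transactions for illustration.
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter", "Sugar"},
    {"Milk", "Bread", "Sugar"},
    {"Milk", "Bread", "Butter", "Sugar"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def subsets_at_least_as_frequent(itemset, transactions):
    """True if every non-empty proper subset occurs at least as often as `itemset`."""
    items = sorted(itemset)
    base = count(items, transactions)
    return all(
        count(sub, transactions) >= base
        for size in range(1, len(items))
        for sub in combinations(items, size)
    )

print(count({"Milk", "Butter"}, transactions))                                   # 2
print(count({"Milk", "Bread", "Butter"}, transactions))                          # 2
print(subsets_at_least_as_frequent({"Milk", "Bread", "Butter"}, transactions))   # True
```

Because the check holds for any itemset, an itemset that fails the support threshold lets the algorithm skip all of its supersets without counting them.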
How to decide on the frequency?
A minimum support threshold is set based on expert advice or the user’s understanding of the data.
Steps:
- Join Step: This step generates candidate (K+1)-itemsets from the frequent K-itemsets by joining the set of frequent K-itemsets with itself.
- Prune Step: This step scans the database to count each candidate itemset. If a candidate does not meet the minimum support, it is regarded as infrequent and removed. This step is performed to reduce the size of the candidate set.
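The two steps can be sketched in Python as below. This is one possible reading, not a reference implementation: `generate_candidates` joins the frequent K-itemsets and also drops any candidate that has an infrequent K-subset (an application of the Apriori property), while `prune_by_support` does the database scan. Both function names and the sample itemsets A to D are hypothetical.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Join step: build (k+1)-item candidates from the frequent k-itemsets."""
    frequent_k = set(frequent_k)
    if not frequent_k:
        return set()
    k = len(next(iter(frequent_k)))
    # Join the set with itself: keep unions that have exactly k + 1 items.
    joined = {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}
    # Apriori property: every k-item subset of a candidate must itself be frequent.
    return {
        c for c in joined
        if all(frozenset(sub) in frequent_k for sub in combinations(c, k))
    }

def prune_by_support(candidates, transactions, min_count):
    """Prune step: scan the transactions and keep candidates seen at least `min_count` times."""
    return {
        c for c in candidates
        if sum(1 for t in transactions if c <= t) >= min_count
    }

# Hypothetical frequent 2-itemsets over items A, B, C, D.
frequent_2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
candidates = generate_candidates(frequent_2)
print([sorted(c) for c in candidates])  # [['A', 'B', 'C']]

transactions = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "D"}]
print(prune_by_support(candidates, transactions, min_count=2) == candidates)  # True
```

In the example, {A, B, D} and {B, C, D} are produced by the join but discarded before any counting, because {A, D} and {C, D} are not frequent.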
Detailed steps:
- Set a threshold support level, say 50%, and consider the following transactions:
Transaction ID | Milk | Bread | Butter | Sugar | Potato |
t1 | 1 | 1 | 1 | 0 | 0 |
t2 | 0 | 1 | 1 | 1 | 0 |
t3 | 0 | 1 | 0 | 1 | 1 |
t4 | 1 | 1 | 0 | 1 | 0 |
t5 | 1 | 1 | 1 | 0 | 1 |
t6 | 1 | 1 | 1 | 1 | 1 |
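- Create a frequency table of all the items that occur in the transactions. Prune the frequency table to keep only those items with a support level over 50%, that is, items appearing in more than 3 of the 6 transactions. Potato appears in only 3 transactions, so it is dropped.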
Item | Freq |
Milk | 4 |
Bread | 6 |
Butter | 4 |
Sugar | 4 |
- Make pairs of the remaining items as below, and calculate the frequency of each pair from the transaction table:
Itemset | Freq |
Milk Bread | 4 |
Milk Butter | 3 |
Milk Sugar | 2 |
Bread Butter | 4 |
Bread Sugar | 4 |
Butter Sugar | 2 |
Apply the same threshold, and we are left with Milk Bread, Bread Butter, and Bread Sugar.
- Now analyse the 3-item sets. We have Milk Bread Butter, Bread Butter Sugar, and Milk Bread Sugar. Repeat the previous step to calculate the frequency, and apply the threshold to eliminate the non-frequent itemsets.
Itemset | Freq |
Milk Bread Butter | 3 |
Milk Bread Sugar | 2 |
Bread Butter Sugar | 2 |
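None of these reaches a count above 3, so under the same "over 50%" threshold no 3-item set is frequent, and the frequent pairs from the previous step are the final result. Milk Bread Butter comes closest, with a count of 3 (exactly 50%), and would be kept only if the threshold were "at least 50%". In practice, we will have a huge number of transactions to go through to get these results, and many more combinations to evaluate before arriving at the strongest associations.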
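The whole walkthrough can be reproduced with a short Python sketch over the transaction table. It reads "over 50%" of 6 transactions as a count of at least 4, which is an assumption of this sketch, and the function name `apriori` is illustrative rather than a library call.

```python
from itertools import combinations

ITEMS = ["Milk", "Bread", "Butter", "Sugar", "Potato"]
ROWS = [
    (1, 1, 1, 0, 0),  # t1
    (0, 1, 1, 1, 0),  # t2
    (0, 1, 0, 1, 1),  # t3
    (1, 1, 0, 1, 0),  # t4
    (1, 1, 1, 0, 1),  # t5
    (1, 1, 1, 1, 1),  # t6
]
transactions = [frozenset(i for i, flag in zip(ITEMS, row) if flag) for row in ROWS]

def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least `min_count` transactions."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {c for c in singles if count(c) >= min_count}
    frequent = {}
    k = 1
    while current:
        frequent.update({c: count(c) for c in current})
        # Join: unions of two frequent k-itemsets that have exactly k + 1 items.
        joined = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Apriori property: drop candidates that have an infrequent k-subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # Prune: keep only candidates that meet the minimum count.
        current = {c for c in candidates if count(c) >= min_count}
        k += 1
    return frequent

result = apriori(transactions, min_count=4)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With `min_count=4` the sketch returns the four single items and the pairs Milk Bread, Bread Butter, and Bread Sugar, and no 3-item set. Lowering `min_count` to 3 (at least 50%) also admits Potato, Milk Butter, Bread Potato, and Milk Bread Butter, which shows how sensitive the result is to the exact threshold.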
Pros:
- Simple algorithm
- Easy to apply to large itemsets in large databases using the join and prune steps
Cons:
- It requires heavy computation if the itemsets are very large and the minimum support is kept very low
- The algorithm scans the database too many times, which reduces the overall performance
- Time and space complexity of this algorithm is very high
How to Improve Efficiency?
- Hash-Based Technique
- Transaction Reduction
- Partitioning
- Sampling
- Dynamic Itemset Counting
Applications:
Some applications were already mentioned in the introduction. Others are:
- Recommendation systems used by e-commerce companies.
- Autocomplete features in search engines.
- Finding associations in student databases, patient databases, etc.
Conclusion
For a beginner, Apriori provides an easy way to understand association rules and quickly apply them to market basket analysis. Although it has limitations, it can be considered for many applications.