- FP Growth Algorithm: Introduction
- Finding Frequent Itemsets
- Generation of strong association rules from frequent item sets
Contributed by: Sudip Das
LinkedIn Profile: https://www.linkedin.com/in/sudip-das29
Introduction
To understand FP Growth algorithm, we need to first understand association rules.
Association Rules uncover the relationship between two or more attributes. It is mainly in the form of- If antecedent than consequent. For example, a supermarket sees that there are 200 customers on Friday evening. Out of the 200 customers, 100 bought chicken, and out of the 100 customers who bought chicken, 50 have bought Onions. Thus, the association rule would be- If customers buy chicken then buy onion too, with a support of 50/200 = 25% and a confidence of 50/100=50%.
Association rule mining is a two-step process:
- Finding frequent Itemsets
- Generation of strong association rules from frequent itemsets
Finding Frequent Itemsets
Frequent itemsets can be found using two methods, viz Apriori Algorithm and FP growth algorithm.
Apriori algorithm generates all itemsets by scanning the full transactional database. Whereas the FP growth algorithm only generates the frequent itemsets according to the minimum support defined by the user. Since Apriori scans the whole database multiple times, it Is more resource-hungry and the time to generate the association rules increases exponentially with the increase in the database size. On the other hand, the FP growth algorithm doesn’t scan the whole database multiple times and the scanning time increases linearly. Hence, the FP growth algorithm is much faster than the Apriori algorithm.
So the topic of discussion will be limited to the FP growth algorithm in this post.
Also Read: Clustering Algorithm in Machine Learning
1. FP Tree construction by compressing the DB representing frequent items
Compressing the transactional database to mine association rules by finding frequent itemsets into a frequent pattern tree or FP-tree. This also retains the itemset association information.
So let’s start with a small transaction data to understand the construction of the FP tree. The transaction which we consider here suppose consists of 5 items such as-
Asparagus (A), Corn (C), Beans (B), Tomatoes (T) & Squash (S)
Table 1
Transaction ID | List of items in the transaction |
T1 | B , A , T |
T2 | A , C |
T3 | A , S |
T4 | B , A , C |
T5 | B , S |
T6 | A , S |
T7 | B , S |
T8 | B , A , S , T |
T9 | B , A , S |
So for example, for the first transaction T1 consists of three items such as Beans (B), Asparagus (A), and Tomatoes (T). Similarly, the transaction T6 contains the items Asparagus (A) and Squash (S). Let us also consider the minimum support for this small transaction data to be 2. Hence, min_support = 2.
First of all, we need to create a table of item counts in the whole transactional database as below:
Table 2
Item | Support Count |
Beans (B) | 6 |
Asparagus (A) | 7 |
Squash (S) | 6 |
Corn (C) | 2 |
Tomatoes (T) | 2 |
This is simply the count of each item, such as if we see Squash (S) has been bought in 6 transactions viz, T3, T5, T6, T7, T8 & T9, so the support count is 6 for Squash.
Next, we just need to sort the item list in descending order for their support count. Hence the table of support count may now be as represented in table 3 below:
Table 3
Item | Support Count |
Asparagus (A) | 7 |
Beans (B) | 6 |
Squash (S) | 6 |
Corn (C) | 2 |
Tomatoes (T) | 2 |
Beans (B) & Squash (S) have the same support count of 6 and any of them can be written first. Here, we have written Beans (B) first. Similarly, Corn(C) & tomatoes can also be listed in the same fashion.
Now we are ready to start with the construction of the FP Tree. The FP tree root node is usually represented with a NULL root node. Now consider the Transaction T1. T1 consists of Beans (B), Asparagus (A) & tomatoes (T). Now out of these three items, we need to look for the item which has the maximum support count. Since Asparagus (A) has the highest support count of 7, we will extend the tree from its root node to A as Asparagus. Since this is the first transaction, the count is denoted by A:1. Let’s look further, out of Beans(B) and Tomatoes (T) the support count of Beans is 6, and the support count for Tomatoes is 2. From Asparagus, we can extend the tree to Beans (B:1 for the first transaction consisting of beans) and after that T:1 for Tomatoes. The tree structure is as below in Figure 1.
Going further in the transaction T2, there are two items viz Asparagus(A) and Corn (C). Since the support count for Asparagus is 7 and Corn (C) is 2, we need to consider Asparagus (A) first and going from the root node we need to see if there is any branch that is extended to Asparagus (A) which is true in this case, and we can increase the count as A:2 (Figure 2). But after that, there is no node connected to Asparagus (A) to corn (C) we need to create another branch for Corn as C:1.
Similarly, for transaction T3 we have Asparagus (A) then Squash (S) in the descending order of their support count. So Asparagus(A) count has been increased from A:2 to A:3, and further, we can see that there aren’t any nodes for Squash from Asparagus, so we need to create another branch going for a Squash node S:1, as described in Figure 3.
For transaction 4, we can draw the node as below shown in Figure 4. Here the count for Asparagus and Beans has been increased to A:4 and B:2, but after that, we can see there isn’t any branch that extends to corn (C), so we need to create a node for C:1.
For transaction 5, we can see that it contains Beans (B) and Squash (S), but there aren’t any direct node linked to Beans from the Null root node, we need to create a new branch from the Null node as depicted in Figure 5 with B:1 and S:1 counts.
Transaction 6 to transaction 10 are self-explanatory. For your reference, the figures have been provided for each transaction. So for transaction 6 check figure 6, for transaction 7 check Figure 7, and so on.
Now we can cross-verify using Figure 9 and Table 3. The count for Asparagus (A) stands at A:7, which is similar in table 3. If we verify for Beans, there are two nodes B:4 and B:2 hence, the count matches with Beans count of 6 in table 3.
Also Read: Top Machine Learning Interview Questions for 2020
2. Compressing of Conditional DB
Now we need to construct a table for conditional pattern base and hence, the frequent pattern generation. Now let us consider the table in Figure 10, we need to consider the item which has the lowest support count, here, Tomatoes (T) has the lowest support count of 2. We need to see where the tomatoes are. We can see there are two traversal paths for tomatoes (T) from the root node. One of them being A-B-T, where T is having count 1 i.e., T:1. So we can create the conditional pattern base as in Row 1 of the table 4 {A, B:1}.
Also, the second traversal path is A-B-S-T and T has a count of 1 (T:1) so the conditional pattern base is {A, B, S;1}. So there are two conditional pattern bases {{A,B:1},{A,B,S:1}}. Now from this, we will construct the conditional FP tree column in Table 4. Here A is in {A,B:1} and {A,B,S:1} so both have a final count of 1 each summing up to <A:2>, in the similar fashion <B:2> and <S:1> finally it comes out to be <A:2,B:2, S:1>, but we will not consider S:1 since we have taken the minimum support count to be 2. So finally the value should be <A:2, B:2> for the first row i.e., Tomatoes (T).
Now to construct the Frequent pattern generation (the last column for table 4) we need to join the Conditional FP tree column with Item in table 4. Joining <A:2> with Tomatoes (T) we can write {A,T:2}, after that join <B:2> with tomatoes (T) it comes out to be {B,T:2}. Then we need to join A and B both with T hence {A,B,T:2} so the frequent pattern generation for the row of Tomatoes (T) becomes {A,T:2},{B,T:2},{A,B,T:2}.
Here both the A & B have a count of 2 so we have written {A, B, T:2}, but sometimes the case may arise that and B have different counts, in that case, we need to consider the minimum count of the item and write in the same fashion.
Now for the Corn (C) rows, we can see in Figure 10 that we can reach Corn (C) through two different paths A-B-C and A-C. In both the traversal paths, the count for Corn (C) is 1. Hence, {{A,B:1},{A:1}} is our conditional pattern base. Now for the Conditional FP tree column, we need to check for a count of each item in the conditional pattern base, for we have 2 as an account of And 1 as count of B. Since the minimum support count we have considered is 2, we need to neglect B. So the conditional FP Tree columns stands to be <A:2>. After this, join Conditional FP tree column with Item column for Corn (C) which comes out to be {A, C:2}
Now consider the Squash (S) row. We can reach Squash through 3 traversal paths A-B-S, A-S, and B-S. Since the count of Squash is 2 in A-B-S we can write {A,B:2} similarly, for the other two paths we can write {A;2} & {B:2} so the Conditional pattern base stands at {{A,B:2},{A:2},{B:2}}. For conditional FP tree columns, we can see that <B:2> has been reached from the right branch from the Null node and the other two through the left branch originating from the Null root.
Hence, we need to write differently and not add in one, we can write <A:4, B:2>,<B:2>. Now joining A:4 from <A:4,B:2> with Squash (S) gives {A,S:4}. Now from <A:4,B:2> joining with Squash (S) gives },{B,S:2} but we have written {B,S:4}. We have written {B, S:4} the count as 4 since we have another <B:2> in the Conditional FP tree column. Summing both B with S from <A:4,B:2> and B with S from <B:2> the count comes out to be {B,S:4}. Now joining <A:4,B:2> with S (the final joining all items) we get {A,B,S:2} not {A,B,S:4} since we need to consider the minimum count which is 2. Taking the minimum count is required since we need to check the frequent counts where both A & B occur with S not only one of the items. These three items Asparagus (A), Beans (B), Squash (S) occur only twice in this case so {A, B, S:2}.
Similarly, the last row for Bean is constructed. Also, Asparagus is directly from the Null node and since there aren’t any in-between nodes to reach Asparagus (A), there is no need to go for another row of Asparagus.
Table 4
Item | Conditional Pattern base | Conditional FP tree | Frequent Pattern Generation |
Tomatoes (T) | {{A,B:1},{A,B,S:1}} | <A:2,B:2> | {A,T:2},{B,T:2},{A,B,T:2} |
Corn (C) | {{A,B:1},{A:1}} | <A:2> | {A,C:2} |
Squash (S) | {{A,B:2},{A:2},{B:2}} | <A:4,B:2>,<B:2> | {A,S:4},{B,S:4},{A,B,S:2} |
Bean (B) | {{A:4}} | <A:4> | {A,B:4} |
Generation of strong association rules from frequent item sets
As already discussed, the FP growth generates strong association rules using a minimum support defined by the user, and what we have done till now is to get to the table 4 using minimum count=2 and finally generated frequent Item sets which are in the last column of the Frequent Pattern Generation in table 4. So considering table 4’s last column of frequent Pattern generation we have generated a 3-item frequent set as {A,B,T:2} & A,B,S:2}. Similarly 2-Item frequent sets are {A,T:2},{B,T:2}, {A,C:2}, {A,S:4},{B,S:4} & {A,B:4}. Also, we can see that we have arrived at distinct sets. None of the sets are similar in the case.
In comparison to the Apriori Algorithm, we have generated only the frequent patterns in the item sets rather than all the combinations of different items. For example, here we haven’t generated Tomatoes (T) & Squash (S) or Corn (C) & Beans(B) since they are not frequently bought together items which is the main essence behind the association rules criteria and FP growth algorithm.