"Association Rule Mining a.k.a Market Basket Analysis extracts underlying patterns and relationships that are otherwise not so apparent. The co-occurrences of data items can reveal inherent dependencies and establish rules of specific strength, often useful as a recommendation mechanism. Here is how you can quickly implement this.."
Measures of Association Rules
The following measures are used to evaluate the strength of an association. Suppose you are interested in the association between two events A and B:
- Support = Number of Rows having both A AND B / Total Number of Rows
- Confidence = Number of Rows having both A AND B / Number of Rows with A
- Expected Confidence = Number of rows with B / Total Number of Rows
- Lift = Confidence / Expected Confidence.
Lift is the growth factor by which the co-occurrence of A and B exceeds the probability expected if A and B were unrelated. In other words, the higher the lift (> 1), the higher the chance of B co-occurring with A.
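As a quick worked example, here are the four measures computed on made-up counts (1000 rows in total, 120 containing A, 150 containing B, and 60 containing both):

```r
# hypothetical counts, for illustration only
n_total <- 1000 # total number of rows
n_A     <- 120  # rows containing A
n_B     <- 150  # rows containing B
n_AB    <- 60   # rows containing both A and B

support    <- n_AB / n_total # 0.06
confidence <- n_AB / n_A     # 0.5
expected_confidence <- n_B / n_total # 0.15
lift <- confidence / expected_confidence # 3.33 -> B is about 3x more likely when A occurs
```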
library (arules) # Load the libraries
data (Groceries) # Load the data set
By default, the ‘Groceries’ dataset is of class ‘transactions’. Since the ‘arules’ package is designed to work with the ‘transactions’ class, it is desirable to convert your data frame to this class. Here is how you can convert it.
transDat <- as (myDataFrame, "transactions") # convert to 'transactions' class
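If your raw data is a list of baskets rather than a data frame, the same coercion works; a minimal sketch with made-up baskets:

```r
library (arules)
# made-up baskets, for illustration only
baskets <- list (c("milk", "bread"),
                 c("milk", "butter"),
                 c("bread", "butter", "jam"))
transDat <- as (baskets, "transactions") # coerce the list to 'transactions' class
summary (transDat) # 3 transactions over 4 distinct items
```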
Some Groundwork: Methods of ‘Transactions’ class dataset
inspect (transDat) # view the observations
length (transDat) # get number of observations
size (transDat) # number of items in each observation
LIST(transDat) # convert 'transactions' to a list, note the LIST in CAPS
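As a sketch of how these methods fit together, here they are applied to the built-in ‘Groceries’ data (the printed values depend on the dataset):

```r
library (arules)
data (Groceries) # built-in 'transactions' dataset
length (Groceries) # number of transactions (9835 for Groceries)
size (head (Groceries, 3)) # number of items in each of the first 3 baskets
LIST (head (Groceries, 2)) # first two baskets as a plain list of item labels
```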
Let's Apply The Apriori Algorithm
For illustrative purposes, let's continue working with the ‘Groceries’ dataset from the ‘arules’ package.
frequentItems <- eclat (Groceries, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items
itemFrequencyPlot (Groceries,topN=10,type="absolute") # plot frequent items
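To see the actual itemsets found by eclat(), sort them by support and inspect the top few (a sketch; the exact itemsets depend on the support threshold used above):

```r
inspect (sort (frequentItems, by="support")[1:5]) # top 5 frequent itemsets by support
```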
A low support combined with a high confidence helps extract strong relationships even when the overall co-occurrence in the data is low.
rules <- apriori (Groceries, parameter = list(supp = 0.001, conf = 0.5)) # Min Support as 0.001, confidence as 0.5
quality(rules) # show the support, lift and confidence for all rules
rules <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules
options (digits=2) # show only 2 digits
inspect (rules[1:5]) # show the top 5 rules
How To Control The Number Of Rules In Output?
Adjust the maxlen and conf arguments in the apriori() call to control the number of rules generated. Use your best judgement here.
rules <- apriori (Groceries, parameter = list (supp = 0.001, conf = 0.5, maxlen=3)) # maxlen = 3 limits the elements in a rule to 3
- To get ‘strong’ rules, increase the value of the ‘conf’ parameter.
- To get ‘longer’ rules, increase ‘maxlen’.
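As a sketch of the effect, compare the number of rules returned for two different confidence thresholds (the exact counts depend on the data and thresholds):

```r
rules_loose <- apriori (Groceries, parameter = list (supp = 0.001, conf = 0.5))
rules_tight <- apriori (Groceries, parameter = list (supp = 0.001, conf = 0.8))
length (rules_loose) # more rules at the lower confidence threshold
length (rules_tight) # fewer, but 'stronger', rules
```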
How To Remove Redundant Rules?
Use the code below to find and filter out the redundant rules.
redundant <- which (colSums (is.subset (rules, rules)) > 1) # get redundant rules in vector
rules <- rules[-redundant] # remove redundant rules
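Recent versions of the ‘arules’ package also ship an is.redundant() helper that does this filtering directly:

```r
rules <- rules[!is.redundant(rules)] # keep only the non-redundant rules
```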
How To Find Rules Related To Given Item(s)?
This method is the core of ‘Market basket analysis’ that is useful to make recommendations of new items to your users. This can be achieved by modifying the ‘appearance’ parameter in the apriori() function. For example,
Find what factors influenced an event ‘X’
Suppose you want to find out what customers had purchased before buying ‘whole milk’. This will help you understand the patterns that led to its purchase.
rules <- apriori (data=Groceries, parameter=list (supp=0.001,conf = 0.08), appearance = list (default="lhs",rhs="whole milk"), control = list (verbose=F)) # get rules that lead to buying 'whole milk'
Find out what events were influenced by a given event
In this case we answer the question: customers who bought ‘whole milk’ also bought what? Here, ‘whole milk’ goes on the LHS (left-hand side) of the rule.
rules <- apriori (data=Groceries, parameter=list (supp=0.001,conf = 0.15,minlen=2), appearance = list (default="rhs",lhs="whole milk"), control = list (verbose=F)) # those who bought 'milk' also bought..
Sort the rules, filter out the redundant ones and show the top 7 rules.
rules <- sort (rules, decreasing=TRUE,by="confidence")
redundant <- which (colSums(is.subset(rules, rules)) > 1) # get redundant rules in vector
rules <- rules[-redundant] # remove redundant rules
inspect (rules[1:7]) # show the top 7 rules
Making Rules For Continuous Data
If you try to make rules on continuous variables, each distinct value will be treated as a separate item, causing an undesirable explosion in the number of rules. So, convert the continuous variables to factors, which can be done easily using the discretize() function.
discretize (x, method="cluster", breaks=3) # method can be "interval", "frequency", "cluster" or "fixed"; recent 'arules' versions use the 'breaks' argument (older versions used 'categories')
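A small sketch on made-up values, showing that discretize() returns a factor whose levels can then act as items:

```r
library (arules)
x <- c(1, 2, 3, 10, 11, 12, 50, 60, 70) # made-up continuous values
discretize (x, method="frequency", breaks=3) # a factor with 3 roughly equal-sized bins
```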
Visualizing The Rules
library (arulesViz) # needed to plot the rules
# Interactive Plot
plot (rules[1:25], method="graph", interactive=TRUE, shading="confidence") # feel free to expand and move around the objects in this plot
plot (rules, measure=c("support", "lift"), shading="confidence")
More Useful Functions
affinity(transDat) # calculates affinity - the 'n x n' Jaccard index affinity matrix
transDat_c <- addComplement(transDat, "Item 1") # adds an artificial complement-item, present in every transaction that does NOT contain "Item 1"
duplicated(rules) # find out if any rule is duplicated
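The interestMeasure() function can compute many additional quality measures for rules that have already been mined; a sketch, assuming ‘rules’ were mined from ‘Groceries’:

```r
interestMeasure (rules, measure = c("conviction", "leverage"),
                 transactions = Groceries) # extra quality measures, one row per rule
```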