Facebooktwittergoogle_plusredditpinterestlinkedinmail

Companies usually have a large set of transaction, where each transaction includes the list of purchased items by a consumer. These transactions are mostly stored in databases, and a consumer might buy various products in different times. The companies are interested to analyze these transactions to model the baskets of consumers. It means that they want to find association between consumers’ purchased products and discover products that are bought together frequently. Such an analysis, also known as market basket analysis, is useful for activities such as product placement in the store, and building online recommendation engines. For example, we may find that e.g. Product_1 and Product_2 have been purchased by 1% of the consumers, and also 75% of these consumers have also bought Product_3. In this situation, it is an advantageous business decision to advertise the Product_3 to buyers of both Product_1 and Product_3.

Suppose that a list of transactions is available, each one represents a set of products purchased by a consumer. The number of possible rules between these products is a very large number, and thus brute-force searching among these possible rules would not be a computationally efficient method. The “Apriori” algorithm provided a better approach by adding some constraints on the rules. Let’s define:

  • Support of a products X={x1,…,xN}: The number of transactions that contain all of products in X, divided by the total number of transactions

The Apriori method uses the fact that if the support of the products X is a number k, then any product in X has support of at least k. The Apriori algorithm works as below. First, a value supmin is selected for the minimum support. Second, all the products with support of at least supmin are identified. In the next step, the frequent products of the second step are paired (i.e. {x1, x2}) and their supports are compared with supmin. In this way, the frequent pairs can be identified. The algorithm is continued with higher numbers of pairs (3, 4, …) until no other pairs with support of at least supmin could be found. One can set a threshold N on maximum number of products in the pairs for earlier stopping of the algorithm.

Finally, a set of rules are extracted from frequent rules. Each of these rules is in the form X → Y, meaning that when product X (Antecedent) is purchased, the product Y (Consequents) is also purchased. In this case, the left-hand side (LHS) of the rule is X, and the right-hand side (RHS) of the rule is Y. For example, the frequent pair {x1, x2} can be shown as two rules: x1 → x2 and x2 → x1.

For evaluating the generated rules, the following criteria are defined:

  • Confidence of a rule X → Y: The number of transactions that contain both X and Y, divided by number of transactions contain X.

The greater the value of confidence, means that the more likely that consumer buy products X and Y together.

  • Lift of a rule X → Y: how many times more often X and Y bought together than the case that they are statistically independent of each other

Lift is 1 if X and Y are statistically independent of each other. In contrast, A larger value of lift suggests a greater strength of the association between X and Y.

The Apriori algorithm has been implemented in R and Python, two common languages of data science. Here, we have shown the implementation of the algorithm on a list of transactions. The dataset is a dataframe where each row contains the id of purchaser and a purchased home appliance. Since some consumers have bought several products, multiple rows may have the same purchaser ids. The dataset header is as below:

market basket analysisImplementation of Apriori in R

The Apriori algorithm is implanted in arules package in R. In the first step, the data should be converted to a data with class of transactions.

apriori in R market basket analysis

The above results show that there are about 220000 transactions in our database with 18 types of home appliances. Now, the Apriori algorithm can be applied by predefined values of support and confidence. Also, we add an optional constraint that there are at least 3 products in the generated rules.

apriori market basket analysis

If we are interested to extract only the rules with consequent of Microwave or Refrigerator, we can revise the above command:

Since the rules are in R dataframe format, they can be easily sorted or sliced based on the confidence, lift or support values. In addition, package arulesViz provides several types of graphical representation for the Apriori rules.

Implementation of Apriori in Python

The Apriori algorithm is implanted in mlxtend package in Python. The input to this package is a pandas dataframe where each row represents the bought products of a consumer. Each column of this dataframe represents a product name. If consumer y1 bought product x1 and x2, the row of y1 has values of 1 for columns of x1 and x2, and values of 0 for remaining columns. Therefore, the transaction data should be converted to the appropriate format for mlxtend.

market basket analysis import pandas

Now, the Transactions dataframe can be analyzed using mlxtend.

Since the resulting rules are stored in dataframe format, regular commands of pandas can be used for sorting and slicing. For example, if we are interested only to rules with antecedent length of greater than 2:

apriori algorithm for market basket analysis

Summary:

Apriori algorithm is a computationally efficient algorithm to find a set of rules among some transactions. The algorithm employs market basket analysis, which analyzes purchases of several products made by consumers and finds which group of products are usually bought together. We have presented the implementation of Apriori algorithm both in R and Python.

References:
Facebooktwittergoogle_plusredditpinterestlinkedinmail

Leave a Reply

Your email address will not be published. Required fields are marked *

 

Login

Register | Lost your password?