Most companies have a large number of transactions, and each transaction includes a list of products purchased by a consumer. A consumer may buy different products at different times, and these transactions are usually stored in databases.

Therefore, companies are interested in **analyzing these transactions to model ‘baskets’ of consumers**. In other words, they want to find an association between consumers’ purchased products and discover products that were frequently bought together. Such an analysis, known as a Market Basket Analysis, is useful for activities such as product placement in stores, and building online recommendation engines. For instance, we may find that Product_1 and Product_2 have been purchased by 1% of consumers, and 75% of these consumers have also bought Product_3. In this case, it is beneficial to advertise Product_3 to buyers of both Product_1 and Product_3.

Suppose that a list of transactions is available, and each one represents a set of products purchased by a consumer. The number of possible rules between these products is a very large number, and thus brute-force searching among these possible rules would not be a computationally efficient method.

The “Apriori” algorithm provides a better approach by adding some constraints on the rules. Let’s define:

- Support of a products X={x
_{1},…,x_{N}}: The number of transactions that contain all of products in X, divided by the total number of transactions

**The Apriori method uses the fact that if the support of products X is a number k, then any product in X has support of at ****least k.** The Apriori algorithm works as below:

First, a value sup_{min} is selected for the minimum support.

Second, all the products with support of at least sup_{min }are identified.

In the next step, frequent products of the second step are paired (i.e. {x_{1}, x_{2}}) and their supports are compared with sup_{min}. This way, the frequent pairs can be identified. The algorithm continues with higher numbers of pairs (3, 4, …) until no other pairs with support of at least sup_{min }can be found. A threshold *N* can also be set on maximum number of products in the pairs to stop the algorithm earlier.

Finally, a set of rules is extracted from frequent rules. Each of these rules is in the form X → Y, meaning that when product X (Antecedent) is purchased, the product Y (Consequent) is also purchased. In this case, the left-hand side (LHS) of the rule is X, and the right-hand side (RHS) of the rule is Y. For example, the frequent pair {x_{1}, x_{2}} can be shown as two rules: x_{1} → x_{2} and x_{2} → x_{1}.

To evaluate the generated rules, the following criteria are defined:

- Confidence of a rule X → Y: The number of transactions that contain both X and Y, divided by number of transactions contain X.

The greater the value of confidence, the more likely that a consumer buys products X and Y together.

- Lift of a rule X → Y: How many times more often X and Y bought together than the case that they are statistically independent of each other

Lift is 1 if X and Y are statistically independent of each other. In contrast, a larger value of lift suggests a greater strength of the association between X and Y.

The Apriori algorithm has been implemented in R and Python, two common languages of data science. Here, we have shown the implementation of the algorithm on a list of transactions. The dataset is a data frame where each row contains the id of a purchaser and a purchased home appliance. Since some consumers bought several products, multiple rows may have the same purchaser IDs. The dataset header is as below:

## Implementation of Apriori in R

The Apriori algorithm is implanted in arules package in R. In the first step, the data should be converted to a data with class of transactions.

The above results show that there are about 220000 transactions in the database with 18 types of home appliances. Now, the Apriori algorithm can be applied by predefined values of support and confidence. We also add an optional constraint that there are at least 3 products in the generated rules.

If we are interested in extracting only the rules with consequent of Microwave or Refrigerator, we can revise the above command:

Since the rules are in R dataframe format, they can be easily sorted or sliced based on the confidence, lift or support values. In addition, package arulesViz provides several types of graphical representation for the Apriori rules.

## Implementation of Apriori in Python

The Apriori algorithm is implanted in mlxtend package in Python. The input to this package is a pandas dataframe where each row represents the bought products of a consumer. Each column of this dataframe represents a product name. If consumer y_{1} bought product x_{1} and x_{2}, the row of y_{1} has values of 1 for columns of x_{1} and x_{2}, and values of 0 for remaining columns. Therefore, the transaction data should be converted to the appropriate format for mlxtend.

Now, the Transactions dataframe can be analyzed using mlxtend.

Since the resulting rules are stored in dataframe format, regular commands of pandas can be used for sorting and slicing. For example, if we are interested only in rules with antecedent length of greater than 2:

## Summary

Apriori algorithm is a computationally efficient algorithm to find a set of rules among some transactions. **The algorithm employs market basket analysis, which analyzes purchases of several products made by consumers and finds which group of products are usually bought together.** We have presented the implementation of Apriori algorithm both in R and Python.

##### References

- Dietrich, David, Barry Heller, and Biebie Yang. “Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons, (2015).
- Raschka, Sebastian, “Mlxtend “, (2016). http://dx.doi.org/10.5281/zenodo.594432
- Arules documentation, http://s2.smu.edu/IDA/arules/