
What is Association Rule Mining?

Association rule mining is a rule-based machine-learning technique designed to discover meaningful relationships between items within extensive databases. This method enables the identification of products that are frequently purchased together with a specific item. By conducting association rule analysis, businesses can optimize product displays, determine effective product bundling strategies, and formulate targeted marketing approaches based on customer purchasing patterns.

How to analyze association rules

In this project, I will use the Apriori algorithm, a classic method for finding association rules that satisfy a minimum support and a minimum confidence.

The Apriori algorithm consists of two steps.

  1. Generate frequent itemsets
  2. Generate association rules from them

A frequent itemset is a set of items that appears together in at least a minimum number of transactions, i.e., whose support meets the minimum support threshold.

When generating association rules, every non-empty proper subset of each frequent itemset is considered as a possible antecedent, and only the rules whose confidence exceeds the minimum confidence are kept.
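
The two steps above can be sketched on a toy basket list. This is a brute-force illustration of the idea (not the actual Apriori candidate pruning), using made-up transactions:

```python
from itertools import combinations

# Hypothetical toy transactions for illustration
transactions = [
    {'milk', 'bread', 'butter'},
    {'milk', 'bread'},
    {'bread', 'butter'},
    {'milk', 'bread', 'butter'},
]
min_support = 0.5
min_confidence = 0.7

n = len(transactions)
items = sorted(set().union(*transactions))

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

# Step 1: frequent itemsets (brute force over all candidate sizes)
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_support:
            frequent[frozenset(cand)] = s

# Step 2: rules from every non-empty proper subset of each frequent itemset
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for antecedent in combinations(itemset, k):
            conf = s / support(set(antecedent))
            if conf >= min_confidence:
                consequent = itemset - set(antecedent)
                print(set(antecedent), '=>', consequent,
                      f'support={s:.2f} confidence={conf:.2f}')
```

Real Apriori avoids enumerating all candidates by growing itemsets level by level and pruning any candidate with an infrequent subset; the structure of the two steps is the same.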

Considerations

3.1 Selection of useful association rules

Not all rules found by association rule analysis are useful: some are genuinely actionable, some are too obvious, and some are hard to explain.

For example, the rule "men who go to the supermarket on Saturdays tend to buy beer along with baby diapers" is useful because it can feed directly into a marketing strategy. On the other hand, a rule like "customers who buy iPhones tend to buy iPhones" is not worthwhile because it is trivially obvious. Finally, a rule such as "a lot of fans are sold in places that sell groceries" is difficult to act on because the correlation is hard to explain. In the end, it is up to the analyst to decide which of the discovered rules are actually useful.

3.2 Computational cost

As the number of items increases, the amount of computation grows exponentially, so finding association rules can take an enormous amount of time.
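
To make this concrete: a catalogue of n items admits 2^n - 1 non-empty itemsets, so the candidate space explodes quickly. A back-of-the-envelope check:

```python
# Number of non-empty candidate itemsets for a catalogue of n items
for n in (10, 20, 120):
    print(f'{n} items -> {2**n - 1:,} candidate itemsets')
```

With the roughly 120 items in this dataset the raw candidate space is astronomically large, which is why Apriori prunes via the downward-closure property (every subset of a frequent itemset must itself be frequent) and why we will also cap the maximum itemset length below.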

Code

Mount a drive:

from google.colab import drive

drive.mount('/content/gdrive')
Mounted at /content/gdrive

Import packages:

import matplotlib.pyplot as plt
import matplotlib.colors as mcl
import pandas as pd
import numpy as np

from matplotlib.colors import LinearSegmentedColormap
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

Load data:

This dataset contains 7500 transactions collected over the course of a week at a French retail store.

store_df = pd.read_csv('/content/gdrive/MyDrive/final-project/store_data.csv', header=None)
store_df.head()

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 shrimp almonds avocado vegetables mix green grapes whole weat flour yams cottage cheese energy drink tomato juice low fat yogurt green tea honey salad mineral water salmon antioxydant juice frozen smoothie spinach olive oil
1 burgers meatballs eggs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 chutney NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 turkey avocado NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 mineral water milk energy bar whole wheat rice green tea NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Generate frequent itemsets

records = []
for i in range(len(store_df)):
    records.append([str(item) for item in store_df.iloc[i] if not pd.isna(item)])

Create a dataframe for association rule analysis with the mlxtend library.

Each cell of this dataframe is 1 if the transaction (row) contains the item (column) and 0 otherwise.

te = TransactionEncoder()
te_ary = te.fit(records).transform(records, sparse=True)
te_df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
te_df.head()
asparagus almonds antioxydant juice asparagus avocado babies food bacon barbecue sauce black tea blueberries ... turkey vegetables mix water spray white wine whole weat flour whole wheat pasta whole wheat rice yams yogurt cake zucchini
0 0 1 1 0 1 0 0 0 0 0 ... 0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0

5 rows × 120 columns

The minimum support was set to 0.005 and the maximum itemset length to three.

frequent_itemset = apriori(te_df,
                           min_support=0.005,
                           max_len=3,
                           use_colnames=True
                          )
frequent_itemset['length'] = frequent_itemset['itemsets'].map(len)
frequent_itemset.sort_values('support',ascending=False,inplace=True)
frequent_itemset
support itemsets length
60 0.238368 (mineral water) 1
27 0.179709 (eggs) 1
83 0.174110 (spaghetti) 1
33 0.170911 (french fries) 1
20 0.163845 (chocolate) 1
... ... ... ...
646 0.005066 (tomatoes, mineral water, eggs) 3
648 0.005066 (spaghetti, eggs, olive oil) 3
674 0.005066 (soup, mineral water, frozen vegetables) 3
680 0.005066 (grated cheese, mineral water, ground beef) 3
724 0.005066 (pancakes, spaghetti, olive oil) 3

725 rows × 3 columns

The code below extracts the association rules. Only rules with a confidence of at least 0.005 are kept.

association_rules_df = association_rules(frequent_itemset,
                                         metric='confidence',
                                         min_threshold=0.005,
                                        )
all_confidences = []
collective_strengths = []
cosine_similarities = []
for _, row in association_rules_df.iterrows():
    # All-confidence: support(A ∪ B) / max(support(A), support(B)).
    # Orient the label so the more frequent itemset appears as the antecedent.
    all_confidence_if = list(row['antecedents'])[0]
    all_confidence_then = list(row['consequents'])[0]
    if row['antecedent support'] <= row['consequent support']:
        all_confidence_if = list(row['consequents'])[0]
        all_confidence_then = list(row['antecedents'])[0]
    all_confidence = {all_confidence_if + ' => ' + all_confidence_then:
                      row['support'] / max(row['antecedent support'], row['consequent support'])}
    all_confidences.append(all_confidence)

    # Collective strength: (1 - v) / (1 - E[v]) * (E[v] / v), where v is the
    # violation rate (A or B occurs without the other) and E[v] its expectation
    # under independence.
    violation = row['antecedent support'] + row['consequent support'] - 2 * row['support']
    ex_violation = 1 - row['antecedent support'] * row['consequent support'] - \
                   (1 - row['antecedent support']) * (1 - row['consequent support'])
    collective_strength = (1 - violation) / (1 - ex_violation) * (ex_violation / violation)
    collective_strengths.append(collective_strength)

    # Cosine similarity: support(A ∪ B) / sqrt(support(A) * support(B)).
    cosine_similarity = row['support'] / np.sqrt(row['antecedent support'] * row['consequent support'])
    cosine_similarities.append(cosine_similarity)

association_rules_df['all-confidence'] = all_confidences
association_rules_df['collective strength'] = collective_strengths
association_rules_df['cosine similarity'] = cosine_similarities
association_rules_df.head()
antecedents consequents antecedent support consequent support support confidence lift leverage conviction all-confidence collective strength cosine similarity
0 (spaghetti) (mineral water) 0.174110 0.238368 0.059725 0.343032 1.439085 0.018223 1.159314 {'mineral water => spaghetti': 0.2505592841163... 1.185493 0.293172
1 (mineral water) (spaghetti) 0.238368 0.174110 0.059725 0.250559 1.439085 0.018223 1.102008 {'mineral water => spaghetti': 0.2505592841163... 1.185493 0.293172
2 (chocolate) (mineral water) 0.163845 0.238368 0.052660 0.321400 1.348332 0.013604 1.122357 {'mineral water => chocolate': 0.220917225950783} 1.135588 0.266463
3 (mineral water) (chocolate) 0.238368 0.163845 0.052660 0.220917 1.348332 0.013604 1.073256 {'mineral water => chocolate': 0.220917225950783} 1.135588 0.266463
4 (mineral water) (eggs) 0.238368 0.179709 0.050927 0.213647 1.188845 0.008090 1.043158 {'mineral water => eggs': 0.21364653243847875} 1.076638 0.246056
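
As a sanity check, the three extra measures can be recomputed by hand from the support columns. Using the numbers for the first rule, (spaghetti) => (mineral water), from the table above:

```python
import math

# Values for (spaghetti) => (mineral water), read off the table above
ant, cons, supp = 0.174110, 0.238368, 0.059725

# All-confidence: support / max of the two marginal supports
all_confidence = supp / max(ant, cons)                 # ≈ 0.2506

# Cosine similarity: support / sqrt(product of marginal supports)
cosine = supp / math.sqrt(ant * cons)                  # ≈ 0.2932

# Collective strength from the violation rate and its expectation
violation = ant + cons - 2 * supp
ex_violation = 1 - ant * cons - (1 - ant) * (1 - cons)
collective_strength = (1 - violation) / (1 - ex_violation) * (ex_violation / violation)  # ≈ 1.1855

print(round(all_confidence, 4), round(cosine, 4), round(collective_strength, 4))
```

These match the `all-confidence`, `cosine similarity`, and `collective strength` columns in the dataframe.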

Results

max_i = 4
for i, row in association_rules_df.iterrows():
    print("Rule: " + list(row['antecedents'])[0] + " => " + list(row['consequents'])[0])

    print("Support: " + str(round(row['support'],2)))

    print("Confidence: " + str(round(row['confidence'],2)))
    print("Lift: " + str(round(row['lift'],2)))
    print("=====================================")
    if i==max_i:
        break
Rule: spaghetti => mineral water
Support: 0.06
Confidence: 0.34
Lift: 1.44
=====================================
Rule: mineral water => spaghetti
Support: 0.06
Confidence: 0.25
Lift: 1.44
=====================================
Rule: chocolate => mineral water
Support: 0.05
Confidence: 0.32
Lift: 1.35
=====================================
Rule: mineral water => chocolate
Support: 0.05
Confidence: 0.22
Lift: 1.35
=====================================
Rule: mineral water => eggs
Support: 0.05
Confidence: 0.21
Lift: 1.19
=====================================

Looking at the second rule, customers who buy 'mineral water' also buy 'spaghetti' with a support of 0.06, a confidence of 0.25, and a lift of 1.44. Since the lift is greater than 1, 'mineral water' and 'spaghetti' can be interpreted as having a kind of positive correlation.
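
The lift values can be reproduced directly from the columns printed above, since lift(A => B) = confidence(A => B) / support(B). This also shows why both directions of a rule share the same lift:

```python
# mineral water => spaghetti: confidence / support(spaghetti)
print(round(0.250559 / 0.174110, 2))   # 1.44

# spaghetti => mineral water: confidence / support(mineral water)
print(round(0.343032 / 0.238368, 2))   # 1.44
```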

The code below visualizes the association rule analysis results.

support = association_rules_df['support']
confidence = association_rules_df['confidence']

h = 347
s = 1
v = 1
colors = [
    mcl.hsv_to_rgb((h/360, 0.2, v)),
    mcl.hsv_to_rgb((h/360, 0.55, v)),
    mcl.hsv_to_rgb((h/360, 1, v))
]
cmap = LinearSegmentedColormap.from_list('my_cmap',colors,gamma=2)

measures = ['lift', 'leverage', 'conviction',
            'all-confidence', 'collective strength', 'cosine similarity']

fig = plt.figure(figsize=(15,10))
fig.set_facecolor('white')
for i, measure in enumerate(measures):
    ax = fig.add_subplot(320+i+1)
    if measure != 'all-confidence':
        scatter = ax.scatter(support,confidence,c=association_rules_df[measure],cmap=cmap)
    else:
        scatter = ax.scatter(support,confidence,c=association_rules_df['all-confidence'].map(lambda x: [v for k,v in x.items()][0]),cmap=cmap)
    ax.set_xlabel('support')
    ax.set_ylabel('confidence')
    ax.set_title(measure)

    fig.colorbar(scatter,ax=ax)
fig.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

(Figure: six scatter plots of support vs. confidence, colored by lift, leverage, conviction, all-confidence, collective strength, and cosine similarity.)
