
What is Association Rule Mining?

Association rule mining is a rule-based machine-learning technique designed to discover meaningful relationships between items within extensive databases. This method enables the identification of products that are frequently purchased together with a specific item. By conducting association rule analysis, businesses can optimize product displays, determine effective product bundling strategies, and formulate targeted marketing approaches based on customer purchasing patterns.

How to analyze association rules

In this project, I will use the Apriori algorithm, a classic method for finding association rules that satisfy a minimum support and a minimum confidence.

The Apriori algorithm consists of two steps.

  1. Generate frequent itemsets
  2. Generate association rules from them

A frequent itemset is a set of items that appears together in at least a minimum number of transactions, i.e., whose support meets the minimum support threshold.

When generating association rules, every non-empty proper subset of each frequent itemset is considered as a possible antecedent, and only the rules whose confidence exceeds the minimum confidence are kept.
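
The two steps above can be sketched on a toy basket list. This is a brute-force illustration of the idea (not the actual Apriori candidate pruning), using made-up transactions:

```python
from itertools import combinations

# Hypothetical toy transactions for illustration
transactions = [
    {'milk', 'bread', 'butter'},
    {'milk', 'bread'},
    {'bread', 'butter'},
    {'milk', 'bread', 'butter'},
]
min_support = 0.5
min_confidence = 0.7

n = len(transactions)
items = sorted(set().union(*transactions))

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

# Step 1: frequent itemsets (brute force over all candidate sizes)
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_support:
            frequent[frozenset(cand)] = s

# Step 2: rules from every non-empty proper subset of each frequent itemset
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for antecedent in combinations(itemset, k):
            conf = s / support(set(antecedent))
            if conf >= min_confidence:
                consequent = itemset - set(antecedent)
                print(set(antecedent), '=>', consequent,
                      f'support={s:.2f} confidence={conf:.2f}')
```

Real Apriori avoids enumerating all candidates by growing itemsets level by level and pruning any candidate with an infrequent subset; the structure of the two steps is the same.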

Considerations

3.1 Selection of useful association rules

Not all rules found by association rule analysis are useful: some are genuinely actionable, some are too obvious, and some are hard to explain.

For example, the rule "men who go to the supermarket on Saturdays tend to buy beer along with baby diapers" is useful because it can feed directly into a marketing strategy. On the other hand, a rule like "customers who buy iPhones tend to buy iPhones" is not worthwhile because it is trivially obvious. Finally, a rule such as "a lot of fans are sold in places that sell groceries" is difficult to act on because the correlation is hard to explain. In the end, it is up to the analyst to decide which of the discovered rules are actually useful.

3.2 Computational cost

As the number of items increases, the amount of computation grows exponentially, so finding association rules can take an enormous amount of time.
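
To make this concrete: a catalogue of n items admits 2^n - 1 non-empty itemsets, so the candidate space explodes quickly. A back-of-the-envelope check:

```python
# Number of non-empty candidate itemsets for a catalogue of n items
for n in (10, 20, 120):
    print(f'{n} items -> {2**n - 1:,} candidate itemsets')
```

With the roughly 120 items in this dataset the raw candidate space is astronomically large, which is why Apriori prunes via the downward-closure property (every subset of a frequent itemset must itself be frequent) and why we will also cap the maximum itemset length below.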

Code

Mount a drive:

from google.colab import drive

drive.mount('/content/gdrive')
Mounted at /content/gdrive

Import packages:

import matplotlib.pyplot as plt
import matplotlib.colors as mcl
import pandas as pd
import numpy as np

from matplotlib.colors import LinearSegmentedColormap
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

Load data:

This dataset contains 7500 transactions collected over the course of a week at a French retail store.

store_df = pd.read_csv('/content/gdrive/MyDrive/final-project/store_data.csv', header=None)
store_df.head()

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 shrimp almonds avocado vegetables mix green grapes whole weat flour yams cottage cheese energy drink tomato juice low fat yogurt green tea honey salad mineral water salmon antioxydant juice frozen smoothie spinach olive oil
1 burgers meatballs eggs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 chutney NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 turkey avocado NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 mineral water milk energy bar whole wheat rice green tea NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Generate frequent itemsets

records = []
for i in range(len(store_df)):
    records.append([str(item) for item in store_df.iloc[i] if not pd.isna(item)])

Create a dataframe for association rule analysis with the mlxtend library.

Each cell of this dataframe is 1 if the transaction (row) contains the item (column) and 0 otherwise.

te = TransactionEncoder()
te_ary = te.fit(records).transform(records, sparse=True)
te_df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
te_df.head()
asparagus almonds antioxydant juice asparagus avocado babies food bacon barbecue sauce black tea blueberries ... turkey vegetables mix water spray white wine whole weat flour whole wheat pasta whole wheat rice yams yogurt cake zucchini
0 0 1 1 0 1 0 0 0 0 0 ... 0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0

5 rows × 120 columns

The minimum support was set to 0.005 and the maximum itemset length to three.

frequent_itemset = apriori(te_df,
                           min_support=0.005,
                           max_len=3,
                           use_colnames=True
                          )
frequent_itemset['length'] = frequent_itemset['itemsets'].map(len)
frequent_itemset.sort_values('support',ascending=False,inplace=True)
frequent_itemset
support itemsets length
60 0.238368 (mineral water) 1
27 0.179709 (eggs) 1
83 0.174110 (spaghetti) 1
33 0.170911 (french fries) 1
20 0.163845 (chocolate) 1
... ... ... ...
646 0.005066 (tomatoes, mineral water, eggs) 3
648 0.005066 (spaghetti, eggs, olive oil) 3
674 0.005066 (soup, mineral water, frozen vegetables) 3
680 0.005066 (grated cheese, mineral water, ground beef) 3
724 0.005066 (pancakes, spaghetti, olive oil) 3

725 rows × 3 columns

The code below extracts the association rules. Only rules with a confidence of at least 0.005 are kept.

association_rules_df = association_rules(frequent_itemset,
                                         metric='confidence',
                                         min_threshold=0.005,
                                        )
all_confidences = []
collective_strengths = []
cosine_similarities = []
for _, row in association_rules_df.iterrows():
    # All-confidence: support(A ∪ B) / max(support(A), support(B)).
    # Orient the label so the more frequent itemset appears as the antecedent.
    all_confidence_if = list(row['antecedents'])[0]
    all_confidence_then = list(row['consequents'])[0]
    if row['antecedent support'] <= row['consequent support']:
        all_confidence_if = list(row['consequents'])[0]
        all_confidence_then = list(row['antecedents'])[0]
    all_confidence = {all_confidence_if + ' => ' + all_confidence_then:
                      row['support'] / max(row['antecedent support'], row['consequent support'])}
    all_confidences.append(all_confidence)

    # Collective strength: (1 - v) / (1 - E[v]) * (E[v] / v), where v is the
    # violation rate (A or B occurs without the other) and E[v] its expectation
    # under independence.
    violation = row['antecedent support'] + row['consequent support'] - 2 * row['support']
    ex_violation = 1 - row['antecedent support'] * row['consequent support'] - \
                   (1 - row['antecedent support']) * (1 - row['consequent support'])
    collective_strength = (1 - violation) / (1 - ex_violation) * (ex_violation / violation)
    collective_strengths.append(collective_strength)

    # Cosine similarity: support(A ∪ B) / sqrt(support(A) * support(B)).
    cosine_similarity = row['support'] / np.sqrt(row['antecedent support'] * row['consequent support'])
    cosine_similarities.append(cosine_similarity)

association_rules_df['all-confidence'] = all_confidences
association_rules_df['collective strength'] = collective_strengths
association_rules_df['cosine similarity'] = cosine_similarities
association_rules_df.head()
antecedents consequents antecedent support consequent support support confidence lift leverage conviction all-confidence collective strength cosine similarity
0 (spaghetti) (mineral water) 0.174110 0.238368 0.059725 0.343032 1.439085 0.018223 1.159314 {'mineral water => spaghetti': 0.2505592841163... 1.185493 0.293172
1 (mineral water) (spaghetti) 0.238368 0.174110 0.059725 0.250559 1.439085 0.018223 1.102008 {'mineral water => spaghetti': 0.2505592841163... 1.185493 0.293172
2 (chocolate) (mineral water) 0.163845 0.238368 0.052660 0.321400 1.348332 0.013604 1.122357 {'mineral water => chocolate': 0.220917225950783} 1.135588 0.266463
3 (mineral water) (chocolate) 0.238368 0.163845 0.052660 0.220917 1.348332 0.013604 1.073256 {'mineral water => chocolate': 0.220917225950783} 1.135588 0.266463
4 (mineral water) (eggs) 0.238368 0.179709 0.050927 0.213647 1.188845 0.008090 1.043158 {'mineral water => eggs': 0.21364653243847875} 1.076638 0.246056
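
As a sanity check, the three extra measures can be recomputed by hand from the support columns. Using the numbers for the first rule, (spaghetti) => (mineral water), from the table above:

```python
import math

# Values for (spaghetti) => (mineral water), read off the table above
ant, cons, supp = 0.174110, 0.238368, 0.059725

# All-confidence: support / max of the two marginal supports
all_confidence = supp / max(ant, cons)                 # ≈ 0.2506

# Cosine similarity: support / sqrt(product of marginal supports)
cosine = supp / math.sqrt(ant * cons)                  # ≈ 0.2932

# Collective strength from the violation rate and its expectation
violation = ant + cons - 2 * supp
ex_violation = 1 - ant * cons - (1 - ant) * (1 - cons)
collective_strength = (1 - violation) / (1 - ex_violation) * (ex_violation / violation)  # ≈ 1.1855

print(round(all_confidence, 4), round(cosine, 4), round(collective_strength, 4))
```

These match the `all-confidence`, `cosine similarity`, and `collective strength` columns in the dataframe.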

Results

max_i = 4
for i, row in association_rules_df.iterrows():
    print("Rule: " + list(row['antecedents'])[0] + " => " + list(row['consequents'])[0])

    print("Support: " + str(round(row['support'],2)))

    print("Confidence: " + str(round(row['confidence'],2)))
    print("Lift: " + str(round(row['lift'],2)))
    print("=====================================")
    if i==max_i:
        break
Rule: spaghetti => mineral water
Support: 0.06
Confidence: 0.34
Lift: 1.44
=====================================
Rule: mineral water => spaghetti
Support: 0.06
Confidence: 0.25
Lift: 1.44
=====================================
Rule: chocolate => mineral water
Support: 0.05
Confidence: 0.32
Lift: 1.35
=====================================
Rule: mineral water => chocolate
Support: 0.05
Confidence: 0.22
Lift: 1.35
=====================================
Rule: mineral water => eggs
Support: 0.05
Confidence: 0.21
Lift: 1.19
=====================================

Looking at the second rule, customers who buy 'mineral water' also buy 'spaghetti' with a support of 0.06, a confidence of 0.25, and a lift of 1.44. Since the lift is greater than 1, 'mineral water' and 'spaghetti' can be interpreted as having a kind of positive correlation.
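
The lift values can be reproduced directly from the columns printed above, since lift(A => B) = confidence(A => B) / support(B). This also shows why both directions of a rule share the same lift:

```python
# mineral water => spaghetti: confidence / support(spaghetti)
print(round(0.250559 / 0.174110, 2))   # 1.44

# spaghetti => mineral water: confidence / support(mineral water)
print(round(0.343032 / 0.238368, 2))   # 1.44
```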

The code below visualizes the association rule analysis results.

support = association_rules_df['support']
confidence = association_rules_df['confidence']

h = 347
s = 1
v = 1
colors = [
    mcl.hsv_to_rgb((h/360, 0.2, v)),
    mcl.hsv_to_rgb((h/360, 0.55, v)),
    mcl.hsv_to_rgb((h/360, 1, v))
]
cmap = LinearSegmentedColormap.from_list('my_cmap',colors,gamma=2)

measures = ['lift', 'leverage', 'conviction',
            'all-confidence', 'collective strength', 'cosine similarity']

fig = plt.figure(figsize=(15,10))
fig.set_facecolor('white')
for i, measure in enumerate(measures):
    ax = fig.add_subplot(320+i+1)
    if measure != 'all-confidence':
        scatter = ax.scatter(support,confidence,c=association_rules_df[measure],cmap=cmap)
    else:
        scatter = ax.scatter(support,confidence,c=association_rules_df['all-confidence'].map(lambda x: [v for k,v in x.items()][0]),cmap=cmap)
    ax.set_xlabel('support')
    ax.set_ylabel('confidence')
    ax.set_title(measure)

    fig.colorbar(scatter,ax=ax)
fig.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

(Figure: six scatter plots of support vs. confidence, colored by lift, leverage, conviction, all-confidence, collective strength, and cosine similarity.)
