Association Rule Mining

Association rule mining is a fundamental data mining technique for discovering relationships between items in transaction data. The resulting rules reveal co-occurrence patterns that support market basket analysis, recommendation systems, and business decision-making. The technique is also applied in computational biology to identify gene associations and in population health to discover disease co-occurrence patterns.

Association Rules

Association rules represent relationships between sets of items in transaction data. A rule takes the form "if A, then B" (written as A → B), where A is the antecedent (left-hand side) and B is the consequent (right-hand side). For example, in retail analysis, a rule {bread, butter} → {milk} suggests that customers who purchase bread and butter are likely to purchase milk as well.

Formal Definition

Let:

  • $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items in the dataset

  • $D = \{T_1, T_2, \ldots, T_m\}$ be the set of all transactions, where each $T_j \subseteq I$

  • $A, B \subseteq I$ and $A \cap B = \emptyset$

An association rule is an implication of the form $A \rightarrow B$, where:

  • A is called the antecedent (or left-hand side)

  • B is called the consequent (or right-hand side)

Mining Process

Let $\sigma_{\min}$ and $\chi_{\min}$ be user-defined minimum thresholds for support and confidence, respectively.

Then, the set of all valid association rules (AR) can be defined as:

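$$AR = \{(A \rightarrow B) \mid A, B \subseteq I \,\land\, A \cap B = \emptyset \,\land\, \sigma(A \cup B) \geq \sigma_{\min} \,\land\, \chi(A \rightarrow B) \geq \chi_{\min}\}$$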

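The process of association rule mining involves three steps:

  1. Find all frequent itemsets $F = \{Z \subseteq I \mid \sigma(Z) \geq \sigma_{\min}\}$

  2. For each frequent itemset $Z \in F$, generate all non-empty proper subsets $A \subset Z$

  3. For each such subset $A$, form the rule $A \rightarrow (Z \setminus A)$ if $\chi(A \rightarrow (Z \setminus A)) \geq \chi_{\min}$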
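
The three steps above can be sketched as a brute-force Python routine. This is only an illustration under stated assumptions: the toy transactions, the function names (`support`, `frequent_itemsets`, `generate_rules`), and the threshold values are hypothetical, and the exhaustive enumeration of candidate itemsets is exponential in the number of items, so it is not how a practical miner such as Apriori or FP-Growth would search.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: keep every itemset whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            z = frozenset(candidate)
            s = support(z, transactions)
            if s >= min_support:
                frequent[z] = s
    return frequent


def generate_rules(transactions, min_support, min_confidence):
    """Steps 2 and 3: split each frequent itemset Z into A -> (Z \\ A)."""
    frequent = frequent_itemsets(transactions, min_support)
    rules = []
    for z, sup_z in frequent.items():
        # Sizes 1 .. |Z|-1 give the non-empty proper subsets of Z.
        for size in range(1, len(z)):
            for subset in combinations(sorted(z), size):
                a = frozenset(subset)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                conf = sup_z / frequent[a]
                if conf >= min_confidence:
                    rules.append((a, z - a, sup_z, conf))
    return rules


# Hypothetical toy dataset: one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

rules = generate_rules(transactions, min_support=0.3, min_confidence=0.7)
for a, b, sup, conf in rules:
    print(sorted(a), "->", sorted(b), f"support={sup:.2f} confidence={conf:.2f}")
```

On this toy data only the six single-item rules between bread, butter, and milk pass both thresholds; rules derived from the three-item set fall below the confidence threshold.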

Rule Evaluation Metrics

The main challenge in association rule mining is efficiently discovering meaningful rules from large datasets while filtering out weak or uninteresting patterns using various interestingness measures.

To evaluate the strength and significance of association rules, several key metrics are used:

Support

Support measures how frequently the itemset (A ∪ B) appears in the dataset.

$$\sigma(A \cup B) = \frac{|\{T_j \in D : A \cup B \subseteq T_j\}|}{|D|}$$
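
As a minimal sketch, support can be computed by counting the transactions that contain the whole itemset. The `support` function name and the toy transactions below are assumptions made for illustration, with each transaction modelled as a Python set.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    matching = sum(1 for t in transactions if itemset <= t)
    return matching / len(transactions)


# Hypothetical toy dataset: one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

# Support of the rule {bread, butter} -> {milk} is the support of the union.
rule_support = support({"bread", "butter"} | {"milk"}, transactions)
print(rule_support)  # 2 of the 5 transactions contain all three items
```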

Confidence

Confidence measures how often the rule holds. It estimates the conditional probability that a transaction contains B given that it contains A.

$$\chi(A \rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A)}$$
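
A small sketch of this ratio in Python, working on raw transaction counts so that the shared denominator |D| cancels. The helper names (`count`, `confidence`) and the toy data are hypothetical, and returning 0.0 when the antecedent never occurs is an assumption of this sketch, since the ratio is undefined in that case.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Share of transactions containing A that also contain B."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    n_a = count(a, transactions)
    if n_a == 0:
        return 0.0  # assumption: treat the undefined case as zero
    return count(ab, transactions) / n_a


# Hypothetical toy dataset: one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

# Three transactions contain bread and butter; two of those also contain milk.
rule_confidence = confidence({"bread", "butter"}, {"milk"}, transactions)
print(round(rule_confidence, 3))
```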

Coverage

Coverage (sometimes called support of A) measures how often A appears in the dataset, regardless of B.

$$\gamma(A \rightarrow B) = \sigma(A) = \frac{|\{T_j \in D : A \subseteq T_j\}|}{|D|}$$

Lift

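Lift measures how much more likely B is to appear in transactions that contain A than in the dataset overall, that is, relative to what would be expected if A and B were independent. A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.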

$$L(A \rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A)\,\sigma(B)}$$
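
The following sketch computes lift from the three supports involved. The function names and the toy transactions are assumptions for illustration, and the code assumes both A and B occur at least once so that the denominator is non-zero.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def lift(antecedent, consequent, transactions):
    """Support of A and B together, divided by the product of their supports."""
    a = frozenset(antecedent)
    b = frozenset(consequent)
    # Assumes support(a) > 0 and support(b) > 0.
    return support(a | b, transactions) / (
        support(a, transactions) * support(b, transactions)
    )


# Hypothetical toy dataset: one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

# Supports here: together 2/5, {bread, butter} 3/5, {milk} 4/5,
# so the lift is (2/5) / ((3/5) * (4/5)) = 5/6.
rule_lift = lift({"bread", "butter"}, {"milk"}, transactions)
print(round(rule_lift, 3))
```

In this toy dataset the lift is below 1: milk is slightly less common among transactions containing bread and butter (2 of 3) than in the dataset as a whole (4 of 5), even though the rule's confidence looks fairly high in isolation.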