Association Rule Mining

Association rule mining is a fundamental data mining technique for discovering relationships between items in transaction data. The resulting rules expose co-occurrence patterns that inform market basket analysis, recommendation systems, and business decision-making. The technique is also applied in computational biology to identify gene associations and in population health to discover disease co-occurrence patterns.

Association Rules

Association rules represent relationships between sets of items in transaction data. A rule takes the form "if A, then B" (written as A → B), where A is the antecedent (left-hand side) and B is the consequent (right-hand side). For example, in retail analysis, a rule {bread, butter} → {milk} suggests that customers who purchase bread and butter are likely to purchase milk as well.

Formal Definition

Let:

$I = \{i_{1}, i_{2}, \ldots, i_{n}\}$ be the set of all items in the dataset
$D = \{T_{1}, T_{2}, \ldots, T_{m}\}$ be the set of all transactions, where each $T_{j} \subseteq I$
$A, B \subseteq I$ and $A \cap B = \emptyset$

An association rule is an implication of the form $A \Rightarrow B$, where:

$A$ is called the antecedent (or left-hand side)
$B$ is called the consequent (or right-hand side)

Mining Process

Let $\sigma_{\min}$ and $\chi_{\min}$ be user-defined minimum thresholds for support ($\sigma$) and confidence ($\chi$), respectively; both metrics are defined under Rule Evaluation Metrics.

Then, the set of all valid association rules (AR) can be defined as:

$$AR = \{(A \Rightarrow B) \mid A, B \subseteq I \land A \cap B = \emptyset \land \sigma(A \Rightarrow B) \geq \sigma_{\min} \land \chi(A \Rightarrow B) \geq \chi_{\min}\}$$

The process of association rule mining involves:

Finding all frequent itemsets $F = \{Z \subseteq I \mid \sigma(Z) \geq \sigma_{\min}\}$
For each frequent itemset $Z \in F$, generate all non-empty proper subsets $A \subset Z$
For each such subset $A$, form the rule $A \Rightarrow (Z \setminus A)$ if $\chi(A \Rightarrow (Z \setminus A)) \geq \chi_{\min}$
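The three steps above can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not an efficient miner: it enumerates every candidate itemset, which is exponential in the number of items, whereas algorithms such as Apriori or FP-Growth prune that search. The toy transactions and the function names (`support`, `frequent_itemsets`, `generate_rules`) are invented for this example, and exact fractions are used so that threshold comparisons are free of rounding error.

```python
from fractions import Fraction
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def frequent_itemsets(transactions, min_support):
    """Step 1: enumerate every non-empty itemset and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            z = frozenset(candidate)
            s = support(z, transactions)
            if s >= min_support:
                frequent[z] = s
    return frequent


def generate_rules(frequent, min_confidence):
    """Steps 2 and 3: split each frequent itemset Z into A => (Z minus A)."""
    rules = []
    for z, support_z in frequent.items():
        for size in range(1, len(z)):  # non-empty proper subsets only
            for antecedent in combinations(sorted(z), size):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already stored in `frequent`.
                confidence = support_z / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, z - a, support_z, confidence))
    return rules


# Hypothetical toy dataset: five market baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

frequent = frequent_itemsets(transactions, min_support=Fraction(2, 5))
rules = generate_rules(frequent, min_confidence=Fraction(3, 4))
for a, b, s, c in rules:
    print(sorted(a), "=>", sorted(b), "support", s, "confidence", c)
```

With these thresholds all seven non-empty itemsets are frequent, but only the six rules between single items (each with confidence 3/4) pass the confidence filter.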

Rule Evaluation Metrics

The main challenge in association rule mining is efficiently discovering meaningful rules from large datasets while filtering out weak or uninteresting patterns using various interestingness measures.

To evaluate the strength and significance of association rules, several key metrics are used:

Support

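Support measures how frequently the itemset (A ∪ B) appears in the dataset: it is the fraction of transactions that contain every item of both A and B. The support of a rule is therefore the support of the combined itemset.

$$\sigma(A \Rightarrow B) = \sigma(A \cup B) = \frac{|\{T_{j} \in D : A \cup B \subseteq T_{j}\}|}{|D|}$$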
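As a minimal sketch of this formula, the snippet below counts the transactions that contain A ∪ B and divides by the dataset size. The five baskets and the helper name `support` are assumptions made for illustration only.

```python
from fractions import Fraction

# Hypothetical toy dataset: five market baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


antecedent = {"bread", "butter"}
consequent = {"milk"}

# Support of the rule is the support of the union of both sides.
rule_support = support(antecedent | consequent, transactions)
print(rule_support)  # 2/5
```

Two of the five baskets contain bread, butter, and milk together, so the rule {bread, butter} → {milk} has support 2/5.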

Confidence

Confidence measures how often the rule holds among the transactions it applies to. It estimates the conditional probability of finding B in a transaction given that the transaction contains A.

$$\chi(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A)}$$
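The ratio can be computed directly from two support values. The sketch below assumes a small made-up dataset and hypothetical helper names (`support`, `confidence`); it is meant only to show the arithmetic.

```python
from fractions import Fraction

# Hypothetical toy dataset: five market baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A); assumes A occurs at least once."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


rule_confidence = confidence({"bread", "butter"}, {"milk"}, transactions)
print(rule_confidence)  # 2/3
```

Three baskets contain bread and butter, and two of those also contain milk, which gives a confidence of 2/3.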

Coverage

Coverage (sometimes called support of A) measures how often A appears in the dataset, regardless of B.

$$\gamma(A \Rightarrow B) = \sigma(A) = \frac{|\{T_{j} \in D : A \subseteq T_{j}\}|}{|D|}$$
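Coverage also links the other two metrics. The following identity is not stated in the text above, but it follows directly from the definitions of support and confidence, assuming $\sigma(A) > 0$:

$$\sigma(A \Rightarrow B) = \sigma(A \cup B) = \sigma(A) \cdot \frac{\sigma(A \cup B)}{\sigma(A)} = \gamma(A \Rightarrow B) \cdot \chi(A \Rightarrow B)$$

A rule with high confidence but low coverage therefore still has low support: it is reliable, but it applies to few transactions.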

Lift

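Lift measures how much more likely B is to be present when A is present, compared to B's overall frequency in the dataset. It indicates the strength of association beyond random co-occurrence: a lift of 1 means A and B occur independently, a lift above 1 indicates a positive association, and a lift below 1 indicates a negative one.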

$$L(A \Rightarrow B) = \frac{\sigma(A \cup B)}{\sigma(A) \cdot \sigma(B)}$$
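A short sketch of the lift formula follows, again on a made-up dataset with hypothetical helper names (`support`, `lift`). It is chosen to show that a rule can look reasonably confident and still have a lift below 1.

```python
from fractions import Fraction

# Hypothetical toy dataset: five market baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def lift(antecedent, consequent, transactions):
    """support(A ∪ B) / (support(A) * support(B)); assumes both occur."""
    joint = support(antecedent | consequent, transactions)
    return joint / (support(antecedent, transactions)
                    * support(consequent, transactions))


rule_lift = lift({"bread", "butter"}, {"milk"}, transactions)
print(rule_lift)  # 5/6
```

Here milk appears in 4/5 of all baskets but in only 2/3 of the baskets that contain bread and butter, so the lift is 5/6: slightly below 1, despite a confidence of 2/3.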
