Getting Started
This guide will walk you through installing RuleMiner.jl, setting up multithreading for optimal performance, and running your first association rule mining analysis. By the end, you'll be ready to discover patterns in your own transaction data.
Installation
Julia packages can be added from the general registry using the Pkg
package manager in the Julia REPL or in a Julia script.
include Pkg
Pkg.add("RuleMiner")
Enabling Multithreading
The mining algorithms in RuleMiner utilize multithreading to process data concurrently, which can lead to dramatic performance improvements, especially for larger datasets or lower support thresholds. Multithreading must be enabled when launching julia to take advantage of these performance gains.
To enable multithreading when launching Julia from a terminal, use the -t
argument:
julia -t auto # Use all available CPU cores
julia -t 8 # Dedicate 8 cores specifically
It is recommended to use the auto
setting with RuleMiner.jl for best performance.
If you're using the official Julia extension for VS Code, the parameter passed into -t is called julia.NumThreads
in VS Code settings for all Julia instances created in that IDE. Other IDEs may have similar ways to configure the arguments passed into Julia on startup.
You can verify the number of threads available to Julia with:
using Base.Threads
println("Julia is using $(nthreads()) threads")
Creating a Txns
object
The first step in mining association rules in RuleMiner is to create a Txns
object. This stores the information about the transactions in a format that can be easily mined. These can be created either by reading a basket-format file or by converting an existing 1-hot encoded dataframe.
Generally, loading from files is more efficient and can handle larger datasets, as data is read directly into a sparse stroage format.
From a file
This example demonstrates reading the "mushrooms" dataset that is commonly used for data mining examples and benchmarking hosted by the University of Antwerp FIMI Project.
# Download the data to a local file
using Downloads
Downloads.download("http://fimi.uantwerpen.be/data/mushroom.dat", "mushrooms.txt")
# Read the data into a Txns object, specifying a single space as the delimiter
mushrooms = Txns("mushrooms.txt", ' ')
println(mushrooms)
Txns with 8124 transactions, 119 items, and 186852 non-zero elements
Index │ Items
───────┼────────────────────────────────────────────────────────────────────────
1 │ 1, 3, 9, 13, 23, 25, 34, 36, 38, 40, 52, 54, 59, 63, 67, 76, 85, 86,…
2 │ 3, 9, 23, 34, 36, 40, 52, 59, 63, 67, 76, 85, 86, 90, 93, 2, 14, 26,…
3 │ 9, 23, 34, 36, 52, 59, 63, 67, 76, 85, 86, 90, 93, 2, 39, 55, 99,…
4 │ 1, 3, 23, 25, 34, 36, 38, 52, 54, 59, 63, 67, 76, 85, 86, 90, 93, 98,…
5 │ 3, 9, 34, 40, 54, 59, 63, 67, 76, 85, 86, 90, 2, 39, 99, 114, 16, 24,…
6 │ 3, 23, 34, 36, 52, 59, 63, 67, 76, 85, 86, 90, 93, 98, 2, 14, 26, 39,…
7 │ 9, 23, 34, 36, 52, 59, 63, 67, 76, 85, 86, 90, 93, 98, 2, 26, 39, 55,…
8 │ 23, 34, 36, 52, 59, 63, 67, 76, 85, 86, 90, 93, 107, 2, 39, 55, 99,…
⋮ │ ⋮
8117 │ 1, 13, 34, 36, 38, 59, 76, 85, 86, 90, 10, 24, 53, 94, 110, 69, 66,…
8118 │ 1, 9, 34, 36, 38, 63, 76, 85, 86, 90, 24, 53, 94, 110, 116, 69, 17,…
8119 │ 1, 13, 34, 36, 38, 63, 76, 85, 86, 90, 10, 24, 53, 94, 110, 116, 69,…
8120 │ 9, 13, 36, 52, 59, 63, 85, 90, 93, 2, 39, 24, 28, 58, 112, 119, 7,…
8121 │ 3, 9, 13, 36, 52, 59, 63, 85, 90, 93, 2, 39, 24, 28, 110, 58, 119,…
8122 │ 9, 13, 36, 52, 59, 63, 85, 90, 93, 2, 39, 41, 24, 28, 6, 58, 112,…
8123 │ 1, 13, 34, 36, 38, 59, 67, 76, 85, 86, 90, 10, 24, 53, 94, 110, 66,…
8124 │ 3, 9, 13, 36, 52, 59, 63, 85, 90, 93, 2, 39, 24, 28, 58, 112, 119,…
From a DataFrame
Say the mushrooms data was in a 1-hot tabular format instead of a basket file format. By creating a DataFrame where the item labels are the column headers and the columns are boolean 1-hot encodings, these can also be easily converted into Txns objects.
The original 1-hot encoded dataframe:
# print first 10 rows of the mushroom df
println(first(mushroom_df,10))
10×119 DataFrame
Row │ 1 3 9 13 23 25 34 36 38 40 52 54 59 63 67 76 85 86 90 93 98 107 113 2 14 26 39 55 99 108 114 4 15 27 41 115 10 16 24 28 37 53 94 109 42 43 110 44 11 64 5 111 6 56 116 57 65 117 100 60 45 68 77 69 78 46 17 29 61 66 70 79 95 101 71 18 30 80 19 47 58 72 91 102 112 118 31 48 20 96 119 103 21 7 81 22 32 82 12 8 49 35 50 73 83 87 51 88 104 33 74 84 92 97 105 106 62 75 89
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 │ 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 │ 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 │ 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 │ 0 1 1 0 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 │ 0 1 0 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 │ 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 │ 0 0 0 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 │ 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 │ 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The Txns object:
mushrooms = Txns(mushroom_df)
println(mushrooms)
Txns with 8124 transactions, 119 items, and 186852 non-zero elements
Index │ Items
───────┼────────────────────────────────────────────────────────────────────────
1 │ 1, 3, 9, 13, 23, 25, 34, 36, 38, 40, 52, 54, 59, 63, 67, 76, 85, 86,…
2 │ 3, 9, 23, 34, 36, 40, 52, 59, 63, 67, 76, 85, 86, 90, 93, 2, 14, 26,…
3 │ 9, 23, 34, 36, 52, 59, 63, 67, 76, 85, 86, 90, 93, 2, 39, 55, 99,…
4 │ 1, 3, 23, 25, 34, 36, 38, 52, 54, 59, 63, 67, 76, 85, 86, 90, 93, 98,…
5 │ 3, 9, 34, 40, 54, 59, 63, 67, 76, 85, 86, 90, 2, 39, 99, 114, 16, 24,…
6 │ 3, 23, 34, 36, 52, 59, 63, 67, 76, 85, 86, 90, 93, 98, 2, 14, 26, 39,…
7 │ 9, 23, 34, 36, 52, 59, 63, 67, 76, 85, 86, 90, 93, 98, 2, 26, 39, 55,…
8 │ 23, 34, 36, 52, 59, 63, 67, 76, 85, 86, 90, 93, 107, 2, 39, 55, 99,…
⋮ │ ⋮
8117 │ 1, 13, 34, 36, 38, 59, 76, 85, 86, 90, 10, 24, 53, 94, 110, 69, 66,…
8118 │ 1, 9, 34, 36, 38, 63, 76, 85, 86, 90, 24, 53, 94, 110, 116, 69, 17,…
8119 │ 1, 13, 34, 36, 38, 63, 76, 85, 86, 90, 10, 24, 53, 94, 110, 116, 69,…
8120 │ 9, 13, 36, 52, 59, 63, 85, 90, 93, 2, 39, 24, 28, 58, 112, 119, 7,…
8121 │ 3, 9, 13, 36, 52, 59, 63, 85, 90, 93, 2, 39, 24, 28, 110, 58, 119,…
8122 │ 9, 13, 36, 52, 59, 63, 85, 90, 93, 2, 39, 41, 24, 28, 6, 58, 112,…
8123 │ 1, 13, 34, 36, 38, 59, 67, 76, 85, 86, 90, 10, 24, 53, 94, 110, 66,…
8124 │ 3, 9, 13, 36, 52, 59, 63, 85, 90, 93, 2, 39, 24, 28, 58, 112, 119,…
Mining association Rules
Using the created Txns
object, association rules can be mined with the apriori
function. These examples use the same toy data from the dataframe example above.
Example 1: Relative support
Only minimum support is required by the apriori function, but a minimum confidence can also be specified. In this example, apriori is being used to search for rules with a minimum support of 80% and a minimum confidence of 90%.
rules = apriori(mushrooms, 0.8, 0.9)
println(first(rules,15))
15×8 DataFrame
Row │ LHS RHS Support Confidence Coverage Lift N Length
│ Array… String Float64 Float64 Float64 Float64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────────────────
1 │ String[] 34 0.974151 0.974151 1.0 1.0 7914 1
2 │ String[] 85 1.0 1.0 1.0 1.0 8124 1
3 │ String[] 86 0.975382 0.975382 1.0 1.0 7924 1
4 │ String[] 90 0.921713 0.921713 1.0 1.0 7488 1
5 │ ["34"] 90 0.89808 0.921911 0.974151 1.00021 7296 2
6 │ ["90"] 34 0.89808 0.974359 0.921713 1.00021 7296 2
7 │ ["36"] 34 0.812654 0.969172 0.838503 0.994889 6602 2
8 │ ["85"] 90 0.921713 0.921713 1.0 1.0 7488 2
9 │ ["90"] 85 0.921713 1.0 0.921713 1.0 7488 2
10 │ ["86"] 90 0.897095 0.919738 0.975382 0.997856 7288 2
11 │ ["90"] 86 0.897095 0.973291 0.921713 0.997856 7288 2
12 │ ["36"] 85 0.838503 1.0 0.838503 1.0 6812 2
13 │ ["36"] 86 0.81487 0.971814 0.838503 0.996343 6620 2
14 │ ["85"] 34 0.974151 0.974151 1.0 1.0 7914 2
15 │ ["34"] 85 0.974151 1.0 0.974151 1.0 7914 2
Example 2: Absolute Support
Specifying an integer value for min_support
is used to indicate absolute support, rather than relative support.
Here, apriori is used to search for rules which occur at least 6500 transactions in the data. The results are identical to the above example because the min_support threshold of 6500 (out of 8124) is approximately equal to the 80% threshhold in the last example.
rules = apriori(mushrooms, 6500, 0.9)
println(first(rules,15))
15×8 DataFrame
Row │ LHS RHS Support Confidence Coverage Lift N Length
│ Array… String Float64 Float64 Float64 Float64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────────────────
1 │ String[] 34 0.974151 0.974151 1.0 1.0 7914 1
2 │ String[] 85 1.0 1.0 1.0 1.0 8124 1
3 │ String[] 86 0.975382 0.975382 1.0 1.0 7924 1
4 │ String[] 90 0.921713 0.921713 1.0 1.0 7488 1
5 │ ["34"] 90 0.89808 0.921911 0.974151 1.00021 7296 2
6 │ ["90"] 34 0.89808 0.974359 0.921713 1.00021 7296 2
7 │ ["36"] 34 0.812654 0.969172 0.838503 0.994889 6602 2
8 │ ["85"] 90 0.921713 0.921713 1.0 1.0 7488 2
9 │ ["90"] 85 0.921713 1.0 0.921713 1.0 7488 2
10 │ ["86"] 90 0.897095 0.919738 0.975382 0.997856 7288 2
11 │ ["90"] 86 0.897095 0.973291 0.921713 0.997856 7288 2
12 │ ["36"] 85 0.838503 1.0 0.838503 1.0 6812 2
13 │ ["36"] 86 0.81487 0.971814 0.838503 0.996343 6620 2
14 │ ["85"] 34 0.974151 0.974151 1.0 1.0 7914 2
15 │ ["34"] 85 0.974151 1.0 0.974151 1.0 7914 2
Results analysis
Even from this very preliminary analysis, an interesting pattern emerges.
Item 85 by its self has a support of 1.0, meaning it appears in every transaction in the dataset. This has some interesting implications, like any rule where 85 is the RHS will have a confidence value of 1.0.
The reason for this is because confidence calculates the percentage of transactions in the LHS set that also apear in the LHS ∪ RHS set. If the RHS item appears in every transaction then all LHS transactions are guaranteed to have the RHS item
Next Steps
Hopefully, you feel prepared to used RuleMiner load data and to mine association rules!
To find good sample data sets to try out RuleMiner, check out the FIMI repository.
If you want to dig deeper into the other capabilities of RuleMiner, check out the RuleMiner API Reference.