-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
137 lines (104 loc) · 4.94 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
title: "README"
output: github_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 11, fig.height = 9)
```
\#`Spark` \#`R` \#`iGraph`
## Background
[Instacart](https://www.instacart.com/), an American company who provides groceries delivery service releases their shopping dataset publicly. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. Over here, we will simply do a data exploratory using association rule and graph techniques.
## Import Data to Spark
To facilitate the data processing (~ 500MB), we run a local Spark cluster on our machine through `SparklyR`.
```{r, warning=FALSE, message=FALSE}
library(sparklyr)
library(dplyr)
library(readr)
library(purrr)
library(igraph)
library(visNetwork)
# Spark properties
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$`spark.memory.fraction` <- 0.9
sc <- spark_connect(master = "local", version = "2.2.0", config = conf)
```
## Frequent Pattern Mining
With Spark activated, we read our data into a Spark dataframe. Spark has built-in [FP Growth](https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm) algorithm implementation. All we need to specify hyperparameters mininum confidence and mininum support. For more about the algorithm, see [here](https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm).
Note: You need at least SparklyR **version 2.2.0** for FP Growth algorithm implementation.
```{r, warning=FALSE}
# this is our data from Instacart
orders <- spark_read_csv(sc, "orders", "instacart_2017_05_01/order_products__prior.csv")
# group items by order (purchase history)
orders_wide <- orders %>%
group_by(order_id) %>%
summarise(items = collect_list(product_id))
```
```{r, cache=TRUE, warning=FALSE}
# invoke FP Growth implementation
fpg.fit <- ml_fpgrowth(orders_wide, items_col = "items", min_confidence = .015, min_support = .005)
rules <- ml_association_rules(fpg.fit) %>% collect()
```
This is basically it. We have a list showing which item is associated with what, as in `{antecedent}--{consequent}`.
```{r, warning=FALSE}
# collect our rules into a data frame
asso <-
tibble(
antecedent = unlist(rules$antecedent),
consequent = unlist(rules$consequent),
confidence = rules$confidence
)
head(asso)
# remember to close Spark connection
spark_disconnect_all()
```
## Visualizing Result
Fundamentally, what we have done so far is to create a network of 'food', `{A}--{B}` and `{B}--{C}` and etc. We will proceed to visualize it using `igraph` library. Please note that the width of edge or relationship between nodes signifies the confidence. With food {A} usually associates with food {B}, the other way around may not be true. If someone bought a salmon, he or she may also buy a lemon. For someone who buys lemon, to buy salmon as well may not be necessary.
```{r, warning=FALSE, message=FALSE}
# get product names
products <- read_csv("instacart_2017_05_01/products.csv")
# bind to nodes
nodes <- data.frame(id = unique(asso$antecedent, asso$consequent)) %>%
distinct() %>%
left_join(products, by = c("id" = "product_id")) %>%
select(id, label = product_name)
edges <- asso %>% mutate(weight = confidence * 10)
df.g <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)
plot(
df.g,
edge.arrow.size = .5,
edge.curved = .3,
edge.width = edges$weight,
vertex.color = "lightblue",
vertex.label.color = "darkblue",
vertex.label.cex = .7,
edge.label.cex = .7
)
```
Another way of visualizing the network is through interactive plotting library `VisNetwork`. The code is shown below but the result won't be displayed in static HTML here.
```{r, eval=FALSE}
nodes <- data.frame(id = unique(asso$antecedent, asso$consequent)) %>%
distinct() %>%
left_join(products, by = c("id" = "product_id")) %>%
select(id, label = product_name)
edges <- asso %>%
mutate(width = confidence * 20,
smooth = TRUE, arrows = "to",
label = format(confidence, digits = 2)) %>%
rename(from = antecedent, to = consequent)
visNetwork(nodes, edges, height = "800px", width = "100%")
```
## Cluster Detection
With a graph in place, it is easy to proceed with more interesting discovery. With merely two lines of code, we can cluster items using graph community detection technique. Awesome.
```{r, warning=FALSE}
# community structure detection requires no directed graph
df.g.sym <- as.undirected(df.g, mode = "collapse", edge.attr.comb = list(weight = "sum", "ignore"))
ceb <- cluster_edge_betweenness(df.g.sym)
#dendPlot(ceb, mode = "hclust")
plot(ceb, df.g.sym)
```
## Data Source
“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on 2019-05-22.