Masoud Rabiei        About    Archive    Feed

Analyzing Public Recipe Data in R

In this post, we are going to use a publicly available recipe dataset to answer one simple question: Cuisine from which countries are the most similar? There are of course various ways to answer this question but here I’ll be using a rather simplistic view and focus only on the ingredients used in each recipe. Let’s get started by loading and exploring our dataset.

The Dataset

The dataset we are working with is a collection recipes scraped from epicurious.com where each record contains the type of the cuisine (e.g. American, French, Chinese) and the ingredients that were used in the recipe.

The data is stored in flat text files where each line is one record. Here’s the first two lines of the raw data file:

[1] "Vietnamese\tvinegar\tcilantro\tmint\tolive_oil\tcayenne\tfish\tlime_juice\tshrimp\tlettuce\tcarrot\tgarlic\tbasil\tcucumber\trice\tseed\tshiitake"
[2] "Vietnamese\tonion\tcayenne\tfish\tblack_pepper\tseed\tgarlic"

Before we can use this data, we need to read the data into R and convert it to a data frame. First, we combine all the recipes from the same cuisine type together to create a food corpus.

library(dplyr)
# read each line
dat <- readLines("../data/epic_recipes.txt") %>% strsplit(split = "\t")
# extract cuisine and ingredients
cuisine <- sapply(dat, function(x) x[1])
ingredients <- sapply(dat, function(x) paste(x[-1],collapse = " ")) 
# for each cuisine, join together the ingredients from all the recipes
cuisine_names <- cuisine %>% unique
names(cuisine_names) <- cuisine_names
food <- sapply(cuisine_names, function(x){
  ingredients[cuisine == x] %>% paste(collapse = " ")
  })

The next step is to create a feature matrix from this corpus. One simple way to create features is to represent each cuisine by the number of times each ingredient has been used in it. This is essentially the standard bag-of-words representation often used in text analytics.

library(tm)
# convert text corpus to a document term matrix
food_dtm <- VectorSource(food) %>% VCorpus %>% DocumentTermMatrix
# create a data frame from the document term matrix
food_df <- data.frame(cuisine = cuisine_names, stringsAsFactors=FALSE, row.names=NULL) %>% 
  cbind(as.matrix(food_dtm)) %>% tbl_df

Here’s what the data looks like:

Source: local data frame [6 x 7]

                cuisine almond anise anise_seed apple apple_brandy apricot
1            Vietnamese      0     0          0     0            0       0
2                Indian      8     0          2     6            0       6
3    Spanish_Portuguese     27     1          3     3            0       1
4                Jewish     31     0          1    28            0      21
5                French     65    12          6    39           10      16
6 Central_SouthAmerican     13     0          5     0            0       0

The data is currently in what is known as wide format (i.e. each feature is represented as one column). To make our lives easier later on (when plotting the data), let’s create a tall version of the data as well:

library(tidyr)
# "wide" --> "tall"
food_tall <- food_df %>% gather("ingredient", "count", -cuisine)

What are the most popular ingredients in each type of cuisine? We can use the feature matrix that we just created to answer this question.

library(ggplot2)

N_top <- 10
# calculate the ranking of each ingredient for each cuisine type
food_tall <- food_tall %>% group_by(cuisine) %>% 
  mutate(rank = min_rank(desc(count)))

food_tall %>%
  filter(rank<=N_top) %>% # top N ingredients
  ggplot +
  geom_point(aes(ingredient, cuisine, color = rank, size = rank)) +
  scale_color_gradient(name = "Popularity", breaks=1:N_top,
                       low="red", high="black") +
  scale_size(name = "Popularity", breaks=1:N_top, range=c(6,1)) + 
  guides(color = guide_legend()) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5, size = 11))

center Looking across columns, we see that onion, garlic, wheat and butter are among the most popular ingredients in almost all types of cuisine. Looking across rows, we see that Parmesan cheese, for example, is a popular ingredient for Italian cuisine only and not the others.

We can also plot the number of unique ingredients found in each cuisine:

recipe_count <- table(cuisine) %>% as.data.frame.table(responseName = "recipe_count")

# Number of unique ingredients
food_tall %>% 
  summarise(unique_ingredients = sum(count>0) ) %>% 
  arrange(desc(unique_ingredients)) %>% 
  mutate(cuisine = factor(cuisine,
                          levels = cuisine[order(rev(unique_ingredients))])) %>% 
  ggplot + geom_bar(aes(cuisine, unique_ingredients),
                    stat = "identity", fill = "gold",
                    color = "brown", width = .7) +
  ylab("Number of unique ingredients") +
  theme(axis.text.x = element_text(angle=90, vjust = .5, hjust = 1))

center American, French and Asian foods have the highest number of unique ingredients. We may be tempted to use the number of unique ingredients as a measure of complexity for each cuisine but that is not necessarily true. Let’s take a look at the number of recipes we have for each cuisine and compare that against the number of unique ingredients:

food_tall %>% left_join(recipe_count) %>% 
  mutate(unique_ingredients = sum(count>0) ) %>% 
  arrange(desc(unique_ingredients)) %>% 
  qplot(recipe_count, unique_ingredients, data = .,
        size = I(3), color = I("brown")) +
  scale_x_log10() +
  xlab("Number of recipes in the dataset") +
  ylab("Number of unique ingredients")

center There is a high correlation between recipe counts and the number of unique ingredients. So perhaps the reason some of the cuisines have fewer unique ingredients is because we just haven’t seen collected enough recipes yet.

Hierarchical Clustering

We now have everything we need to try and answer the question that motivated our analysis. We want to know which cuisine are the most similar in terms of their ingredients. We can use a clustering algorithm and divide our samples into different clusters in an unsupervised fashion. In a problem like this where we don’t have a predefined number of clusters in mind, hierarchical clustering is a good option. The algorithm starts by assigning each sample to its own cluster and then proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.

hc <- data.frame(food_df, row.names = "cuisine")

# without standardizing
hc %>% dist %>% hclust %>% plot(hang= -1, xlab = "", main="Clustering without standardization", sub="")

center

One way to visualize the clustering results is using a dendogram (we can also use graphs as we will see next). The height of the dendogram is the distance (dissimilarity) between samples or joined clusters. By default, the dist() function in R uses a Euclidean distance measure which gets dominated by the variables with highest variance. To deal with this, we need to standardize our data prior to calculating the dissimilarity matrix. Intuitively, each row of the standardized data will contain numbers between 0 and 1 that capture the contribution of each ingredient to a given cuisine.

library(dendextend) # pretty dendograms

mar_orig <- par("mar")
par(mar=c(11,5,2,2)) # set the margines

# need to standardize rows prior to calculating dissimilarity
norm_rows <- function(mat){
  mat / rowSums(mat)
  }

# distance matrix
d <- hc %>% 
  norm_rows %>%
  dist

food_dend <- hclust(d) %>% as.dendrogram

num_clusters <- 4 # highlight four clusters
food_dend %>% set("branches_k_color", k = num_clusters) %>% plot(ylab="Distance")

# add bounding boxes
food_dend %>% rect.dendrogram(k=num_clusters, border = 8, lty = 5, lwd = 2, lower_rect=0)

center

Now, this is much better! It looks nicer and it makes much more sense. All the cuisines that we know as being similar are nicely grouped together in this dendogram. The hierarchy and how smaller clusters are joined to make bigger ones is also interesting. For example, we see that Moroccan and African foods are (obviously) very similar and together they form a cluster that is similar to Middle Eastern food.

It is also interesting how French cuisine turns out to be the most similar to American cuisine. Sounds surprising? Remember that our analysis is based only on the ingredients used in the recipes and nothing else. We can examine the raw counts for the top ten ingredients to confirm:

food_tall %>% filter(cuisine %in% c("American", "French")) %>%
  group_by(cuisine) %>% mutate(rank = min_rank(desc(count))) %>% 
  filter(rank <= 10) %>% 
  ungroup %>% spread(cuisine, count, fill="—") %>% 
  arrange(rank) %>% kable
ingredient rank American French
butter 1 2219 488
egg 2 1738 433
wheat 3 1574 355
olive_oil 4 1512 307
garlic 5 1267 272
cream 6 269
onion 6 1254
cream 7 1238
onion 7 236
black_pepper 8 1154 205
milk 9 942 193
parsley 9 193
vegetable_oil 10 926

The top ten ingredients in French and American cuisine are indeed very similar.

Using D3 forced network to visualize the clusters

We can also use a graph to visualize the pairwise distance matrix that we calculated and used as input for clustering.

We are going to use a force-directed network graph in D3 for this visualization. Each node in the graph represents one cuisine and the length of the edges connecting these nodes is proportional to the distance between nodes. To better “uncover” the hidden clusters in the data, I pruned the graph by removing the edges between nodes that are sufficiently far from each other (only keeping the top 40% of edges).

The graph representation makes it easier to visually detect clusters in the data. For comparison, the nodes of the graph are colored based on the four clusters that we previously identified using hierarchical clustering.

library(d3Network)
nodes <- as.matrix(d) %>% 
  create_nodes(Group = cutree(food_dend, k = num_clusters))

links <-  as.matrix(d) %>% 
  create_links

# prune the graph. Remove 60% of the links based on distance.
links <- links %>% filter(Value < quantile(d,.4))

d3ForceNetwork(Links = links, Nodes = nodes, Source = "Source",
               Target = "Target", Value = "Value", NodeID = "Name",
               Group = "Group",
               linkDistance = "function(d) { return 1000 * d.value + 10; }",
               linkWidth = "function(d) { return 1/(10*d.value); }",
               width = 650, height = 500,
               charge = -250,
               opacity = 1, 
               standAlone = FALSE,
               parentElement = "div#cuisine-forcenet")

Feel free to checkout the source for this post to see all the code for preparing the data for this D3 visualization.

Recent posts