Introduction

I’ve been working on my R package, tyecon, for some time now. I think it can be a useful tool for day-to-day data analysis tasks. The idea is very simple: more higher-order macros. There’s no reason to stop at the handful of existing ones, like the magrittr pipes or the dplyr syntax. We have a language that supports lazy evaluation and first-class functions; that’s all we need to make programming life easier.
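To make that concrete, here is a minimal sketch of the kind of macro I mean: a block pipe that threads a value through a sequence of calls as their first argument. This is not tyecon’s implementation, and the operator name %thread>% is made up; it only illustrates how far lazy evaluation (substitute) and first-class functions (Reduce) take you:

`%thread>%` <- function(lhs, body) {
  env <- parent.frame()
  # substitute() captures the braced block unevaluated (lazy evaluation)
  exprs <- as.list(substitute(body))[-1]
  Reduce(
    function(acc, expr) {
      # splice the accumulated value in as the call's first argument
      eval(as.call(append(as.list(expr), list(acc), after = 1)), env)
    },
    exprs,
    init = lhs
  )
}

mtcars %thread>% {
  subset(cyl == 6)
  head(3)
}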

Data analysis tasks are quite idiosyncratic, and it’s not always clear what the best approach to writing less code is. Nevertheless, I think searching for such possibilities is a legitimate endeavour, hence tyecon.

In this first part of the “Journey through tyecon” series, I explore the tags found in the Food.com dataset. I won’t point out every place tyecon functionality is used; it should be obvious from the code.

Setup and Data Import

library(tidyverse)
library(tidyselect)
library(magrittr)
library(glue)
library(vroom)
library(rlang)
library(tyecon)

knitr::opts_chunk$set(fig.path = "", dev = "svg")
theme_set(theme_light())
set.seed(123)
recipes <- vroom("~/Workspace/foodrecipes/RAW_recipes.csv")
interacts <- vroom("~/Workspace/foodrecipes/RAW_interactions.csv")

Tags

tags <- recipes %->% {
  select(id, tags)
  mutate(tag = str_replace_all(tags, "[\\[\\]\\']", ""))
  select(-tags)
  mutate(tag = ifelse(tag == "", NA, tag))
  separate_rows(tag, sep = ", ")
  # convert to conventional R names
  mutate(tag = make.names(tag))
}

A density plot of the number of tags per recipe:

tags %->% {
  group_by(id)
  summarise(n_tags = n())
} %$% {
  quantprob <- 0.9
  quant <- quantile(n_tags, quantprob)
  ggplot(., aes(n_tags)) +
    geom_density() +
    geom_vline(xintercept = quant) +
    annotate("label",
      quant, 0, label = glue("Quantile probability {quantprob}:\n {quant}")
    ) +
    xlab("Number of tags") +
    ylab("Density")
}

[Plot: density of the number of tags per recipe, with the 0.9 quantile marked]

Therefore, we would still need quite a few tags for a basic exploration of our data. Let us first create the tag dummies for each submitted recipe:

tags_dummies <- tags %->% {
  add_count(tag)
  arrange(desc(n))
  select(-n)
  mutate(tagged = 1)
  pivot_wider(names_from = tag, values_from = tagged, values_fill = 0)
  # keep id as metadata
  ids <- id
  select(-id)
  set_attr("ids", ids)
}

The columns of the above data frame are thus sorted in decreasing order of tag frequency.

The strategy is to measure how much of the data a given list of tags covers, then append one more tag to the list and recompute the coverage rate. Starting from the most popular tags, the coverage rate quickly reaches 1, so the more interesting question is whether a less popular subset still covers all of our observations.
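The computation reduces to cumulative sums: with the dummy columns sorted by popularity, a recipe is covered by the first k tags exactly when the running sum of its first k dummies is nonzero. A toy illustration, with three made-up tags:

toy <- tibble(a = c(1, 0, 0), b = c(0, 1, 0), c = c(0, 0, 1))
# running column sums: column k now means "tagged by any of the first k tags"
cumsums <- accumulate(toy, `+`)
# share of rows covered after each tag joins the list
map_dbl(cumsums, ~ mean(. != 0))
##         a         b         c
## 0.3333333 0.6666667 1.0000000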

Consequently, let’s assume a coverage rate of 1 for the first ten tags and compute the cumulative coverage from the eleventh tag onward:

tags_coverage <- tags_dummies %->% {
  start <- 11
  coverage_name <- as.character(glue("coverage_{start}"))
  head_names <- names(.)[seq_len(start - 1)]
  extract(start:100)
  nrows <- nrow(.)
  # running column sums: column k means "tagged by any of tags start..k"
  accumulate(`+`)
  # share of recipes covered once each tag is included
  map_dfc(~ length(which(. != 0)) / nrows)
  pivot_longer(everything(), values_to = coverage_name)
  # prepend the first ten tags with their assumed coverage of 1
  { bind_rows(tibble(name = head_names, "{coverage_name}" := 1), .) }
}
tail(tags_coverage)
## # A tibble: 6 × 2
##   name         coverage_11
##   <chr>              <dbl>
## 1 winter             0.999
## 2 served.cold        0.999
## 3 fish               0.999
## 4 italian            0.999
## 5 high.protein       0.999
## 6 snacks             0.999

Therefore, with the first 100 tags covering 99.9% of the recipes, they should be sufficient for the analysis.

tags %->% {
  count(tag)
  arrange(desc(n))
  head(100)
  # split into four frequency buckets for faceting
  mutate(buckets = desc(ntile(n, 4)))
} %>%
  ggplot(aes(fct_reorder(fct_reorder(tag, tag, .desc = TRUE), n), n)) +
    coord_flip() +
    facet_wrap(vars(buckets), nrow = 1, scales = "free_y") +
    geom_col() +
    geom_label(aes(label = n), hjust = "right", nudge_y = 100000) +
    ggtitle(glue("Top 100 tags (of total {n_distinct(tags$tag)})")) +
    # reference line at the total number of recipes
    annotate("segment",
      y = nrow(tags_dummies), yend = nrow(tags_dummies), x = 0, xend = 20
    ) +
    theme(
      axis.title.y = element_blank(),
      strip.text.x = element_blank()
    )

[Plot: top 100 tags by count, split into four frequency buckets, with a reference line at the total number of recipes]

Feature Extraction from Tags

From the chart, we can identify some useful tags. Tags that already cover a large proportion of the data are of little use, since they introduce little variation for identification. Beyond the first few, individual tags are also of little use, since each covers very little of the data on its own. What we are after are complementary tags: subsets that together give good coverage of the data. We can use exploratory factor analysis to identify such latent factors. PCA would work too. One could even argue that since each tag is a category, multiple correspondence analysis is the right tool, but that is unnecessary: the tags are one-hot encoded, so they are plain dummies that can be treated as continuous.
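As a quick sanity check of that last claim, here is a small simulated example (the tag names are made up) where a single latent theme drives four binary tags; factanal recovers it from the raw dummies:

theme <- rbinom(500, 1, 0.5)
sim <- tibble(
  sweet  = rbinom(500, 1, ifelse(theme == 1, 0.8, 0.1)),
  cakes  = rbinom(500, 1, ifelse(theme == 1, 0.7, 0.2)),
  spicy  = rbinom(500, 1, ifelse(theme == 1, 0.1, 0.8)),
  savory = rbinom(500, 1, ifelse(theme == 1, 0.2, 0.7))
)
# one factor should suffice, loading positively on sweet/cakes and
# negatively on spicy/savory (or vice versa; the sign is arbitrary)
factanal(~., factors = 1, data = sim)$loadings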

tags_factors <- tags_dummies %->% {
  extract(1:100)
  # fit a 20-factor model to the 100 most frequent tag dummies
  { factanal(~., 20, .) }
  # stash the fit's p-value and convergence flag as metadata
  status <- c(pval = PVAL[[1]], converged = converged)
  use_series(loadings)
  # drop the loadings class to a plain matrix and put factors in rows
  (\(.) t(.[1:dim(.)[1], ]))
  as_tibble(rownames = "factor")
  pivot_longer(!factor, "tag")
  mutate(
    factor = str_replace(factor, "(Factor)(.*)", "fct_\\2"),
    value = abs(value)
  )
  group_by(factor)
  arrange(desc(value), .by_group = TRUE)
  # keep the ten highest-loading tags per factor
  slice_head(n = 10)
  set_attr("status", status)
}
tags_factors %@% "status"
##      pval converged 
##         0         1
ggplot(tags_factors, aes(tag, value)) +
  geom_col() +
  coord_flip() +
  facet_wrap(vars(factor), scales = "free", nrow = 5) +
  ylab("value in factor") +
  ggtitle("Grouping of tags into latent factors") +
  theme(
    axis.text.x = element_blank(), strip.text.x = element_blank(),
    axis.title.y = element_blank()
  )

[Plot: ten highest-loading tags for each of the 20 latent factors]

The next step in exploratory factor analysis is to name our factors and group our variables into them. I exclude ingredient tags, as the recipes data already contains the complete ingredient list; the same goes for preparation-time tags. I also add dish-type and occasion groups, which the factor analysis hasn’t picked up.

The goal below is to aggregate as much of the data as possible into groups. First, we create the names of the new variables:

tags_keep <- set_names(names(tags_dummies)) %$>% {
  dietary <- dietary
  easy <- easy

  has_occasion <- occasion
  occasion <- list(
    dinner_party = dinner.party, holiday_event = holiday.event,
    weeknight = weeknight, picnic = picnic
  )

  dish_type <- list(
    main_dish = c(main.dish, lunch, breakfast, brunch),
    desserts = c(desserts, appetizers, cakes, snacks),
    bread = breads, salad = salads, soup = soups.stews, beverage = beverages
  )

  side_dish <- c(
    side.dishes, desserts, appetizers, cakes, snacks,
    breads, salads, soups.stews, beverages
  )

  healthy <- str_subset(., "\\bhealthy\\b")
  low_in_something <- str_subset(., "\\blow\\b")
  high_in_something <- str_subset(., "\\bhigh\\b")

  has_cuisine <- cuisine
  cuisine <- list(
    american = str_subset(., "\\bamerican\\b"),
    european =
      c(str_subset(., "\\beuropean\\b"), str_subset(., "\\bitalian\\b")),
    asian = str_subset(., "\\basian\\b")
  )

  has_equipment <- c(equipment, oven, stove.top, small.appliance)
  oven <- oven
  stove <- stove.top
  small_appliance <- small.appliance

  has_num_servings <- c(number.of.servings, for.1.or.2, for.large.groups)
  serve_small <- for.1.or.2
  serve_large <- for.large.groups

  season <- list(
    winter = winter, summer = summer, fall = fall,
    spring = spring, seasonal = seasonal
  )

  has_taste <- c(taste.mood, sweet, savory, spicy)
  taste <- list(sweet = sweet, savory = savory, spicy = spicy)
}
recipes_tags <- transmute(
  tags_dummies,
  # each group becomes a dummy: 1 whenever any of its member tags is set
  !!!map(flatten(tags_keep), ~ if_else(reduce(tags_dummies[.], `+`) > 0, 1, 0))
) %>%
  # restore the recipe ids stashed as an attribute earlier
  mutate(id = tags_dummies %@% ids) %>%
  select(id, everything())
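To see what the splice above does, here is the aggregation step on a toy pair of dummy columns (again with made-up names): reduce with `+` adds the member columns, and thresholding gives an “any of these tags” indicator.

toy <- tibble(breads = c(1, 0, 1), salads = c(0, 0, 1))
if_else(reduce(toy[c("breads", "salads")], `+`) > 0, 1, 0)
## [1] 1 0 1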

Now our recipe tags are finally ready.

vroom_write(recipes_tags, "~/Workspace/foodrecipes/out/recipes_tags.csv")