Continuing Where We Left

This post shows how to translate the second page of the dplyr cheatsheet to DataFrames.jl and Julia commnads.

using RDatasets, RCall, DataFrames, StatsBase, Statistics

R"library(tidyverse)";

Setting up the data using RCall

mpg = dataset("ggplot2", "mpg");

Vectorized Functions

lag, lead

julia> R"""
       mutate(mpg, leadyear = lead(year)) %>% select(year, leadyear) %>% tail
       """
RObject{VecSxp}
# A tibble: 6 x 2
   year leadyear
  <int>    <int>
1  1999     2008
2  2008     2008
3  2008     1999
4  1999     1999
5  1999     2008
6  2008       NA

In Julia:

julia> transform(mpg, :Year => (x->vcat(x[2:end], missing)) => :leadYear)[
           :, [:Year, :leadYear]] |> x->last(x, 10)
10×2 DataFrame
 Row │ Year   leadYear
Int32  Int32?
─────┼─────────────────
   11999      2008
   22008      2008
   32008      1999
   41999      1999
   51999      2008
   62008      2008
   72008      1999
   81999      1999
   91999      2008
  102008   missing

Note that one needs to wrap the anonymous function inside parentheses to avoid mixing with the => operator.

cumall

This function returns always false from the first false value it sees. For example:

julia> R"""
       x <- c(1, 3, 5, 9, 4, 3, 2, 2)
       cumall(x < 6)
       """
RObject{LglSxp}
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

This is just a commulative product for logical values:

julia> x = [1, 3, 5, 9, 4, 3, 2, 2];

julia> cumprod(x .< 6)'
1×8 adjoint(::BitVector) with eltype Bool:
 1  1  1  0  0  0  0  0

cumany

This one returns true from the first true it sees:

julia> R"""
       cumany(x > 6)
       """
RObject{LglSxp}
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

It would be equivalent to a logical sum:

julia> cumsum(x .> 6)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 0  0  0  1  1  1  1  1

cummax

This just takes binary comparisons between the present element and its previous element:

julia> R"cummax(x)"
RObject{RealSxp}
[1] 1 3 5 9 9 9 9 9

It is easy with the more general function accumulate in Julia:

julia> accumulate(max, x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  3  5  9  9  9  9  9

One could have in fact used this function with the previous examples as well by writing the appropriate anonymous function. Other functions such as mean, max etc. follow the same rule.

Rank functions

julia> R"cume_dist(x)"
RObject{RealSxp}
[1] 0.125 0.625 0.875 1.000 0.750 0.625 0.375 0.375

It is just the values of the empirical cdf:

julia> ecdf(x)(x)'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
 0.125  0.625  0.875  1.0  0.75  0.625  0.375  0.375
julia> R"dense_rank(x)"
RObject{IntSxp}
[1] 1 3 5 6 4 3 2 2
julia> denserank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  3  5  6  4  3  2  2
julia> R"min_rank(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 4 2 2

Julia equivalent is the competerank:

julia> competerank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  4  7  8  6  4  2  2

TODO The dplyr ntile function seems rather unstandard and so the results differ from Julia:

julia> R"ntile(x, 3)"
RObject{IntSxp}
[1] 1 2 3 3 2 2 1 1
julia> searchsortedfirst.(Ref(nquantile(x,2)), x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  2  3  3  3  2  2  2

As for percen_rank it is just a normalization to the 0-1 interval, hence requires subtracting 1 first:

julia> R"percent_rank(x)"
RObject{RealSxp}
[1] 0.0000000 0.4285714 0.8571429 1.0000000 0.7142857 0.4285714 0.1428571
[8] 0.1428571
julia> ((competerank(x).-1)/(length(x)-1))'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
 0.0  0.428571  0.857143  1.0  0.714286  0.428571  0.142857  0.142857

As for row_number rank:

julia> R"row_number(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 5 2 3
julia> ordinalrank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  4  7  8  6  5  2  3

between

Julia simply understands the between notion in a mathematical notation:

julia> R"between(x, 3, 6)"
RObject{LglSxp}
[1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
julia> (3 .<= x .<= 6)'
1×8 adjoint(::BitVector) with eltype Bool:
 0  1  1  0  1  1  0  0

near

Use the approximate notation (or the isapprox function):

julia> R"near(2, 1.9999999999)"
RObject{LglSxp}
[1] TRUE
julia> 2  1.9999999999
true

case_when

Julia doesn’t have a syntax for cases yet. The shortest equivalent that comes to mind is as follows:

julia> R"""
       case_when(x > 3 & x < 5 ~ "medium",
                 x <= 3 ~ "small",
                 x >= 5 ~ "large",
                 TRUE ~ "unknown")
       """
RObject{StrSxp}
[1] "small"  "small"  "large"  "large"  "medium" "small"  "small"  "small" 
julia> map(x) do x_
           if (3 < x_ < 5) "medium"
           elseif (x_ <= 3) "small"
           elseif (5 <= x_) "large"
           else "unknown"
           end
       end
8-element Vector{String}:
 "small"
 "small"
 "large"
 "large"
 "medium"
 "small"
 "small"
 "small"

coalesce

Both functions are inspired by the SQL command and so work nearly identically:

julia> R"""
       y <- c(NA, 2, 5, NA, 6)
       z <- c(3, 7, NA, 9, 4)
       coalesce(y,z)
       """
RObject{RealSxp}
[1] 3 2 5 9 6
julia> y = [missing, 2, 5, missing, 6];

julia> z = [3, 7, missing, 9, 4];

julia> coalesce.(y, z)'
1×5 adjoint(::Vector{Int64}) with eltype Int64:
 3  2  5  9  6

It is important to note that vectorizing coalesce (i.e. the dot) is necessary. Otherwise, since the y vector isn’t missing, nothing happens.

if_else

The julia equivalent is the vectorized ifelse:

julia> R"""
       if_else(x > 3, "large", "small")
       """
RObject{StrSxp}
[1] "small" "small" "large" "large" "large" "small" "small" "small"
julia> ifelse.(x .> 3, "large", "small")
8-element Vector{String}:
 "small"
 "small"
 "large"
 "large"
 "large"
 "small"
 "small"
 "small"

na_if

Simply use the more general functions in Julia:

julia> R"na_if(x, 9)"
RObject{RealSxp}
[1]  1  3  5 NA  4  3  2  2
julia> replace(x, 9=>missing)'
1×8 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
 1  3  5  missing  4  3  2  2

pmax and pmin

julia> R"pmax(y,z)"
RObject{RealSxp}
[1] NA  7 NA NA  6
julia> max.(y,z)'
1×5 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
 missing  7  missing  missing  6

recode

Again, simply using the more general function replace:

julia> R"""
       cvec <- c("a", "b", "a", "c", "b", "a", "c")
       recode(cvec, a = "apple")
       """
RObject{StrSxp}
[1] "apple" "b"     "apple" "c"     "b"     "apple" "c"    
julia> cvec = ["a", "b", "a", "c", "b", "a", "c"];

julia> replace(cvec, "a"=>"apple")
7-element Vector{String}:
 "apple"
 "b"
 "apple"
 "c"
 "b"
 "apple"
 "c"

Summary Functions

Back to dataframes where summary functions are better illustrated.

n()

julia> R"mpg %>% group_by(cyl) %>% summarise(n=n())"
RObject{VecSxp}
# A tibble: 4 x 2
    cyl     n
  <int> <int>
1     4    81
2     5     4
3     6    79
4     8    70
julia> groupby(mpg, :Cyl) |> x->combine(x, nrow)
4×2 DataFrame
 Row │ Cyl    nrow
Int32  Int64
─────┼──────────────
   14     81
   25      4
   36     79
   48     70

n_distinct

julia> R"mpg %>% group_by(cyl) %>% summarise(trans=n_distinct(trans))"
RObject{VecSxp}
# A tibble: 4 x 2
    cyl trans
  <int> <int>
1     4     9
2     5     2
3     6     8
4     8     8
julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans => 
           (x_->length(unique(x_))) => :Trans)
4×2 DataFrame
 Row │ Cyl    Trans
Int32  Int64
─────┼──────────────
   14      9
   25      2
   36      8
   48      8

Alternatively, one could just composition the two inner functions instead (use \circ to write the operator):

julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans => length  unique => :Trans)
4×2 DataFrame
 Row │ Cyl    Trans
Int32  Int64
─────┼──────────────
   14      9
   25      2
   36      8
   48      8

mean and median

julia> R"mean(x)"
RObject{RealSxp}
[1] 3.625
julia> mean(x)
3.625
julia> R"median(x)"
RObject{RealSxp}
[1] 3
julia> median(x)
3.0

first, last, nth

The Julia equivalents are also named first and last. As for nth, one has to simply use array indexing.

Rank and Spread functions

These are all available in Julia by similar names:

julia> R"c(max(x), min(x), quantile(x, 0.25), IQR(x), mad(x), sd(x), var(x))"
RObject{RealSxp}
                       25%                                     
9.000000 1.000000 2.000000 2.250000 1.482600 2.503569 6.267857 
julia> [maximum(x); minimum(x); quantile(x, 0.25); iqr(x); mad(x); std(x); var(x)]'
1×7 adjoint(::Vector{Float64}) with eltype Float64:
 9.0  1.0  2.0  2.25  1.4826  2.50357  6.26786

Row Names

Julia dataframes do not have row names by design so there is nothing to worry about here.

Combining Tables

Combining tables is really easy in Julia and is similar to how it is done in R for the most part (since both are inspired by SQL verbs).

Column and row binding

Binding columns is done with the hcat function:

julia> R"""
       df1 <- tibble(a = 1:3, b = 4:6)
       df2 <- tibble(c = 7:9, d = 10:12)
       bind_cols(df1, df2)
       """
RObject{VecSxp}
# A tibble: 3 x 4
      a     b     c     d
  <int> <int> <int> <int>
1     1     4     7    10
2     2     5     8    11
3     3     6     9    12
julia> df1 = DataFrame(a=1:3, b=4:6);

julia> df2 = DataFrame(c=7:9, d=10:12);

julia> hcat(df1, df2)
3×4 DataFrame
 Row │ a      b      c      d
Int64  Int64  Int64  Int64
─────┼────────────────────────────
   11      4      7     10
   22      5      8     11
   33      6      9     12

Binding rows requires vcat:

julia> R"""
       df1 <- tibble(a = 1:3, b = 4:6)
       df2 <- tibble(a = 7:9, b = 10:12)
       bind_rows(df1, df2)
       """
RObject{VecSxp}
# A tibble: 6 x 2
      a     b
  <int> <int>
1     1     4
2     2     5
3     3     6
4     7    10
5     8    11
6     9    12
julia> df1 = DataFrame(a=1:3, b=4:6);

julia> df2 = DataFrame(a=7:9, b=10:12);

julia> vcat(df1, df2)
6×2 DataFrame
 Row │ a      b
Int64  Int64
─────┼──────────────
   11      4
   22      5
   33      6
   47     10
   58     11
   69     12

Joins

Example data taken from DataFrames.jl:

julia> df1 = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
                            Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                            Category = [1, 2, 3, 4, 5]);

julia> df2 = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
                            Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                            Name = ["a", "b", "c", "d", "e"]);

julia> innerjoin(df1, df2, on = [:City => :Location, :Job => :Work])
9×4 DataFrame
 Row │ City       Job     Category  Name
String     String  Int64     String
─────┼─────────────────────────────────────
   1 │ Amsterdam  Lawyer         1  a
   2 │ London     Lawyer         2  b
   3 │ London     Lawyer         3  b
   4 │ London     Lawyer         2  c
   5 │ London     Lawyer         3  c
   6 │ New York   Doctor         4  d
   7 │ New York   Doctor         5  d
   8 │ New York   Doctor         4  e
   9 │ New York   Doctor         5  e

In R, it would be:

julia> R"""
       df1 <- tibble(City = c("Amsterdam", "London", "London", "New York", "New York"),
                            Job = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
                            Category = c(1, 2, 3, 4, 5))
       df2 <- tibble(Location = c("Amsterdam", "London", "London", "New York", "New York"),
                            Work = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
                            Name = c("a", "b", "c", "d", "e"))
       inner_join(df1, df2, by = c("City" = "Location", "Job" = "Work"))
       """
RObject{VecSxp}
# A tibble: 9 x 4
  City      Job    Category Name 
  <chr>     <chr>     <dbl> <chr>
1 Amsterdam Lawyer        1 a    
2 London    Lawyer        2 b    
3 London    Lawyer        2 c    
4 London    Lawyer        3 b    
5 London    Lawyer        3 c    
6 New York  Doctor        4 d    
7 New York  Doctor        4 e    
8 New York  Doctor        5 d    
9 New York  Doctor        5 e    

Set operations in R are somewhat redunant in the presence of join verbs, and so in Julia one needs to rely soley on those join verbs.

This concludes the dplyr cheatsheet translation to DataFrames.jl commands.