Continuing Where We Left

This post shows how to translate the second page of the dplyr cheatsheet to DataFrames.jl and Julia commnads.

using RDatasets, RCall, DataFrames, StatsBase, Statistics

R"library(tidyverse)";

Setting up the data using RCall

mpg = dataset("ggplot2", "mpg");

Vectorized Functions

`lag`, `lead`

julia> R"""
       mutate(mpg, leadyear = lead(year)) %>% select(year, leadyear) %>% tail
       """
RObject{VecSxp}
# A tibble: 6 x 2
   year leadyear
  <int>    <int>
1  1999     2008
2  2008     2008
3  2008     1999
4  1999     1999
5  1999     2008
6  2008       NA

In Julia:

julia> transform(mpg, :Year => (x->vcat(x[2:end], missing)) => :leadYear)[
           :, [:Year, :leadYear]] |> x->last(x, 10)
10×2 DataFrame
 Row │ Year   leadYear
     │ Int32  Int32?
─────┼─────────────────
   1 │  1999      2008
   2 │  2008      2008
   3 │  2008      1999
   4 │  1999      1999
   5 │  1999      2008
   6 │  2008      2008
   7 │  2008      1999
   8 │  1999      1999
   9 │  1999      2008
  10 │  2008   missing

Note that one needs to wrap the anonymous function inside parentheses to avoid mixing with the => operator.

`cumall`

This function returns always false from the first false value it sees. For example:

julia> R"""
       x <- c(1, 3, 5, 9, 4, 3, 2, 2)
       cumall(x < 6)
       """
RObject{LglSxp}
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

This is just a commulative product for logical values:

julia> x = [1, 3, 5, 9, 4, 3, 2, 2];

julia> cumprod(x .< 6)'
1×8 adjoint(::BitVector) with eltype Bool:
 1  1  1  0  0  0  0  0

`cumany`

This one returns true from the first true it sees:

julia> R"""
       cumany(x > 6)
       """
RObject{LglSxp}
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

It would be equivalent to a logical sum:

julia> cumsum(x .> 6)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 0  0  0  1  1  1  1  1

`cummax`

This just takes binary comparisons between the present element and its previous element:

julia> R"cummax(x)"
RObject{RealSxp}
[1] 1 3 5 9 9 9 9 9

It is easy with the more general function accumulate in Julia:

julia> accumulate(max, x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  3  5  9  9  9  9  9

One could have in fact used this function with the previous examples as well by writing the appropriate anonymous function. Other functions such as mean, max etc. follow the same rule.

Rank functions

julia> R"cume_dist(x)"
RObject{RealSxp}
[1] 0.125 0.625 0.875 1.000 0.750 0.625 0.375 0.375

It is just the values of the empirical cdf:

julia> ecdf(x)(x)'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
 0.125  0.625  0.875  1.0  0.75  0.625  0.375  0.375

julia> R"dense_rank(x)"
RObject{IntSxp}
[1] 1 3 5 6 4 3 2 2

julia> denserank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  3  5  6  4  3  2  2

julia> R"min_rank(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 4 2 2

Julia equivalent is the competerank:

julia> competerank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  4  7  8  6  4  2  2

TODO The dplyr ntile function seems rather unstandard and so the results differ from Julia:

julia> R"ntile(x, 3)"
RObject{IntSxp}
[1] 1 2 3 3 2 2 1 1

julia> searchsortedfirst.(Ref(nquantile(x,2)), x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  2  3  3  3  2  2  2

As for percen_rank it is just a normalization to the 0-1 interval, hence requires subtracting 1 first:

julia> R"percent_rank(x)"
RObject{RealSxp}
[1] 0.0000000 0.4285714 0.8571429 1.0000000 0.7142857 0.4285714 0.1428571
[8] 0.1428571

julia> ((competerank(x).-1)/(length(x)-1))'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
 0.0  0.428571  0.857143  1.0  0.714286  0.428571  0.142857  0.142857

As for row_number rank:

julia> R"row_number(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 5 2 3

julia> ordinalrank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  4  7  8  6  5  2  3

`between`

Julia simply understands the between notion in a mathematical notation:

julia> R"between(x, 3, 6)"
RObject{LglSxp}
[1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE

julia> (3 .<= x .<= 6)'
1×8 adjoint(::BitVector) with eltype Bool:
 0  1  1  0  1  1  0  0

`near`

Use the approximate notation (or the isapprox function):

julia> R"near(2, 1.9999999999)"
RObject{LglSxp}
[1] TRUE

julia> 2 ≈ 1.9999999999
true

`case_when`

Julia doesn’t have a syntax for cases yet. The shortest equivalent that comes to mind is as follows:

julia> R"""
       case_when(x > 3 & x < 5 ~ "medium",
                 x <= 3 ~ "small",
                 x >= 5 ~ "large",
                 TRUE ~ "unknown")
       """
RObject{StrSxp}
[1] "small"  "small"  "large"  "large"  "medium" "small"  "small"  "small"

julia> map(x) do x_
           if (3 < x_ < 5) "medium"
           elseif (x_ <= 3) "small"
           elseif (5 <= x_) "large"
           else "unknown"
           end
       end
8-element Vector{String}:
 "small"
 "small"
 "large"
 "large"
 "medium"
 "small"
 "small"
 "small"

`coalesce`

Both functions are inspired by the SQL command and so work nearly identically:

julia> R"""
       y <- c(NA, 2, 5, NA, 6)
       z <- c(3, 7, NA, 9, 4)
       coalesce(y,z)
       """
RObject{RealSxp}
[1] 3 2 5 9 6

julia> y = [missing, 2, 5, missing, 6];

julia> z = [3, 7, missing, 9, 4];

julia> coalesce.(y, z)'
1×5 adjoint(::Vector{Int64}) with eltype Int64:
 3  2  5  9  6

It is important to note that vectorizing coalesce (i.e. the dot) is necessary. Otherwise, since the y vector isn’t missing, nothing happens.

`if_else`

The julia equivalent is the vectorized ifelse:

julia> R"""
       if_else(x > 3, "large", "small")
       """
RObject{StrSxp}
[1] "small" "small" "large" "large" "large" "small" "small" "small"

julia> ifelse.(x .> 3, "large", "small")
8-element Vector{String}:
 "small"
 "small"
 "large"
 "large"
 "large"
 "small"
 "small"
 "small"

`na_if`

Simply use the more general functions in Julia:

julia> R"na_if(x, 9)"
RObject{RealSxp}
[1]  1  3  5 NA  4  3  2  2

julia> replace(x, 9=>missing)'
1×8 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
 1  3  5  missing  4  3  2  2

`pmax` and `pmin`

julia> R"pmax(y,z)"
RObject{RealSxp}
[1] NA  7 NA NA  6

julia> max.(y,z)'
1×5 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
 missing  7  missing  missing  6

`recode`

Again, simply using the more general function replace:

julia> R"""
       cvec <- c("a", "b", "a", "c", "b", "a", "c")
       recode(cvec, a = "apple")
       """
RObject{StrSxp}
[1] "apple" "b"     "apple" "c"     "b"     "apple" "c"

julia> cvec = ["a", "b", "a", "c", "b", "a", "c"];

julia> replace(cvec, "a"=>"apple")
7-element Vector{String}:
 "apple"
 "b"
 "apple"
 "c"
 "b"
 "apple"
 "c"

Summary Functions

Back to dataframes where summary functions are better illustrated.

`n()`

julia> R"mpg %>% group_by(cyl) %>% summarise(n=n())"
RObject{VecSxp}
# A tibble: 4 x 2
    cyl     n
  <int> <int>
1     4    81
2     5     4
3     6    79
4     8    70

julia> groupby(mpg, :Cyl) |> x->combine(x, nrow)
4×2 DataFrame
 Row │ Cyl    nrow
     │ Int32  Int64
─────┼──────────────
   1 │     4     81
   2 │     5      4
   3 │     6     79
   4 │     8     70

`n_distinct`

julia> R"mpg %>% group_by(cyl) %>% summarise(trans=n_distinct(trans))"
RObject{VecSxp}
# A tibble: 4 x 2
    cyl trans
  <int> <int>
1     4     9
2     5     2
3     6     8
4     8     8

julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans => 
           (x_->length(unique(x_))) => :Trans)
4×2 DataFrame
 Row │ Cyl    Trans
     │ Int32  Int64
─────┼──────────────
   1 │     4      9
   2 │     5      2
   3 │     6      8
   4 │     8      8

Alternatively, one could just composition the two inner functions instead (use \circ to write the operator):

julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans => length ∘ unique => :Trans)
4×2 DataFrame
 Row │ Cyl    Trans
     │ Int32  Int64
─────┼──────────────
   1 │     4      9
   2 │     5      2
   3 │     6      8
   4 │     8      8

`mean` and `median`

julia> R"mean(x)"
RObject{RealSxp}
[1] 3.625

julia> mean(x)
3.625

julia> R"median(x)"
RObject{RealSxp}
[1] 3

julia> median(x)
3.0

`first`, `last`, `nth`

The Julia equivalents are also named first and last. As for nth, one has to simply use array indexing.

Rank and Spread functions

These are all available in Julia by similar names:

julia> R"c(max(x), min(x), quantile(x, 0.25), IQR(x), mad(x), sd(x), var(x))"
RObject{RealSxp}
                       25%                                     
9.000000 1.000000 2.000000 2.250000 1.482600 2.503569 6.267857

julia> [maximum(x); minimum(x); quantile(x, 0.25); iqr(x); mad(x); std(x); var(x)]'
1×7 adjoint(::Vector{Float64}) with eltype Float64:
 9.0  1.0  2.0  2.25  1.4826  2.50357  6.26786

Row Names

Julia dataframes do not have row names by design so there is nothing to worry about here.

Combining Tables

Combining tables is really easy in Julia and is similar to how it is done in R for the most part (since both are inspired by SQL verbs).

Column and row binding

Binding columns is done with the hcat function:

julia> R"""
       df1 <- tibble(a = 1:3, b = 4:6)
       df2 <- tibble(c = 7:9, d = 10:12)
       bind_cols(df1, df2)
       """
RObject{VecSxp}
# A tibble: 3 x 4
      a     b     c     d
  <int> <int> <int> <int>
1     1     4     7    10
2     2     5     8    11
3     3     6     9    12

julia> df1 = DataFrame(a=1:3, b=4:6);

julia> df2 = DataFrame(c=7:9, d=10:12);

julia> hcat(df1, df2)
3×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      7     10
   2 │     2      5      8     11
   3 │     3      6      9     12

Binding rows requires vcat:

julia> R"""
       df1 <- tibble(a = 1:3, b = 4:6)
       df2 <- tibble(a = 7:9, b = 10:12)
       bind_rows(df1, df2)
       """
RObject{VecSxp}
# A tibble: 6 x 2
      a     b
  <int> <int>
1     1     4
2     2     5
3     3     6
4     7    10
5     8    11
6     9    12

julia> df1 = DataFrame(a=1:3, b=4:6);

julia> df2 = DataFrame(a=7:9, b=10:12);

julia> vcat(df1, df2)
6×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6
   4 │     7     10
   5 │     8     11
   6 │     9     12

Joins

Example data taken from DataFrames.jl:

julia> df1 = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
                            Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                            Category = [1, 2, 3, 4, 5]);

julia> df2 = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
                            Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                            Name = ["a", "b", "c", "d", "e"]);

julia> innerjoin(df1, df2, on = [:City => :Location, :Job => :Work])
9×4 DataFrame
 Row │ City       Job     Category  Name
     │ String     String  Int64     String
─────┼─────────────────────────────────────
   1 │ Amsterdam  Lawyer         1  a
   2 │ London     Lawyer         2  b
   3 │ London     Lawyer         3  b
   4 │ London     Lawyer         2  c
   5 │ London     Lawyer         3  c
   6 │ New York   Doctor         4  d
   7 │ New York   Doctor         5  d
   8 │ New York   Doctor         4  e
   9 │ New York   Doctor         5  e

In R, it would be:

julia> R"""
       df1 <- tibble(City = c("Amsterdam", "London", "London", "New York", "New York"),
                            Job = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
                            Category = c(1, 2, 3, 4, 5))
       df2 <- tibble(Location = c("Amsterdam", "London", "London", "New York", "New York"),
                            Work = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
                            Name = c("a", "b", "c", "d", "e"))
       inner_join(df1, df2, by = c("City" = "Location", "Job" = "Work"))
       """
RObject{VecSxp}
# A tibble: 9 x 4
  City      Job    Category Name 
  <chr>     <chr>     <dbl> <chr>
1 Amsterdam Lawyer        1 a    
2 London    Lawyer        2 b    
3 London    Lawyer        2 c    
4 London    Lawyer        3 b    
5 London    Lawyer        3 c    
6 New York  Doctor        4 d    
7 New York  Doctor        4 e    
8 New York  Doctor        5 d    
9 New York  Doctor        5 e

Set operations in R are somewhat redunant in the presence of join verbs, and so in Julia one needs to rely soley on those join verbs.

This concludes the dplyr cheatsheet translation to DataFrames.jl commands.

Continuing Where We Left#

Setting up the data using RCall#

Vectorized Functions#

lag, lead#

cumall#

cumany#

cummax#

Rank functions#

between#

near#

case_when#

coalesce#

if_else#

na_if#

pmax and pmin#

recode#

Summary Functions#

n()#

n_distinct#

mean and median#

first, last, nth#

Rank and Spread functions#

Row Names#

Combining Tables#

Column and row binding#

Joins#