Skip to main content

dplyr Cheatsheet to DataFrames.jl Page 2

·9 mins

Continuing Where We Left #

This post shows how to translate the second page of the dplyr cheatsheet to DataFrames.jl and Julia commnads.

using RDatasets, RCall, DataFrames, StatsBase, Statistics

R"library(tidyverse)";

Setting up the data using RCall #

mpg = dataset("ggplot2", "mpg");

Vectorized Functions #

lag, lead #

julia> R"""
       mutate(mpg, leadyear = lead(year)) %>% select(year, leadyear) %>% tail
       """
RObject{VecSxp}
# A tibble: 6 x 2
   year leadyear
  <int>    <int>
1  1999     2008
2  2008     2008
3  2008     1999
4  1999     1999
5  1999     2008
6  2008       NA

In Julia:

julia> transform(mpg, :Year => (x->vcat(x[2:end], missing)) => :leadYear)[
           :, [:Year, :leadYear]] |> x->last(x, 10)
10×2 DataFrame
 Row  Year   leadYear
      Int32  Int32?
─────┼─────────────────
   1   1999      2008
   2   2008      2008
   3   2008      1999
   4   1999      1999
   5   1999      2008
   6   2008      2008
   7   2008      1999
   8   1999      1999
   9   1999      2008
  10   2008   missing

Note that one needs to wrap the anonymous function inside parentheses to avoid mixing with the => operator.

cumall #

This function returns always false from the first false value it sees. For example:

julia> R"""
       x <- c(1, 3, 5, 9, 4, 3, 2, 2)
       cumall(x < 6)
       """
RObject{LglSxp}
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

This is just a commulative product for logical values:

julia> x = [1, 3, 5, 9, 4, 3, 2, 2];

julia> cumprod(x .< 6)'
1×8 adjoint(::BitVector) with eltype Bool:
 1  1  1  0  0  0  0  0

cumany #

This one returns true from the first true it sees:

julia> R"""
       cumany(x > 6)
       """
RObject{LglSxp}
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

It would be equivalent to a logical sum:

julia> cumsum(x .> 6)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 0  0  0  1  1  1  1  1

cummax #

This just takes binary comparisons between the present element and its previous element:

julia> R"cummax(x)"
RObject{RealSxp}
[1] 1 3 5 9 9 9 9 9

It is easy with the more general function accumulate in Julia:

julia> accumulate(max, x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  3  5  9  9  9  9  9

One could have in fact used this function with the previous examples as well by writing the appropriate anonymous function. Other functions such as mean, max etc. follow the same rule.

Rank functions #

julia> R"cume_dist(x)"
RObject{RealSxp}
[1] 0.125 0.625 0.875 1.000 0.750 0.625 0.375 0.375

It is just the values of the empirical cdf:

julia> ecdf(x)(x)'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
 0.125  0.625  0.875  1.0  0.75  0.625  0.375  0.375
julia> R"dense_rank(x)"
RObject{IntSxp}
[1] 1 3 5 6 4 3 2 2
julia> denserank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  3  5  6  4  3  2  2
julia> R"min_rank(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 4 2 2

Julia equivalent is the competerank:

julia> competerank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  4  7  8  6  4  2  2

TODO The dplyr ntile function seems rather unstandard and so the results differ from Julia:

julia> R"ntile(x, 3)"
RObject{IntSxp}
[1] 1 2 3 3 2 2 1 1
julia> searchsortedfirst.(Ref(nquantile(x,2)), x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  2  3  3  3  2  2  2

As for percen_rank it is just a normalization to the 0-1 interval, hence requires subtracting 1 first:

julia> R"percent_rank(x)"
RObject{RealSxp}
[1] 0.0000000 0.4285714 0.8571429 1.0000000 0.7142857 0.4285714 0.1428571
[8] 0.1428571
julia> ((competerank(x).-1)/(length(x)-1))'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
 0.0  0.428571  0.857143  1.0  0.714286  0.428571  0.142857  0.142857

As for row_number rank:

julia> R"row_number(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 5 2 3
julia> ordinalrank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
 1  4  7  8  6  5  2  3

between #

Julia simply understands the between notion in a mathematical notation:

julia> R"between(x, 3, 6)"
RObject{LglSxp}
[1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
julia> (3 .<= x .<= 6)'
1×8 adjoint(::BitVector) with eltype Bool:
 0  1  1  0  1  1  0  0

near #

Use the approximate notation (or the isapprox function):

julia> R"near(2, 1.9999999999)"
RObject{LglSxp}
[1] TRUE
julia> 2  1.9999999999
true

case_when #

Julia doesn’t have a syntax for cases yet. The shortest equivalent that comes to mind is as follows:

julia> R"""
       case_when(x > 3 & x < 5 ~ "medium",
                 x <= 3 ~ "small",
                 x >= 5 ~ "large",
                 TRUE ~ "unknown")
       """
RObject{StrSxp}
[1] "small"  "small"  "large"  "large"  "medium" "small"  "small"  "small" 
julia> map(x) do x_
           if (3 < x_ < 5) "medium"
           elseif (x_ <= 3) "small"
           elseif (5 <= x_) "large"
           else "unknown"
           end
       end
8-element Vector{String}:
 "small"
 "small"
 "large"
 "large"
 "medium"
 "small"
 "small"
 "small"

coalesce #

Both functions are inspired by the SQL command and so work nearly identically:

julia> R"""
       y <- c(NA, 2, 5, NA, 6)
       z <- c(3, 7, NA, 9, 4)
       coalesce(y,z)
       """
RObject{RealSxp}
[1] 3 2 5 9 6
julia> y = [missing, 2, 5, missing, 6];

julia> z = [3, 7, missing, 9, 4];

julia> coalesce.(y, z)'
1×5 adjoint(::Vector{Int64}) with eltype Int64:
 3  2  5  9  6

It is important to note that vectorizing coalesce (i.e. the dot) is necessary. Otherwise, since the y vector isn’t missing, nothing happens.

if_else #

The julia equivalent is the vectorized ifelse:

julia> R"""
       if_else(x > 3, "large", "small")
       """
RObject{StrSxp}
[1] "small" "small" "large" "large" "large" "small" "small" "small"
julia> ifelse.(x .> 3, "large", "small")
8-element Vector{String}:
 "small"
 "small"
 "large"
 "large"
 "large"
 "small"
 "small"
 "small"

na_if #

Simply use the more general functions in Julia:

julia> R"na_if(x, 9)"
RObject{RealSxp}
[1]  1  3  5 NA  4  3  2  2
julia> replace(x, 9=>missing)'
1×8 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
 1  3  5  missing  4  3  2  2

pmax and pmin #

julia> R"pmax(y,z)"
RObject{RealSxp}
[1] NA  7 NA NA  6
julia> max.(y,z)'
1×5 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
 missing  7  missing  missing  6

recode #

Again, simply using the more general function replace:

julia> R"""
       cvec <- c("a", "b", "a", "c", "b", "a", "c")
       recode(cvec, a = "apple")
       """
RObject{StrSxp}
[1] "apple" "b"     "apple" "c"     "b"     "apple" "c"    
julia> cvec = ["a", "b", "a", "c", "b", "a", "c"];

julia> replace(cvec, "a"=>"apple")
7-element Vector{String}:
 "apple"
 "b"
 "apple"
 "c"
 "b"
 "apple"
 "c"

Summary Functions #

Back to dataframes where summary functions are better illustrated.

n() #

julia> R"mpg %>% group_by(cyl) %>% summarise(n=n())"
RObject{VecSxp}
# A tibble: 4 x 2
    cyl     n
  <int> <int>
1     4    81
2     5     4
3     6    79
4     8    70
julia> groupby(mpg, :Cyl) |> x->combine(x, nrow)
4×2 DataFrame
 Row  Cyl    nrow
      Int32  Int64
─────┼──────────────
   1      4     81
   2      5      4
   3      6     79
   4      8     70

n_distinct #

julia> R"mpg %>% group_by(cyl) %>% summarise(trans=n_distinct(trans))"
RObject{VecSxp}
# A tibble: 4 x 2
    cyl trans
  <int> <int>
1     4     9
2     5     2
3     6     8
4     8     8
julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans => 
           (x_->length(unique(x_))) => :Trans)
4×2 DataFrame
 Row  Cyl    Trans
      Int32  Int64
─────┼──────────────
   1      4      9
   2      5      2
   3      6      8
   4      8      8

Alternatively, one could just composition the two inner functions instead (use \circ to write the operator):

julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans => length  unique => :Trans)
4×2 DataFrame
 Row  Cyl    Trans
      Int32  Int64
─────┼──────────────
   1      4      9
   2      5      2
   3      6      8
   4      8      8

mean and median #

julia> R"mean(x)"
RObject{RealSxp}
[1] 3.625
julia> mean(x)
3.625
julia> R"median(x)"
RObject{RealSxp}
[1] 3
julia> median(x)
3.0

first, last, nth #

The Julia equivalents are also named first and last. As for nth, one has to simply use array indexing.

Rank and Spread functions #

These are all available in Julia by similar names:

julia> R"c(max(x), min(x), quantile(x, 0.25), IQR(x), mad(x), sd(x), var(x))"
RObject{RealSxp}
                       25%                                     
9.000000 1.000000 2.000000 2.250000 1.482600 2.503569 6.267857 
julia> [maximum(x); minimum(x); quantile(x, 0.25); iqr(x); mad(x); std(x); var(x)]'
1×7 adjoint(::Vector{Float64}) with eltype Float64:
 9.0  1.0  2.0  2.25  1.4826  2.50357  6.26786

Row Names #

Julia dataframes do not have row names by design so there is nothing to worry about here.

Combining Tables #

Combining tables is really easy in Julia and is similar to how it is done in R for the most part (since both are inspired by SQL verbs).

Column and row binding #

Binding columns is done with the hcat function:

julia> R"""
       df1 <- tibble(a = 1:3, b = 4:6)
       df2 <- tibble(c = 7:9, d = 10:12)
       bind_cols(df1, df2)
       """
RObject{VecSxp}
# A tibble: 3 x 4
      a     b     c     d
  <int> <int> <int> <int>
1     1     4     7    10
2     2     5     8    11
3     3     6     9    12
julia> df1 = DataFrame(a=1:3, b=4:6);

julia> df2 = DataFrame(c=7:9, d=10:12);

julia> hcat(df1, df2)
3×4 DataFrame
 Row  a      b      c      d
      Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1      1      4      7     10
   2      2      5      8     11
   3      3      6      9     12

Binding rows requires vcat:

julia> R"""
       df1 <- tibble(a = 1:3, b = 4:6)
       df2 <- tibble(a = 7:9, b = 10:12)
       bind_rows(df1, df2)
       """
RObject{VecSxp}
# A tibble: 6 x 2
      a     b
  <int> <int>
1     1     4
2     2     5
3     3     6
4     7    10
5     8    11
6     9    12
julia> df1 = DataFrame(a=1:3, b=4:6);

julia> df2 = DataFrame(a=7:9, b=10:12);

julia> vcat(df1, df2)
6×2 DataFrame
 Row  a      b
      Int64  Int64
─────┼──────────────
   1      1      4
   2      2      5
   3      3      6
   4      7     10
   5      8     11
   6      9     12

Joins #

Example data taken from DataFrames.jl:

julia> df1 = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
                            Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                            Category = [1, 2, 3, 4, 5]);

julia> df2 = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
                            Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                            Name = ["a", "b", "c", "d", "e"]);

julia> innerjoin(df1, df2, on = [:City => :Location, :Job => :Work])
9×4 DataFrame
 Row  City       Job     Category  Name
      String     String  Int64     String
─────┼─────────────────────────────────────
   1  Amsterdam  Lawyer         1  a
   2  London     Lawyer         2  b
   3  London     Lawyer         3  b
   4  London     Lawyer         2  c
   5  London     Lawyer         3  c
   6  New York   Doctor         4  d
   7  New York   Doctor         5  d
   8  New York   Doctor         4  e
   9  New York   Doctor         5  e

In R, it would be:

julia> R"""
       df1 <- tibble(City = c("Amsterdam", "London", "London", "New York", "New York"),
                            Job = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
                            Category = c(1, 2, 3, 4, 5))
       df2 <- tibble(Location = c("Amsterdam", "London", "London", "New York", "New York"),
                            Work = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
                            Name = c("a", "b", "c", "d", "e"))
       inner_join(df1, df2, by = c("City" = "Location", "Job" = "Work"))
       """
RObject{VecSxp}
# A tibble: 9 x 4
  City      Job    Category Name 
  <chr>     <chr>     <dbl> <chr>
1 Amsterdam Lawyer        1 a    
2 London    Lawyer        2 b    
3 London    Lawyer        2 c    
4 London    Lawyer        3 b    
5 London    Lawyer        3 c    
6 New York  Doctor        4 d    
7 New York  Doctor        4 e    
8 New York  Doctor        5 d    
9 New York  Doctor        5 e    

Set operations in R are somewhat redunant in the presence of join verbs, and so in Julia one needs to rely soley on those join verbs.

This concludes the dplyr cheatsheet translation to DataFrames.jl commands.