Continuing Where We Left
This post shows how to translate the second page of the dplyr
cheatsheet to DataFrames.jl
and Julia commnads.
using RDatasets, RCall, DataFrames, StatsBase, Statistics
R"library(tidyverse)";
Setting up the data using RCall
mpg = dataset("ggplot2", "mpg");
Vectorized Functions
lag
, lead
julia> R"""
mutate(mpg, leadyear = lead(year)) %>% select(year, leadyear) %>% tail
"""
RObject{VecSxp}
# A tibble: 6 x 2
year leadyear
<int> <int>
1 1999 2008
2 2008 2008
3 2008 1999
4 1999 1999
5 1999 2008
6 2008 NA
In Julia:
julia> transform(mpg, :Year => (x->vcat(x[2:end], missing)) => :leadYear)[
:, [:Year, :leadYear]] |> x->last(x, 10)
10×2 DataFrame
Row │ Year leadYear
│ Int32 Int32?
─────┼─────────────────
1 │ 1999 2008
2 │ 2008 2008
3 │ 2008 1999
4 │ 1999 1999
5 │ 1999 2008
6 │ 2008 2008
7 │ 2008 1999
8 │ 1999 1999
9 │ 1999 2008
10 │ 2008 missing
Note that one needs to wrap the anonymous function inside parentheses to avoid mixing with the =>
operator.
cumall
This function returns always false from the first false value it sees. For example:
julia> R"""
x <- c(1, 3, 5, 9, 4, 3, 2, 2)
cumall(x < 6)
"""
RObject{LglSxp}
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
This is just a commulative product for logical values:
julia> x = [1, 3, 5, 9, 4, 3, 2, 2];
julia> cumprod(x .< 6)'
1×8 adjoint(::BitVector) with eltype Bool:
1 1 1 0 0 0 0 0
cumany
This one returns true from the first true it sees:
julia> R"""
cumany(x > 6)
"""
RObject{LglSxp}
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
It would be equivalent to a logical sum:
julia> cumsum(x .> 6)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
0 0 0 1 1 1 1 1
cummax
This just takes binary comparisons between the present element and its previous element:
julia> R"cummax(x)"
RObject{RealSxp}
[1] 1 3 5 9 9 9 9 9
It is easy with the more general function accumulate
in Julia:
julia> accumulate(max, x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 3 5 9 9 9 9 9
One could have in fact used this function with the previous examples as well by writing the appropriate anonymous function. Other functions such as mean
, max
etc. follow the same rule.
Rank functions
julia> R"cume_dist(x)"
RObject{RealSxp}
[1] 0.125 0.625 0.875 1.000 0.750 0.625 0.375 0.375
It is just the values of the empirical cdf:
julia> ecdf(x)(x)'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
0.125 0.625 0.875 1.0 0.75 0.625 0.375 0.375
julia> R"dense_rank(x)"
RObject{IntSxp}
[1] 1 3 5 6 4 3 2 2
julia> denserank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 3 5 6 4 3 2 2
julia> R"min_rank(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 4 2 2
Julia equivalent is the competerank
:
julia> competerank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 4 7 8 6 4 2 2
TODO The dplyr
ntile
function seems rather unstandard and so the results differ from Julia:
julia> R"ntile(x, 3)"
RObject{IntSxp}
[1] 1 2 3 3 2 2 1 1
julia> searchsortedfirst.(Ref(nquantile(x,2)), x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 2 3 3 3 2 2 2
As for percen_rank
it is just a normalization to the 0-1 interval, hence requires subtracting 1 first:
julia> R"percent_rank(x)"
RObject{RealSxp}
[1] 0.0000000 0.4285714 0.8571429 1.0000000 0.7142857 0.4285714 0.1428571
[8] 0.1428571
julia> ((competerank(x).-1)/(length(x)-1))'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
0.0 0.428571 0.857143 1.0 0.714286 0.428571 0.142857 0.142857
As for row_number
rank:
julia> R"row_number(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 5 2 3
julia> ordinalrank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 4 7 8 6 5 2 3
between
Julia simply understands the between notion in a mathematical notation:
julia> R"between(x, 3, 6)"
RObject{LglSxp}
[1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
julia> (3 .<= x .<= 6)'
1×8 adjoint(::BitVector) with eltype Bool:
0 1 1 0 1 1 0 0
near
Use the approximate notation (or the isapprox
function):
julia> R"near(2, 1.9999999999)"
RObject{LglSxp}
[1] TRUE
julia> 2 ≈ 1.9999999999
true
case_when
Julia doesn’t have a syntax for cases yet. The shortest equivalent that comes to mind is as follows:
julia> R"""
case_when(x > 3 & x < 5 ~ "medium",
x <= 3 ~ "small",
x >= 5 ~ "large",
TRUE ~ "unknown")
"""
RObject{StrSxp}
[1] "small" "small" "large" "large" "medium" "small" "small" "small"
julia> map(x) do x_
if (3 < x_ < 5) "medium"
elseif (x_ <= 3) "small"
elseif (5 <= x_) "large"
else "unknown"
end
end
8-element Vector{String}:
"small"
"small"
"large"
"large"
"medium"
"small"
"small"
"small"
coalesce
Both functions are inspired by the SQL command and so work nearly identically:
julia> R"""
y <- c(NA, 2, 5, NA, 6)
z <- c(3, 7, NA, 9, 4)
coalesce(y,z)
"""
RObject{RealSxp}
[1] 3 2 5 9 6
julia> y = [missing, 2, 5, missing, 6];
julia> z = [3, 7, missing, 9, 4];
julia> coalesce.(y, z)'
1×5 adjoint(::Vector{Int64}) with eltype Int64:
3 2 5 9 6
It is important to note that vectorizing coalesce (i.e. the dot) is necessary. Otherwise, since the y
vector isn’t missing, nothing happens.
if_else
The julia equivalent is the vectorized ifelse
:
julia> R"""
if_else(x > 3, "large", "small")
"""
RObject{StrSxp}
[1] "small" "small" "large" "large" "large" "small" "small" "small"
julia> ifelse.(x .> 3, "large", "small")
8-element Vector{String}:
"small"
"small"
"large"
"large"
"large"
"small"
"small"
"small"
na_if
Simply use the more general functions in Julia:
julia> R"na_if(x, 9)"
RObject{RealSxp}
[1] 1 3 5 NA 4 3 2 2
julia> replace(x, 9=>missing)'
1×8 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
1 3 5 missing 4 3 2 2
pmax
and pmin
julia> R"pmax(y,z)"
RObject{RealSxp}
[1] NA 7 NA NA 6
julia> max.(y,z)'
1×5 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
missing 7 missing missing 6
recode
Again, simply using the more general function replace
:
julia> R"""
cvec <- c("a", "b", "a", "c", "b", "a", "c")
recode(cvec, a = "apple")
"""
RObject{StrSxp}
[1] "apple" "b" "apple" "c" "b" "apple" "c"
julia> cvec = ["a", "b", "a", "c", "b", "a", "c"];
julia> replace(cvec, "a"=>"apple")
7-element Vector{String}:
"apple"
"b"
"apple"
"c"
"b"
"apple"
"c"
Summary Functions
Back to dataframes where summary functions are better illustrated.
n()
julia> R"mpg %>% group_by(cyl) %>% summarise(n=n())"
RObject{VecSxp}
# A tibble: 4 x 2
cyl n
<int> <int>
1 4 81
2 5 4
3 6 79
4 8 70
julia> groupby(mpg, :Cyl) |> x->combine(x, nrow)
4×2 DataFrame
Row │ Cyl nrow
│ Int32 Int64
─────┼──────────────
1 │ 4 81
2 │ 5 4
3 │ 6 79
4 │ 8 70
n_distinct
julia> R"mpg %>% group_by(cyl) %>% summarise(trans=n_distinct(trans))"
RObject{VecSxp}
# A tibble: 4 x 2
cyl trans
<int> <int>
1 4 9
2 5 2
3 6 8
4 8 8
julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans =>
(x_->length(unique(x_))) => :Trans)
4×2 DataFrame
Row │ Cyl Trans
│ Int32 Int64
─────┼──────────────
1 │ 4 9
2 │ 5 2
3 │ 6 8
4 │ 8 8
Alternatively, one could just composition the two inner functions instead (use \circ to write the operator):
julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans => length ∘ unique => :Trans)
4×2 DataFrame
Row │ Cyl Trans
│ Int32 Int64
─────┼──────────────
1 │ 4 9
2 │ 5 2
3 │ 6 8
4 │ 8 8
mean
and median
julia> R"mean(x)"
RObject{RealSxp}
[1] 3.625
julia> mean(x)
3.625
julia> R"median(x)"
RObject{RealSxp}
[1] 3
julia> median(x)
3.0
first
, last
, nth
The Julia equivalents are also named first and last. As for nth
, one has to simply use array indexing.
Rank and Spread functions
These are all available in Julia by similar names:
julia> R"c(max(x), min(x), quantile(x, 0.25), IQR(x), mad(x), sd(x), var(x))"
RObject{RealSxp}
25%
9.000000 1.000000 2.000000 2.250000 1.482600 2.503569 6.267857
julia> [maximum(x); minimum(x); quantile(x, 0.25); iqr(x); mad(x); std(x); var(x)]'
1×7 adjoint(::Vector{Float64}) with eltype Float64:
9.0 1.0 2.0 2.25 1.4826 2.50357 6.26786
Row Names
Julia dataframes do not have row names by design so there is nothing to worry about here.
Combining Tables
Combining tables is really easy in Julia and is similar to how it is done in R for the most part (since both are inspired by SQL verbs).
Column and row binding
Binding columns is done with the hcat
function:
julia> R"""
df1 <- tibble(a = 1:3, b = 4:6)
df2 <- tibble(c = 7:9, d = 10:12)
bind_cols(df1, df2)
"""
RObject{VecSxp}
# A tibble: 3 x 4
a b c d
<int> <int> <int> <int>
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
julia> df1 = DataFrame(a=1:3, b=4:6);
julia> df2 = DataFrame(c=7:9, d=10:12);
julia> hcat(df1, df2)
3×4 DataFrame
Row │ a b c d
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
Binding rows requires vcat:
julia> R"""
df1 <- tibble(a = 1:3, b = 4:6)
df2 <- tibble(a = 7:9, b = 10:12)
bind_rows(df1, df2)
"""
RObject{VecSxp}
# A tibble: 6 x 2
a b
<int> <int>
1 1 4
2 2 5
3 3 6
4 7 10
5 8 11
6 9 12
julia> df1 = DataFrame(a=1:3, b=4:6);
julia> df2 = DataFrame(a=7:9, b=10:12);
julia> vcat(df1, df2)
6×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
4 │ 7 10
5 │ 8 11
6 │ 9 12
Joins
Example data taken from DataFrames.jl
:
julia> df1 = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
Category = [1, 2, 3, 4, 5]);
julia> df2 = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
Name = ["a", "b", "c", "d", "e"]);
julia> innerjoin(df1, df2, on = [:City => :Location, :Job => :Work])
9×4 DataFrame
Row │ City Job Category Name
│ String String Int64 String
─────┼─────────────────────────────────────
1 │ Amsterdam Lawyer 1 a
2 │ London Lawyer 2 b
3 │ London Lawyer 3 b
4 │ London Lawyer 2 c
5 │ London Lawyer 3 c
6 │ New York Doctor 4 d
7 │ New York Doctor 5 d
8 │ New York Doctor 4 e
9 │ New York Doctor 5 e
In R, it would be:
julia> R"""
df1 <- tibble(City = c("Amsterdam", "London", "London", "New York", "New York"),
Job = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
Category = c(1, 2, 3, 4, 5))
df2 <- tibble(Location = c("Amsterdam", "London", "London", "New York", "New York"),
Work = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
Name = c("a", "b", "c", "d", "e"))
inner_join(df1, df2, by = c("City" = "Location", "Job" = "Work"))
"""
RObject{VecSxp}
# A tibble: 9 x 4
City Job Category Name
<chr> <chr> <dbl> <chr>
1 Amsterdam Lawyer 1 a
2 London Lawyer 2 b
3 London Lawyer 2 c
4 London Lawyer 3 b
5 London Lawyer 3 c
6 New York Doctor 4 d
7 New York Doctor 4 e
8 New York Doctor 5 d
9 New York Doctor 5 e
Set operations in R are somewhat redunant in the presence of join verbs, and so in Julia one needs to rely soley on those join verbs.
This concludes the dplyr
cheatsheet translation to DataFrames.jl
commands.