dplyr Cheatsheet to DataFrames.jl Page 2
Continuing Where We Left #
This post shows how to translate the second page of the dplyr
cheatsheet to DataFrames.jl
and Julia commnads.
using RDatasets, RCall, DataFrames, StatsBase, Statistics
R"library(tidyverse)";
Setting up the data using RCall #
mpg = dataset("ggplot2", "mpg");
Vectorized Functions #
lag
, lead
#
julia> R"""
mutate(mpg, leadyear = lead(year)) %>% select(year, leadyear) %>% tail
"""
RObject{VecSxp}
# A tibble: 6 x 2
year leadyear
<int> <int>
1 1999 2008
2 2008 2008
3 2008 1999
4 1999 1999
5 1999 2008
6 2008 NA
In Julia:
julia> transform(mpg, :Year => (x->vcat(x[2:end], missing)) => :leadYear)[
:, [:Year, :leadYear]] |> x->last(x, 10)
10×2 DataFrame
Row │ Year leadYear
│ Int32 Int32?
─────┼─────────────────
1 │ 1999 2008
2 │ 2008 2008
3 │ 2008 1999
4 │ 1999 1999
5 │ 1999 2008
6 │ 2008 2008
7 │ 2008 1999
8 │ 1999 1999
9 │ 1999 2008
10 │ 2008 missing
Note that one needs to wrap the anonymous function inside parentheses to avoid mixing with the =>
operator.
cumall
#
This function returns always false from the first false value it sees. For example:
julia> R"""
x <- c(1, 3, 5, 9, 4, 3, 2, 2)
cumall(x < 6)
"""
RObject{LglSxp}
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
This is just a commulative product for logical values:
julia> x = [1, 3, 5, 9, 4, 3, 2, 2];
julia> cumprod(x .< 6)'
1×8 adjoint(::BitVector) with eltype Bool:
1 1 1 0 0 0 0 0
cumany
#
This one returns true from the first true it sees:
julia> R"""
cumany(x > 6)
"""
RObject{LglSxp}
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
It would be equivalent to a logical sum:
julia> cumsum(x .> 6)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
0 0 0 1 1 1 1 1
cummax
#
This just takes binary comparisons between the present element and its previous element:
julia> R"cummax(x)"
RObject{RealSxp}
[1] 1 3 5 9 9 9 9 9
It is easy with the more general function accumulate
in Julia:
julia> accumulate(max, x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 3 5 9 9 9 9 9
One could have in fact used this function with the previous examples as well by writing the appropriate anonymous function. Other functions such as mean
, max
etc. follow the same rule.
Rank functions #
julia> R"cume_dist(x)"
RObject{RealSxp}
[1] 0.125 0.625 0.875 1.000 0.750 0.625 0.375 0.375
It is just the values of the empirical cdf:
julia> ecdf(x)(x)'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
0.125 0.625 0.875 1.0 0.75 0.625 0.375 0.375
julia> R"dense_rank(x)"
RObject{IntSxp}
[1] 1 3 5 6 4 3 2 2
julia> denserank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 3 5 6 4 3 2 2
julia> R"min_rank(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 4 2 2
Julia equivalent is the competerank
:
julia> competerank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 4 7 8 6 4 2 2
TODO The dplyr
ntile
function seems rather unstandard and so the results differ from Julia:
julia> R"ntile(x, 3)"
RObject{IntSxp}
[1] 1 2 3 3 2 2 1 1
julia> searchsortedfirst.(Ref(nquantile(x,2)), x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 2 3 3 3 2 2 2
As for percen_rank
it is just a normalization to the 0-1 interval, hence requires subtracting 1 first:
julia> R"percent_rank(x)"
RObject{RealSxp}
[1] 0.0000000 0.4285714 0.8571429 1.0000000 0.7142857 0.4285714 0.1428571
[8] 0.1428571
julia> ((competerank(x).-1)/(length(x)-1))'
1×8 adjoint(::Vector{Float64}) with eltype Float64:
0.0 0.428571 0.857143 1.0 0.714286 0.428571 0.142857 0.142857
As for row_number
rank:
julia> R"row_number(x)"
RObject{IntSxp}
[1] 1 4 7 8 6 5 2 3
julia> ordinalrank(x)'
1×8 adjoint(::Vector{Int64}) with eltype Int64:
1 4 7 8 6 5 2 3
between
#
Julia simply understands the between notion in a mathematical notation:
julia> R"between(x, 3, 6)"
RObject{LglSxp}
[1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
julia> (3 .<= x .<= 6)'
1×8 adjoint(::BitVector) with eltype Bool:
0 1 1 0 1 1 0 0
near
#
Use the approximate notation (or the isapprox
function):
julia> R"near(2, 1.9999999999)"
RObject{LglSxp}
[1] TRUE
julia> 2 ≈ 1.9999999999
true
case_when
#
Julia doesn’t have a syntax for cases yet. The shortest equivalent that comes to mind is as follows:
julia> R"""
case_when(x > 3 & x < 5 ~ "medium",
x <= 3 ~ "small",
x >= 5 ~ "large",
TRUE ~ "unknown")
"""
RObject{StrSxp}
[1] "small" "small" "large" "large" "medium" "small" "small" "small"
julia> map(x) do x_
if (3 < x_ < 5) "medium"
elseif (x_ <= 3) "small"
elseif (5 <= x_) "large"
else "unknown"
end
end
8-element Vector{String}:
"small"
"small"
"large"
"large"
"medium"
"small"
"small"
"small"
coalesce
#
Both functions are inspired by the SQL command and so work nearly identically:
julia> R"""
y <- c(NA, 2, 5, NA, 6)
z <- c(3, 7, NA, 9, 4)
coalesce(y,z)
"""
RObject{RealSxp}
[1] 3 2 5 9 6
julia> y = [missing, 2, 5, missing, 6];
julia> z = [3, 7, missing, 9, 4];
julia> coalesce.(y, z)'
1×5 adjoint(::Vector{Int64}) with eltype Int64:
3 2 5 9 6
It is important to note that vectorizing coalesce (i.e. the dot) is necessary. Otherwise, since the y
vector isn’t missing, nothing happens.
if_else
#
The julia equivalent is the vectorized ifelse
:
julia> R"""
if_else(x > 3, "large", "small")
"""
RObject{StrSxp}
[1] "small" "small" "large" "large" "large" "small" "small" "small"
julia> ifelse.(x .> 3, "large", "small")
8-element Vector{String}:
"small"
"small"
"large"
"large"
"large"
"small"
"small"
"small"
na_if
#
Simply use the more general functions in Julia:
julia> R"na_if(x, 9)"
RObject{RealSxp}
[1] 1 3 5 NA 4 3 2 2
julia> replace(x, 9=>missing)'
1×8 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
1 3 5 missing 4 3 2 2
pmax
and pmin
#
julia> R"pmax(y,z)"
RObject{RealSxp}
[1] NA 7 NA NA 6
julia> max.(y,z)'
1×5 adjoint(::Vector{Union{Missing, Int64}}) with eltype Union{Missing, Int64}:
missing 7 missing missing 6
recode
#
Again, simply using the more general function replace
:
julia> R"""
cvec <- c("a", "b", "a", "c", "b", "a", "c")
recode(cvec, a = "apple")
"""
RObject{StrSxp}
[1] "apple" "b" "apple" "c" "b" "apple" "c"
julia> cvec = ["a", "b", "a", "c", "b", "a", "c"];
julia> replace(cvec, "a"=>"apple")
7-element Vector{String}:
"apple"
"b"
"apple"
"c"
"b"
"apple"
"c"
Summary Functions #
Back to dataframes where summary functions are better illustrated.
n()
#
julia> R"mpg %>% group_by(cyl) %>% summarise(n=n())"
RObject{VecSxp}
# A tibble: 4 x 2
cyl n
<int> <int>
1 4 81
2 5 4
3 6 79
4 8 70
julia> groupby(mpg, :Cyl) |> x->combine(x, nrow)
4×2 DataFrame
Row │ Cyl nrow
│ Int32 Int64
─────┼──────────────
1 │ 4 81
2 │ 5 4
3 │ 6 79
4 │ 8 70
n_distinct
#
julia> R"mpg %>% group_by(cyl) %>% summarise(trans=n_distinct(trans))"
RObject{VecSxp}
# A tibble: 4 x 2
cyl trans
<int> <int>
1 4 9
2 5 2
3 6 8
4 8 8
julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans =>
(x_->length(unique(x_))) => :Trans)
4×2 DataFrame
Row │ Cyl Trans
│ Int32 Int64
─────┼──────────────
1 │ 4 9
2 │ 5 2
3 │ 6 8
4 │ 8 8
Alternatively, one could just composition the two inner functions instead (use \circ to write the operator):
julia> groupby(mpg, :Cyl) |> x->combine(x, :Trans => length ∘ unique => :Trans)
4×2 DataFrame
Row │ Cyl Trans
│ Int32 Int64
─────┼──────────────
1 │ 4 9
2 │ 5 2
3 │ 6 8
4 │ 8 8
mean
and median
#
julia> R"mean(x)"
RObject{RealSxp}
[1] 3.625
julia> mean(x)
3.625
julia> R"median(x)"
RObject{RealSxp}
[1] 3
julia> median(x)
3.0
first
, last
, nth
#
The Julia equivalents are also named first and last. As for nth
, one has to simply use array indexing.
Rank and Spread functions #
These are all available in Julia by similar names:
julia> R"c(max(x), min(x), quantile(x, 0.25), IQR(x), mad(x), sd(x), var(x))"
RObject{RealSxp}
25%
9.000000 1.000000 2.000000 2.250000 1.482600 2.503569 6.267857
julia> [maximum(x); minimum(x); quantile(x, 0.25); iqr(x); mad(x); std(x); var(x)]'
1×7 adjoint(::Vector{Float64}) with eltype Float64:
9.0 1.0 2.0 2.25 1.4826 2.50357 6.26786
Row Names #
Julia dataframes do not have row names by design so there is nothing to worry about here.
Combining Tables #
Combining tables is really easy in Julia and is similar to how it is done in R for the most part (since both are inspired by SQL verbs).
Column and row binding #
Binding columns is done with the hcat
function:
julia> R"""
df1 <- tibble(a = 1:3, b = 4:6)
df2 <- tibble(c = 7:9, d = 10:12)
bind_cols(df1, df2)
"""
RObject{VecSxp}
# A tibble: 3 x 4
a b c d
<int> <int> <int> <int>
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
julia> df1 = DataFrame(a=1:3, b=4:6);
julia> df2 = DataFrame(c=7:9, d=10:12);
julia> hcat(df1, df2)
3×4 DataFrame
Row │ a b c d
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 4 7 10
2 │ 2 5 8 11
3 │ 3 6 9 12
Binding rows requires vcat:
julia> R"""
df1 <- tibble(a = 1:3, b = 4:6)
df2 <- tibble(a = 7:9, b = 10:12)
bind_rows(df1, df2)
"""
RObject{VecSxp}
# A tibble: 6 x 2
a b
<int> <int>
1 1 4
2 2 5
3 3 6
4 7 10
5 8 11
6 9 12
julia> df1 = DataFrame(a=1:3, b=4:6);
julia> df2 = DataFrame(a=7:9, b=10:12);
julia> vcat(df1, df2)
6×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
4 │ 7 10
5 │ 8 11
6 │ 9 12
Joins #
Example data taken from DataFrames.jl
:
julia> df1 = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
Category = [1, 2, 3, 4, 5]);
julia> df2 = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
Name = ["a", "b", "c", "d", "e"]);
julia> innerjoin(df1, df2, on = [:City => :Location, :Job => :Work])
9×4 DataFrame
Row │ City Job Category Name
│ String String Int64 String
─────┼─────────────────────────────────────
1 │ Amsterdam Lawyer 1 a
2 │ London Lawyer 2 b
3 │ London Lawyer 3 b
4 │ London Lawyer 2 c
5 │ London Lawyer 3 c
6 │ New York Doctor 4 d
7 │ New York Doctor 5 d
8 │ New York Doctor 4 e
9 │ New York Doctor 5 e
In R, it would be:
julia> R"""
df1 <- tibble(City = c("Amsterdam", "London", "London", "New York", "New York"),
Job = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
Category = c(1, 2, 3, 4, 5))
df2 <- tibble(Location = c("Amsterdam", "London", "London", "New York", "New York"),
Work = c("Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"),
Name = c("a", "b", "c", "d", "e"))
inner_join(df1, df2, by = c("City" = "Location", "Job" = "Work"))
"""
RObject{VecSxp}
# A tibble: 9 x 4
City Job Category Name
<chr> <chr> <dbl> <chr>
1 Amsterdam Lawyer 1 a
2 London Lawyer 2 b
3 London Lawyer 2 c
4 London Lawyer 3 b
5 London Lawyer 3 c
6 New York Doctor 4 d
7 New York Doctor 4 e
8 New York Doctor 5 d
9 New York Doctor 5 e
Set operations in R are somewhat redunant in the presence of join verbs, and so in Julia one needs to rely soley on those join verbs.
This concludes the dplyr
cheatsheet translation to DataFrames.jl
commands.