Introduction
With the release of v1.0 of the DataFrames.jl
package, it would seem appropriate to introduce a rather comprehensive cheatsheet of it. One that is of special use to people who come from tidyverse
(arguably, the best data transformation syntax there is for combined expressiveness and brevity).
The order of topics is the same as dplyr
cheatsheet.
using RDatasets, RCall, DataFrames, StatsBase, InteractiveErrors
R"library(dplyr)";
Setting up the data using RCall
data = dataset("datasets", "iris")
@rput data;
Summarize Cases
summarize
julia> R"summarize(data, avSL = mean(SepalLength), avSW = mean(SepalWidth))"
RObject{VecSxp}
avSL avSW
1 5.843333 3.057333
In Julia:
julia> combine(data, [:SepalLength, :SepalWidth] .=> mean .=> [:avSL, :avSW])
1×2 DataFrame
Row │ avSL avSW
│ Float64 Float64
─────┼──────────────────
1 │ 5.84333 3.05733
There are two important things to note. Firstly, broadcasting is mandatory. Secondly, the syntax allows for one factoring of the function in the middle.
Now let us summarize by different functions:
julia> combine(data, [:SepalLength, :SepalWidth] .=> [maximum, minimum] .=>
[:maxSL, :minSW])
1×2 DataFrame
Row │ maxSL minSW
│ Float64 Float64
─────┼──────────────────
1 │ 7.9 2.0
And with anonymous functions:
julia> combine(data, [:SepalLength, :SepalWidth] .=>
[x->sum(x./100), x->sum(sqrt.(x))] .=> [:x1, :x2])
1×2 DataFrame
Row │ x1 x2
│ Float64 Float64
─────┼──────────────────
1 │ 8.765 261.619
The r equivalent would be:
julia> R"""
summarize(data, x1=sum(SepalLength/100), x2=sum(sqrt(SepalWidth)))
"""
RObject{VecSxp}
x1 x2
1 8.765 261.6187
The reason that r is often-times nicer is quotation. With DataFrames.jl on the other hand, the idea is to simply pass the actual data and data identifiers around. This means greater flexibility but slightly more verbose syntax.
For instance, consider summarizing by matching column names:
julia> combine(data, filter(x->occursin.("Sepal", x), names(data)) .=> mean)
1×2 DataFrame
Row │ SepalLength_mean SepalWidth_mean
│ Float64 Float64
─────┼───────────────────────────────────
1 │ 5.84333 3.05733
In R, one needs to explicitly use other function to escape the quoting procedure:
julia> R"""
summarize(data, across(contains("Sepal"), mean))
"""
RObject{VecSxp}
SepalLength SepalWidth
1 5.843333 3.057333
Both have their ups and downs. Basically, the field of nonstandard evaluation will require its own functions, whereas, by avoiding that, one can rely on existing functions or at least only slightly modified functions (based on multiple dispatch for instance) to arrive at the desired results.
count
data = dataset("ggplot2", "mpg")
@rput data;
julia> R"""
data <- as_tibble(data)
count(data, Cyl, Drv)
"""
RObject{VecSxp}
# A tibble: 9 x 3
Cyl Drv n
<int> <fct> <int>
1 4 4 23
2 4 f 58
3 5 f 4
4 6 4 32
5 6 f 43
6 6 r 4
7 8 4 48
8 8 f 1
9 8 r 21
Any function that operates on (sub)dataframes can be used immediately (nrow here):
julia> combine(groupby(data, [:Cyl, :Drv], sort = true), nrow)
9×3 DataFrame
Row │ Cyl Drv nrow
│ Int32 Cat… Int64
─────┼────────────────────
1 │ 4 4 23
2 │ 4 f 58
3 │ 5 f 4
4 │ 6 4 32
5 │ 6 f 43
6 │ 6 r 4
7 │ 8 4 48
8 │ 8 f 1
9 │ 8 r 21
This was a proper interlude to the next topic, grouping.
Group Cases
Ungroup
The general grouping operation in DataFrames.jl is as above, whereby a “sub” dataframe is recognized, perhaps not overly different from views in arrays.
Ungrouping is done rather automatically. The default value of ungroup
in the transformation functions is true
.
julia> df = groupby(data, [:Cyl, :Drv], sort = true)
GroupedDataFrame with 9 groups based on keys: Cyl, Drv
First Group (23 rows): Cyl = 4, Drv = CategoricalArrays.CategoricalValue{String, UInt8} "4"
Row │ Manufacturer Model Displ Year Cyl Trans D ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… C ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi a4 quattro 1.8 1999 4 manual(m5) 4 ⋯
2 │ audi a4 quattro 1.8 1999 4 auto(l5) 4
3 │ audi a4 quattro 2.0 2008 4 manual(m6) 4
4 │ audi a4 quattro 2.0 2008 4 auto(s6) 4
5 │ subaru forester awd 2.5 1999 4 manual(m5) 4 ⋯
6 │ subaru forester awd 2.5 1999 4 auto(l4) 4
7 │ subaru forester awd 2.5 2008 4 manual(m5) 4
8 │ subaru forester awd 2.5 2008 4 manual(m5) 4
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
16 │ subaru impreza awd 2.5 2008 4 auto(s4) 4 ⋯
17 │ subaru impreza awd 2.5 2008 4 manual(m5) 4
18 │ subaru impreza awd 2.5 2008 4 manual(m5) 4
19 │ toyota 4runner 4wd 2.7 1999 4 manual(m5) 4
20 │ toyota 4runner 4wd 2.7 1999 4 auto(l4) 4 ⋯
21 │ toyota toyota tacoma 4wd 2.7 1999 4 manual(m5) 4
22 │ toyota toyota tacoma 4wd 2.7 1999 4 auto(l4) 4
23 │ toyota toyota tacoma 4wd 2.7 2008 4 manual(m5) 4
5 columns and 7 rows omitted
⋮
Last Group (21 rows): Cyl = 8, Drv = CategoricalArrays.CategoricalValue{String, UInt8} "r"
Row │ Manufacturer Model Displ Year Cyl Trans ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) ⋯
2 │ chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4)
3 │ chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4)
4 │ chevrolet c1500 suburban 2wd 5.7 1999 8 auto(l4)
5 │ chevrolet c1500 suburban 2wd 6.0 2008 8 auto(l4) ⋯
6 │ chevrolet corvette 5.7 1999 8 manual(m6)
7 │ chevrolet corvette 5.7 1999 8 auto(l4)
8 │ chevrolet corvette 6.2 2008 8 manual(m6)
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
14 │ ford mustang 4.6 1999 8 auto(l4) ⋯
15 │ ford mustang 4.6 1999 8 manual(m5)
16 │ ford mustang 4.6 2008 8 manual(m5)
17 │ ford mustang 4.6 2008 8 auto(l5)
18 │ ford mustang 5.4 2008 8 manual(m6) ⋯
19 │ lincoln navigator 2wd 5.4 1999 8 auto(l4)
20 │ lincoln navigator 2wd 5.4 1999 8 auto(l4)
21 │ lincoln navigator 2wd 5.4 2008 8 auto(l6)
5 columns and 5 rows omitted
julia> df2 = combine(df, :)
234×11 DataFrame
Row │ Cyl Drv Manufacturer Model Displ Year Trans ⋯
│ Int32 Cat… Cat… Cat… Float64 Int32 Cat… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 4 4 audi a4 quattro 1.8 1999 manual(m5) ⋯
2 │ 4 4 audi a4 quattro 1.8 1999 auto(l5)
3 │ 4 4 audi a4 quattro 2.0 2008 manual(m6)
4 │ 4 4 audi a4 quattro 2.0 2008 auto(s6)
5 │ 4 4 subaru forester awd 2.5 1999 manual(m5) ⋯
6 │ 4 4 subaru forester awd 2.5 1999 auto(l4)
7 │ 4 4 subaru forester awd 2.5 2008 manual(m5)
8 │ 4 4 subaru forester awd 2.5 2008 manual(m5)
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
228 │ 8 r ford mustang 4.6 1999 manual(m5) ⋯
229 │ 8 r ford mustang 4.6 2008 manual(m5)
230 │ 8 r ford mustang 4.6 2008 auto(l5)
231 │ 8 r ford mustang 5.4 2008 manual(m6)
232 │ 8 r lincoln navigator 2wd 5.4 1999 auto(l4) ⋯
233 │ 8 r lincoln navigator 2wd 5.4 1999 auto(l4)
234 │ 8 r lincoln navigator 2wd 5.4 2008 auto(l6)
4 columns and 219 rows omitted
The colon indicates that all items should be retained (whether columns or rows, depending on the specific argument it is used in place of). Here, it means to keep all rows and columns. However, since combine does ungrouping by default, that means retaining the original content of the data:
julia> nrow(df2) == nrow(data)
true
It is important to note, however, that presently, grouping changes the row order.
julia> df3 = combine(data, :)
234×11 DataFrame
Row │ Manufacturer Model Displ Year Cyl Trans Drv ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi a4 1.8 1999 4 auto(l5) f ⋯
2 │ audi a4 1.8 1999 4 manual(m5) f
3 │ audi a4 2.0 2008 4 manual(m6) f
4 │ audi a4 2.0 2008 4 auto(av) f
5 │ audi a4 2.8 1999 6 auto(l5) f ⋯
6 │ audi a4 2.8 1999 6 manual(m5) f
7 │ audi a4 3.1 2008 6 auto(av) f
8 │ audi a4 quattro 1.8 1999 4 manual(m5) 4
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
228 │ volkswagen passat 1.8 1999 4 manual(m5) f ⋯
229 │ volkswagen passat 1.8 1999 4 auto(l5) f
230 │ volkswagen passat 2.0 2008 4 auto(s6) f
231 │ volkswagen passat 2.0 2008 4 manual(m6) f
232 │ volkswagen passat 2.8 1999 6 auto(l5) f ⋯
233 │ volkswagen passat 2.8 1999 6 manual(m5) f
234 │ volkswagen passat 3.6 2008 6 auto(s6) f
5 columns and 219 rows omitted
julia> df3 == data
true
whereas:
julia> df2 == data
false
In r, use of the ungroup
function is needed.
Row-wise operation
Row-wise is, conceptually, setting each row as its own group. Nevertheless, oftentimes we do not need to think this way.
Say you want to know the minimum between 1.5 times the city miles per gallon and highway miles per galon for each model. In r, it would require this:
julia> R"""
data %>% rowwise() %>% mutate(maxy = max(1.5*Cty, Hwy)) %>%
select(maxy, everything())
"""
RObject{VecSxp}
# A tibble: 234 x 12
# Rowwise:
maxy Manufacturer Model Displ Year Cyl Trans Drv Cty Hwy Fl
<dbl> <fct> <fct> <dbl> <int> <int> <fct> <fct> <int> <int> <fct>
1 29 audi a4 1.8 1999 4 auto(l… f 18 29 p
2 31.5 audi a4 1.8 1999 4 manual… f 21 29 p
3 31 audi a4 2 2008 4 manual… f 20 31 p
4 31.5 audi a4 2 2008 4 auto(a… f 21 30 p
5 26 audi a4 2.8 1999 6 auto(l… f 16 26 p
6 27 audi a4 2.8 1999 6 manual… f 18 26 p
7 27 audi a4 3.1 2008 6 auto(a… f 18 27 p
8 27 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p
9 25 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p
10 30 audi a4 quat… 2 2008 4 manual… 4 20 28 p
# … with 224 more rows, and 1 more variable: Class <fct>
In Julia, it is simply a matter of specifying the operation itself to rowwise:
julia> transform(data, [:Cty, :Hwy] => ByRow((x,y)->max(1.5*x,y)) => :maxy) |>
x->x[:, Cols([:maxy], :)]
234×12 DataFrame
Row │ maxy Manufacturer Model Displ Year Cyl Trans ⋯
│ Float64 Categorical… Categorical… Float64 Int32 Int32 Categorical ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 29.0 audi a4 1.8 1999 4 auto(l5) ⋯
2 │ 31.5 audi a4 1.8 1999 4 manual(m5)
3 │ 31.0 audi a4 2.0 2008 4 manual(m6)
4 │ 31.5 audi a4 2.0 2008 4 auto(av)
5 │ 26.0 audi a4 2.8 1999 6 auto(l5) ⋯
6 │ 27.0 audi a4 2.8 1999 6 manual(m5)
7 │ 27.0 audi a4 3.1 2008 6 auto(av)
8 │ 27.0 audi a4 quattro 1.8 1999 4 manual(m5)
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
228 │ 31.5 volkswagen passat 1.8 1999 4 manual(m5) ⋯
229 │ 29.0 volkswagen passat 1.8 1999 4 auto(l5)
230 │ 28.5 volkswagen passat 2.0 2008 4 auto(s6)
231 │ 31.5 volkswagen passat 2.0 2008 4 manual(m6)
232 │ 26.0 volkswagen passat 2.8 1999 6 auto(l5) ⋯
233 │ 27.0 volkswagen passat 2.8 1999 6 manual(m5)
234 │ 26.0 volkswagen passat 3.6 2008 6 auto(s6)
6 columns and 219 rows omitted
Manipulate Cases
Filter
Filtering essentially uses the same principles as applied to any other Julia data structure, and can even use the conventional filter
function for this purpose:
julia> filter(:Cty => x->(x>20), data)[:, Cols(:Cty,:)]
45×11 DataFrame
Row │ Cty Manufacturer Model Displ Year Cyl Trans ⋯
│ Int32 Categorical… Categorical… Float64 Int32 Int32 Categorical… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 21 audi a4 1.8 1999 4 manual(m5) ⋯
2 │ 21 audi a4 2.0 2008 4 auto(av)
3 │ 22 chevrolet malibu 2.4 2008 4 auto(l4)
4 │ 28 honda civic 1.6 1999 4 manual(m5)
5 │ 24 honda civic 1.6 1999 4 auto(l4) ⋯
6 │ 25 honda civic 1.6 1999 4 manual(m5)
7 │ 23 honda civic 1.6 1999 4 manual(m5)
8 │ 24 honda civic 1.6 1999 4 auto(l4)
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
39 │ 21 volkswagen jetta 2.5 2008 5 auto(s6) ⋯
40 │ 21 volkswagen jetta 2.5 2008 5 manual(m5)
41 │ 35 volkswagen new beetle 1.9 1999 4 manual(m5)
42 │ 29 volkswagen new beetle 1.9 1999 4 auto(l4)
43 │ 21 volkswagen new beetle 2.0 1999 4 manual(m5) ⋯
44 │ 21 volkswagen passat 1.8 1999 4 manual(m5)
45 │ 21 volkswagen passat 2.0 2008 4 manual(m6)
4 columns and 30 rows omitted
Julia data collection functions usually take a function as the first argument in order to facilitate the use of do
blocks. This doesn’t look nice when using piping. Therefore, my suggestion is to stick with the subset
function. Besides, apparently, filter
doesn’t work on data groups.
julia> subset(data, :Cty => x->(x.>20))[:, Cols(:Cty,:)]
45×11 DataFrame
Row │ Cty Manufacturer Model Displ Year Cyl Trans ⋯
│ Int32 Categorical… Categorical… Float64 Int32 Int32 Categorical… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 21 audi a4 1.8 1999 4 manual(m5) ⋯
2 │ 21 audi a4 2.0 2008 4 auto(av)
3 │ 22 chevrolet malibu 2.4 2008 4 auto(l4)
4 │ 28 honda civic 1.6 1999 4 manual(m5)
5 │ 24 honda civic 1.6 1999 4 auto(l4) ⋯
6 │ 25 honda civic 1.6 1999 4 manual(m5)
7 │ 23 honda civic 1.6 1999 4 manual(m5)
8 │ 24 honda civic 1.6 1999 4 auto(l4)
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
39 │ 21 volkswagen jetta 2.5 2008 5 auto(s6) ⋯
40 │ 21 volkswagen jetta 2.5 2008 5 manual(m5)
41 │ 35 volkswagen new beetle 1.9 1999 4 manual(m5)
42 │ 29 volkswagen new beetle 1.9 1999 4 auto(l4)
43 │ 21 volkswagen new beetle 2.0 1999 4 manual(m5) ⋯
44 │ 21 volkswagen passat 1.8 1999 4 manual(m5)
45 │ 21 volkswagen passat 2.0 2008 4 manual(m6)
4 columns and 30 rows omitted
Distinct
julia> unique(data, [:Cyl, :Drv])
9×11 DataFrame
Row │ Manufacturer Model Displ Year Cyl Trans ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi a4 1.8 1999 4 auto(l5) ⋯
2 │ audi a4 2.8 1999 6 auto(l5)
3 │ audi a4 quattro 1.8 1999 4 manual(m5)
4 │ audi a4 quattro 2.8 1999 6 auto(l5)
5 │ audi a6 quattro 4.2 2008 8 auto(s6) ⋯
6 │ chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4)
7 │ ford mustang 3.8 1999 6 manual(m5)
8 │ pontiac grand prix 5.3 2008 8 auto(s4)
9 │ volkswagen jetta 2.5 2008 5 auto(s6) ⋯
5 columns omitted
julia> R"""
distinct(data, Cyl, Drv)
"""
RObject{VecSxp}
# A tibble: 9 x 2
Cyl Drv
<int> <fct>
1 4 f
2 6 f
3 4 4
4 6 4
5 8 4
6 8 r
7 6 r
8 8 f
9 5 f
Slice
The point of a function such as slice
is to be used in pipes really. Otherwise one can simply do this:
julia> R"""
data[4:10,]
"""
RObject{VecSxp}
# A tibble: 7 x 11
Manufacturer Model Displ Year Cyl Trans Drv Cty Hwy Fl Class
<fct> <fct> <dbl> <int> <int> <fct> <fct> <int> <int> <fct> <fct>
1 audi a4 2 2008 4 auto(a… f 21 30 p compa…
2 audi a4 2.8 1999 6 auto(l… f 16 26 p compa…
3 audi a4 2.8 1999 6 manual… f 18 26 p compa…
4 audi a4 3.1 2008 6 auto(a… f 18 27 p compa…
5 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p compa…
6 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p compa…
7 audi a4 quat… 2 2008 4 manual… 4 20 28 p compa…
Same with julia:
julia> data[4:10,:]
7×11 DataFrame
Row │ Manufacturer Model Displ Year Cyl Trans Drv ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi a4 2.0 2008 4 auto(av) f ⋯
2 │ audi a4 2.8 1999 6 auto(l5) f
3 │ audi a4 2.8 1999 6 manual(m5) f
4 │ audi a4 3.1 2008 6 auto(av) f
5 │ audi a4 quattro 1.8 1999 4 manual(m5) 4 ⋯
6 │ audi a4 quattro 1.8 1999 4 auto(l5) 4
7 │ audi a4 quattro 2.0 2008 4 manual(m6) 4
5 columns omitted
Slice-sample
The StatsBase module can be used here.
julia> data[rand(1:nrow(data), 10),:]
10×11 DataFrame
Row │ Manufacturer Model Displ Year Cyl Trans D ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… C ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ ford explorer 4wd 4.0 1999 6 auto(l5) 4 ⋯
2 │ nissan altima 2.4 1999 4 auto(l4) f
3 │ volkswagen new beetle 1.9 1999 4 manual(m5) f
4 │ hyundai tiburon 2.7 2008 6 manual(m5) f
5 │ toyota toyota tacoma 4wd 2.7 1999 4 manual(m5) 4 ⋯
6 │ chevrolet corvette 6.2 2008 8 manual(m6) r
7 │ hyundai sonata 3.3 2008 6 auto(l5) f
8 │ dodge durango 4wd 5.9 1999 8 auto(l4) 4
9 │ ford f150 pickup 4wd 4.6 1999 8 manual(m5) 4 ⋯
10 │ volkswagen new beetle 2.5 2008 5 auto(s6) f
5 columns omitted
Slice min and max
This is just a convenience function:
julia> data[data.Cty .>= quantile(data.Cty, 0.75),:]
76×11 DataFrame
Row │ Manufacturer Model Displ Year Cyl Trans Drv ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi a4 1.8 1999 4 manual(m5) f ⋯
2 │ audi a4 2.0 2008 4 manual(m6) f
3 │ audi a4 2.0 2008 4 auto(av) f
4 │ audi a4 quattro 2.0 2008 4 manual(m6) 4
5 │ audi a4 quattro 2.0 2008 4 auto(s6) 4 ⋯
6 │ chevrolet malibu 2.4 1999 4 auto(l4) f
7 │ chevrolet malibu 2.4 2008 4 auto(l4) f
8 │ honda civic 1.6 1999 4 manual(m5) f
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
70 │ volkswagen new beetle 2.0 1999 4 manual(m5) f ⋯
71 │ volkswagen new beetle 2.0 1999 4 auto(l4) f
72 │ volkswagen new beetle 2.5 2008 5 manual(m5) f
73 │ volkswagen new beetle 2.5 2008 5 auto(s6) f
74 │ volkswagen passat 1.8 1999 4 manual(m5) f ⋯
75 │ volkswagen passat 2.0 2008 4 auto(s6) f
76 │ volkswagen passat 2.0 2008 4 manual(m6) f
5 columns and 61 rows omitted
The R code:
julia> R"""
slice_max(data, Cty, prop = 0.25)
"""
RObject{VecSxp}
# A tibble: 76 x 11
Manufacturer Model Displ Year Cyl Trans Drv Cty Hwy Fl Class
<fct> <fct> <dbl> <int> <int> <fct> <fct> <int> <int> <fct> <fct>
1 volkswagen new be… 1.9 1999 4 manual… f 35 44 d subco…
2 volkswagen jetta 1.9 1999 4 manual… f 33 44 d compa…
3 volkswagen new be… 1.9 1999 4 auto(l… f 29 41 d subco…
4 honda civic 1.6 1999 4 manual… f 28 33 r subco…
5 toyota corolla 1.8 2008 4 manual… f 28 37 r compa…
6 honda civic 1.8 2008 4 manual… f 26 34 r subco…
7 toyota corolla 1.8 1999 4 manual… f 26 35 r compa…
8 toyota corolla 1.8 2008 4 auto(l… f 26 35 r compa…
9 honda civic 1.6 1999 4 manual… f 25 32 r subco…
10 honda civic 1.8 2008 4 auto(l… f 25 36 r subco…
# … with 66 more rows
Slice head and tail
again, mostly meaningful for pipes. Nevertheless:
julia> data[end-5:end,:]
6×11 DataFrame
Row │ Manufacturer Model Displ Year Cyl Trans Drv ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ volkswagen passat 1.8 1999 4 auto(l5) f ⋯
2 │ volkswagen passat 2.0 2008 4 auto(s6) f
3 │ volkswagen passat 2.0 2008 4 manual(m6) f
4 │ volkswagen passat 2.8 1999 6 auto(l5) f
5 │ volkswagen passat 2.8 1999 6 manual(m5) f ⋯
6 │ volkswagen passat 3.6 2008 6 auto(s6) f
5 columns omitted
Arrange
Instead of arrange, sort
is used. The syntax is very clear:
julia> sort(data, [order(:Year, rev=true), :Displ])
234×11 DataFrame
Row │ Manufacturer Model Displ Year Cyl Trans ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ honda civic 1.8 2008 4 manual(m5) ⋯
2 │ honda civic 1.8 2008 4 auto(l5)
3 │ honda civic 1.8 2008 4 auto(l5)
4 │ toyota corolla 1.8 2008 4 manual(m5)
5 │ toyota corolla 1.8 2008 4 auto(l4) ⋯
6 │ audi a4 2.0 2008 4 manual(m6)
7 │ audi a4 2.0 2008 4 auto(av)
8 │ audi a4 quattro 2.0 2008 4 manual(m6)
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
228 │ chevrolet c1500 suburban 2wd 5.7 1999 8 auto(l4) ⋯
229 │ chevrolet corvette 5.7 1999 8 manual(m6)
230 │ chevrolet corvette 5.7 1999 8 auto(l4)
231 │ chevrolet k1500 tahoe 4wd 5.7 1999 8 auto(l4)
232 │ dodge durango 4wd 5.9 1999 8 auto(l4) ⋯
233 │ dodge ram 1500 pickup 4wd 5.9 1999 8 auto(l4)
234 │ chevrolet k1500 tahoe 4wd 6.5 1999 8 auto(l4)
5 columns and 219 rows omitted
julia> R"""
arrange(data, desc(Year), Displ)
"""
RObject{VecSxp}
# A tibble: 234 x 11
Manufacturer Model Displ Year Cyl Trans Drv Cty Hwy Fl Class
<fct> <fct> <dbl> <int> <int> <fct> <fct> <int> <int> <fct> <fct>
1 honda civic 1.8 2008 4 manual… f 26 34 r subco…
2 honda civic 1.8 2008 4 auto(l… f 25 36 r subco…
3 honda civic 1.8 2008 4 auto(l… f 24 36 c subco…
4 toyota corolla 1.8 2008 4 manual… f 28 37 r compa…
5 toyota corolla 1.8 2008 4 auto(l… f 26 35 r compa…
6 audi a4 2 2008 4 manual… f 20 31 p compa…
7 audi a4 2 2008 4 auto(a… f 21 30 p compa…
8 audi a4 qua… 2 2008 4 manual… 4 20 28 p compa…
9 audi a4 qua… 2 2008 4 auto(s… 4 19 27 p compa…
10 honda civic 2 2008 4 manual… f 21 29 p subco…
# … with 224 more rows
One can in fact use any kind of sorting. See the sorting section in DataFrames.jl.
Adding row at position
Again, this is just a convenience function. Let’s add a duplicate row for instance:
julia> vcat(data, DataFrame(data[5,:]))
235×11 DataFrame
Row │ Manufacturer Model Displ Year Cyl Trans Drv ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi a4 1.8 1999 4 auto(l5) f ⋯
2 │ audi a4 1.8 1999 4 manual(m5) f
3 │ audi a4 2.0 2008 4 manual(m6) f
4 │ audi a4 2.0 2008 4 auto(av) f
5 │ audi a4 2.8 1999 6 auto(l5) f ⋯
6 │ audi a4 2.8 1999 6 manual(m5) f
7 │ audi a4 3.1 2008 6 auto(av) f
8 │ audi a4 quattro 1.8 1999 4 manual(m5) 4
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
229 │ volkswagen passat 1.8 1999 4 auto(l5) f ⋯
230 │ volkswagen passat 2.0 2008 4 auto(s6) f
231 │ volkswagen passat 2.0 2008 4 manual(m6) f
232 │ volkswagen passat 2.8 1999 6 auto(l5) f
233 │ volkswagen passat 2.8 1999 6 manual(m5) f ⋯
234 │ volkswagen passat 3.6 2008 6 auto(s6) f
235 │ audi a4 2.8 1999 6 auto(l5) f
5 columns and 220 rows omitted
The issue is that specifying the position at which to insert the row can take extra work. See also append!
and push!
with the catch being modification of the original dataframe.
Extract variables
Conventional indexing can be used:
pull
One can just use the fact that the dataframe is a collection of vectors in a dictionary-like relation:
julia> data.Hwy
234-element Vector{Int32}:
29
29
31
30
26
26
27
26
25
28
⋮
28
29
29
29
28
29
26
26
26
julia> R"pull(data, Hwy)"
RObject{IntSxp}
[1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17 26 23
[26] 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21 23 23 19 18
[51] 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16 17 15 17
[76] 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25 26 24 21 22 23 22 20 33
[101] 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20
[126] 17 12 19 18 14 15 18 18 15 17 16 18 17 19 19 17 29 27 31 32 27 26 26 25 25
[151] 17 17 20 18 26 26 27 28 25 25 24 27 25 26 23 26 26 26 26 25 27 25 27 20 20
[176] 19 17 20 17 29 27 31 31 26 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18
[201] 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26
[226] 28 29 29 29 28 29 26 26 26
select
julia> data[:, [:Hwy, :Cty]]
234×2 DataFrame
Row │ Hwy Cty
│ Int32 Int32
─────┼──────────────
1 │ 29 18
2 │ 29 21
3 │ 31 20
4 │ 30 21
5 │ 26 16
6 │ 26 18
7 │ 27 18
8 │ 26 18
⋮ │ ⋮ ⋮
228 │ 29 21
229 │ 29 18
230 │ 28 19
231 │ 29 21
232 │ 26 16
233 │ 26 18
234 │ 26 17
219 rows omitted
Note that when selecting only a single column, it is no different than the dot syntax and the return structure is not dataframe:
julia> data[:, :Hwy]
234-element Vector{Int32}:
29
29
31
30
26
26
27
26
25
28
⋮
28
29
29
29
28
29
26
26
26
Of course, wrapping in vector the column symbol gives the desired result:
julia> data[:, [:Hwy]]
234×1 DataFrame
Row │ Hwy
│ Int32
─────┼───────
1 │ 29
2 │ 29
3 │ 31
4 │ 30
5 │ 26
6 │ 26
7 │ 27
8 │ 26
⋮ │ ⋮
228 │ 29
229 │ 29
230 │ 28
231 │ 29
232 │ 26
233 │ 26
234 │ 26
219 rows omitted
relocate
Just use Cols
as before. For instance, to put columns starting with m at front:
julia> data[:, Cols(filter(x->occursin(r"^[mM].*", x), names(data)),
:Year, :Class, :)]
234×11 DataFrame
Row │ Manufacturer Model Year Class Displ Cyl Trans ⋯
│ Categorical… Categorical… Int32 Categorical… Float64 Int32 Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi a4 1999 compact 1.8 4 auto(l ⋯
2 │ audi a4 1999 compact 1.8 4 manual
3 │ audi a4 2008 compact 2.0 4 manual
4 │ audi a4 2008 compact 2.0 4 auto(a
5 │ audi a4 1999 compact 2.8 6 auto(l ⋯
6 │ audi a4 1999 compact 2.8 6 manual
7 │ audi a4 2008 compact 3.1 6 auto(a
8 │ audi a4 quattro 1999 compact 1.8 4 manual
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
228 │ volkswagen passat 1999 midsize 1.8 4 manual ⋯
229 │ volkswagen passat 1999 midsize 1.8 4 auto(l
230 │ volkswagen passat 2008 midsize 2.0 4 auto(s
231 │ volkswagen passat 2008 midsize 2.0 4 manual
232 │ volkswagen passat 1999 midsize 2.8 6 auto(l ⋯
233 │ volkswagen passat 1999 midsize 2.8 6 manual
234 │ volkswagen passat 2008 midsize 3.6 6 auto(s
5 columns and 219 rows omitted
To put Model after Year:
julia> data[:, Cols( setdiff(propertynames(data), [:Model]) |> x->insert!(x,
findfirst(x .== :Year) + 1, :Model))]
234×11 DataFrame
Row │ Manufacturer Displ Year Model Cyl Trans Drv ⋯
│ Categorical… Float64 Int32 Categorical… Int32 Categorical… Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi 1.8 1999 a4 4 auto(l5) f ⋯
2 │ audi 1.8 1999 a4 4 manual(m5) f
3 │ audi 2.0 2008 a4 4 manual(m6) f
4 │ audi 2.0 2008 a4 4 auto(av) f
5 │ audi 2.8 1999 a4 6 auto(l5) f ⋯
6 │ audi 2.8 1999 a4 6 manual(m5) f
7 │ audi 3.1 2008 a4 6 auto(av) f
8 │ audi 1.8 1999 a4 quattro 4 manual(m5) 4
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
228 │ volkswagen 1.8 1999 passat 4 manual(m5) f ⋯
229 │ volkswagen 1.8 1999 passat 4 auto(l5) f
230 │ volkswagen 2.0 2008 passat 4 auto(s6) f
231 │ volkswagen 2.0 2008 passat 4 manual(m6) f
232 │ volkswagen 2.8 1999 passat 6 auto(l5) f ⋯
233 │ volkswagen 2.8 1999 passat 6 manual(m5) f
234 │ volkswagen 3.6 2008 passat 6 auto(s6) f
5 columns and 219 rows omitted
julia> R"""
relocate(data, Model, .after = Year)
"""
RObject{VecSxp}
# A tibble: 234 x 11
Manufacturer Displ Year Model Cyl Trans Drv Cty Hwy Fl Class
<fct> <dbl> <int> <fct> <int> <fct> <fct> <int> <int> <fct> <fct>
1 audi 1.8 1999 a4 4 auto(l… f 18 29 p comp…
2 audi 1.8 1999 a4 4 manual… f 21 29 p comp…
3 audi 2 2008 a4 4 manual… f 20 31 p comp…
4 audi 2 2008 a4 4 auto(a… f 21 30 p comp…
5 audi 2.8 1999 a4 6 auto(l… f 16 26 p comp…
6 audi 2.8 1999 a4 6 manual… f 18 26 p comp…
7 audi 3.1 2008 a4 6 auto(a… f 18 27 p comp…
8 audi 1.8 1999 a4 quat… 4 manual… 4 18 26 p comp…
9 audi 1.8 1999 a4 quat… 4 auto(l… 4 16 25 p comp…
10 audi 2 2008 a4 quat… 4 manual… 4 20 28 p comp…
# … with 224 more rows
Currently, there aren’t any helper functions for selection, though the general procedure isn’t that verbose in base Julia either.
Across function
As mentioned in the beginning, Julia provides quite flexible ways of combining data, and there is no need for extra across functions, mainly due to the employment of the broadcasting concept.
Making new variables
mutate, and transmutate
The mutate equivalent is transform. The select function is the transmutate equivalent, which only selects the specifided columns, and after applying the specified functions, returns only said columns.
rename
julia> rename(data, [:Model, :Year] .=> [:type, :date])
234×11 DataFrame
Row │ Manufacturer type Displ date Cyl Trans Drv ⋯
│ Categorical… Categorical… Float64 Int32 Int32 Categorical… Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ audi a4 1.8 1999 4 auto(l5) f ⋯
2 │ audi a4 1.8 1999 4 manual(m5) f
3 │ audi a4 2.0 2008 4 manual(m6) f
4 │ audi a4 2.0 2008 4 auto(av) f
5 │ audi a4 2.8 1999 6 auto(l5) f ⋯
6 │ audi a4 2.8 1999 6 manual(m5) f
7 │ audi a4 3.1 2008 6 auto(av) f
8 │ audi a4 quattro 1.8 1999 4 manual(m5) 4
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
228 │ volkswagen passat 1.8 1999 4 manual(m5) f ⋯
229 │ volkswagen passat 1.8 1999 4 auto(l5) f
230 │ volkswagen passat 2.0 2008 4 auto(s6) f
231 │ volkswagen passat 2.0 2008 4 manual(m6) f
232 │ volkswagen passat 2.8 1999 6 auto(l5) f ⋯
233 │ volkswagen passat 2.8 1999 6 manual(m5) f
234 │ volkswagen passat 3.6 2008 6 auto(s6) f
5 columns and 219 rows omitted
This concludes the first page of the cheatsheet.