Introduction #

With the release of v1.0 of the DataFrames.jl package, it would seem appropriate to introduce a rather comprehensive cheatsheet of it. One that is of special use to people who come from tidyverse (arguably, the best data transformation syntax there is for combined expressiveness and brevity).

The order of topics is the same as dplyr cheatsheet.

using RDatasets, RCall, DataFrames, StatsBase, InteractiveErrors

R"library(dplyr)";

Setting up the data using RCall #

data = dataset("datasets", "iris")

@rput data;

Summarize Cases #

`summarize` #

julia> R"summarize(data, avSL = mean(SepalLength), avSW = mean(SepalWidth))"
RObject{VecSxp}
      avSL     avSW
1 5.843333 3.057333

In Julia:

julia> combine(data, [:SepalLength, :SepalWidth] .=> mean .=> [:avSL, :avSW])
1×2 DataFrame
 Row │ avSL     avSW
     │ Float64  Float64
─────┼──────────────────
   1 │ 5.84333  3.05733

There are two important things to note. Firstly, broadcasting is mandatory. Secondly, the syntax allows for one factoring of the function in the middle.

Now let us summarize by different functions:

julia> combine(data, [:SepalLength, :SepalWidth] .=> [maximum, minimum] .=>
               [:maxSL, :minSW])
1×2 DataFrame
 Row │ maxSL    minSW
     │ Float64  Float64
─────┼──────────────────
   1 │     7.9      2.0

And with anonymous functions:

julia> combine(data, [:SepalLength, :SepalWidth] .=> 
               [x->sum(x./100), x->sum(sqrt.(x))] .=> [:x1, :x2])
1×2 DataFrame
 Row │ x1       x2
     │ Float64  Float64
─────┼──────────────────
   1 │   8.765  261.619

The r equivalent would be:

julia> R"""
       summarize(data, x1=sum(SepalLength/100), x2=sum(sqrt(SepalWidth)))
       """
RObject{VecSxp}
     x1       x2
1 8.765 261.6187

The reason that r is often-times nicer is quotation. With DataFrames.jl on the other hand, the idea is to simply pass the actual data and data identifiers around. This means greater flexibility but slightly more verbose syntax.

For instance, consider summarizing by matching column names:

julia> combine(data, filter(x->occursin.("Sepal", x), names(data)) .=> mean)
1×2 DataFrame
 Row │ SepalLength_mean  SepalWidth_mean
     │ Float64           Float64
─────┼───────────────────────────────────
   1 │          5.84333          3.05733

In R, one needs to explicitly use other function to escape the quoting procedure:

julia> R"""
       summarize(data, across(contains("Sepal"), mean))
       """
RObject{VecSxp}
  SepalLength SepalWidth
1    5.843333   3.057333

Both have their ups and downs. Basically, the field of nonstandard evaluation will require its own functions, whereas, by avoiding that, one can rely on existing functions or at least only slightly modified functions (based on multiple dispatch for instance) to arrive at the desired results.

`count` #

data = dataset("ggplot2", "mpg")

@rput data;

#

julia> R"""
       data <- as_tibble(data)
       count(data, Cyl, Drv)
       """
RObject{VecSxp}
# A tibble: 9 x 3
    Cyl Drv       n
  <int> <fct> <int>
1     4 4        23
2     4 f        58
3     5 f         4
4     6 4        32
5     6 f        43
6     6 r         4
7     8 4        48
8     8 f         1
9     8 r        21

Any function that operates on (sub)dataframes can be used immediately (nrow here):

julia> combine(groupby(data, [:Cyl, :Drv], sort = true), nrow)
9×3 DataFrame
 Row │ Cyl    Drv   nrow
     │ Int32  Cat…  Int64
─────┼────────────────────
   1 │     4  4        23
   2 │     4  f        58
   3 │     5  f         4
   4 │     6  4        32
   5 │     6  f        43
   6 │     6  r         4
   7 │     8  4        48
   8 │     8  f         1
   9 │     8  r        21

This was a proper interlude to the next topic, grouping.

Group Cases #

Ungroup #

The general grouping operation in DataFrames.jl is as above, whereby a “sub” dataframe is recognized, perhaps not overly different from views in arrays.

Ungrouping is done rather automatically. The default value of ungroup in the transformation functions is true.

julia> df = groupby(data, [:Cyl, :Drv], sort = true)
GroupedDataFrame with 9 groups based on keys: Cyl, Drv
First Group (23 rows): Cyl = 4, Drv = CategoricalArrays.CategoricalValue{String, UInt8} "4"
 Row │ Manufacturer  Model              Displ    Year   Cyl    Trans         D ⋯
     │ Categorical…  Categorical…       Float64  Int32  Int32  Categorical…  C ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi          a4 quattro             1.8   1999      4  manual(m5)    4 ⋯
   2 │ audi          a4 quattro             1.8   1999      4  auto(l5)      4
   3 │ audi          a4 quattro             2.0   2008      4  manual(m6)    4
   4 │ audi          a4 quattro             2.0   2008      4  auto(s6)      4
   5 │ subaru        forester awd           2.5   1999      4  manual(m5)    4 ⋯
   6 │ subaru        forester awd           2.5   1999      4  auto(l4)      4
   7 │ subaru        forester awd           2.5   2008      4  manual(m5)    4
   8 │ subaru        forester awd           2.5   2008      4  manual(m5)    4
  ⋮  │      ⋮                ⋮             ⋮       ⋮      ⋮         ⋮          ⋱
  16 │ subaru        impreza awd            2.5   2008      4  auto(s4)      4 ⋯
  17 │ subaru        impreza awd            2.5   2008      4  manual(m5)    4
  18 │ subaru        impreza awd            2.5   2008      4  manual(m5)    4
  19 │ toyota        4runner 4wd            2.7   1999      4  manual(m5)    4
  20 │ toyota        4runner 4wd            2.7   1999      4  auto(l4)      4 ⋯
  21 │ toyota        toyota tacoma 4wd      2.7   1999      4  manual(m5)    4
  22 │ toyota        toyota tacoma 4wd      2.7   1999      4  auto(l4)      4
  23 │ toyota        toyota tacoma 4wd      2.7   2008      4  manual(m5)    4
                                                    5 columns and 7 rows omitted
⋮
Last Group (21 rows): Cyl = 8, Drv = CategoricalArrays.CategoricalValue{String, UInt8} "r"
 Row │ Manufacturer  Model               Displ    Year   Cyl    Trans          ⋯
     │ Categorical…  Categorical…        Float64  Int32  Int32  Categorical…   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ chevrolet     c1500 suburban 2wd      5.3   2008      8  auto(l4)       ⋯
   2 │ chevrolet     c1500 suburban 2wd      5.3   2008      8  auto(l4)
   3 │ chevrolet     c1500 suburban 2wd      5.3   2008      8  auto(l4)
   4 │ chevrolet     c1500 suburban 2wd      5.7   1999      8  auto(l4)
   5 │ chevrolet     c1500 suburban 2wd      6.0   2008      8  auto(l4)       ⋯
   6 │ chevrolet     corvette                5.7   1999      8  manual(m6)
   7 │ chevrolet     corvette                5.7   1999      8  auto(l4)
   8 │ chevrolet     corvette                6.2   2008      8  manual(m6)
  ⋮  │      ⋮                ⋮              ⋮       ⋮      ⋮         ⋮         ⋱
  14 │ ford          mustang                 4.6   1999      8  auto(l4)       ⋯
  15 │ ford          mustang                 4.6   1999      8  manual(m5)
  16 │ ford          mustang                 4.6   2008      8  manual(m5)
  17 │ ford          mustang                 4.6   2008      8  auto(l5)
  18 │ ford          mustang                 5.4   2008      8  manual(m6)     ⋯
  19 │ lincoln       navigator 2wd           5.4   1999      8  auto(l4)
  20 │ lincoln       navigator 2wd           5.4   1999      8  auto(l4)
  21 │ lincoln       navigator 2wd           5.4   2008      8  auto(l6)
                                                    5 columns and 5 rows omitted

julia> df2 = combine(df, :)
234×11 DataFrame
 Row │ Cyl    Drv   Manufacturer  Model           Displ    Year   Trans        ⋯
     │ Int32  Cat…  Cat…          Cat…            Float64  Int32  Cat…         ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     4  4     audi          a4 quattro          1.8   1999  manual(m5)   ⋯
   2 │     4  4     audi          a4 quattro          1.8   1999  auto(l5)
   3 │     4  4     audi          a4 quattro          2.0   2008  manual(m6)
   4 │     4  4     audi          a4 quattro          2.0   2008  auto(s6)
   5 │     4  4     subaru        forester awd        2.5   1999  manual(m5)   ⋯
   6 │     4  4     subaru        forester awd        2.5   1999  auto(l4)
   7 │     4  4     subaru        forester awd        2.5   2008  manual(m5)
   8 │     4  4     subaru        forester awd        2.5   2008  manual(m5)
  ⋮  │   ⋮     ⋮         ⋮              ⋮            ⋮       ⋮        ⋮        ⋱
 228 │     8  r     ford          mustang             4.6   1999  manual(m5)   ⋯
 229 │     8  r     ford          mustang             4.6   2008  manual(m5)
 230 │     8  r     ford          mustang             4.6   2008  auto(l5)
 231 │     8  r     ford          mustang             5.4   2008  manual(m6)
 232 │     8  r     lincoln       navigator 2wd       5.4   1999  auto(l4)     ⋯
 233 │     8  r     lincoln       navigator 2wd       5.4   1999  auto(l4)
 234 │     8  r     lincoln       navigator 2wd       5.4   2008  auto(l6)
                                                  4 columns and 219 rows omitted

The colon indicates that all items should be retained (whether columns or rows, depending on the specific argument it is used in place of). Here, it means to keep all rows and columns. However, since combine does ungrouping by default, that means retaining the original content of the data:

julia> nrow(df2) == nrow(data)
true

It is important to note, however, that presently, grouping changes the row order.

julia> df3 = combine(data, :)
234×11 DataFrame
 Row │ Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    ⋯
     │ Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi          a4                1.8   1999      4  auto(l5)      f      ⋯
   2 │ audi          a4                1.8   1999      4  manual(m5)    f
   3 │ audi          a4                2.0   2008      4  manual(m6)    f
   4 │ audi          a4                2.0   2008      4  auto(av)      f
   5 │ audi          a4                2.8   1999      6  auto(l5)      f      ⋯
   6 │ audi          a4                2.8   1999      6  manual(m5)    f
   7 │ audi          a4                3.1   2008      6  auto(av)      f
   8 │ audi          a4 quattro        1.8   1999      4  manual(m5)    4
  ⋮  │      ⋮             ⋮           ⋮       ⋮      ⋮         ⋮             ⋮ ⋱
 228 │ volkswagen    passat            1.8   1999      4  manual(m5)    f      ⋯
 229 │ volkswagen    passat            1.8   1999      4  auto(l5)      f
 230 │ volkswagen    passat            2.0   2008      4  auto(s6)      f
 231 │ volkswagen    passat            2.0   2008      4  manual(m6)    f
 232 │ volkswagen    passat            2.8   1999      6  auto(l5)      f      ⋯
 233 │ volkswagen    passat            2.8   1999      6  manual(m5)    f
 234 │ volkswagen    passat            3.6   2008      6  auto(s6)      f
                                                  5 columns and 219 rows omitted

julia> df3 == data
true

whereas:

julia> df2 == data
false

In r, use of the ungroup function is needed.

Row-wise operation #

Row-wise is, conceptually, setting each row as its own group. Nevertheless, oftentimes we do not need to think this way.

Say you want to know the minimum between 1.5 times the city miles per gallon and highway miles per galon for each model. In r, it would require this:

julia> R"""
       data %>% rowwise() %>% mutate(maxy = max(1.5*Cty, Hwy)) %>% 
           select(maxy, everything())
       """
RObject{VecSxp}
# A tibble: 234 x 12
# Rowwise: 
    maxy Manufacturer Model    Displ  Year   Cyl Trans   Drv     Cty   Hwy Fl   
   <dbl> <fct>        <fct>    <dbl> <int> <int> <fct>   <fct> <int> <int> <fct>
 1  29   audi         a4         1.8  1999     4 auto(l… f        18    29 p    
 2  31.5 audi         a4         1.8  1999     4 manual… f        21    29 p    
 3  31   audi         a4         2    2008     4 manual… f        20    31 p    
 4  31.5 audi         a4         2    2008     4 auto(a… f        21    30 p    
 5  26   audi         a4         2.8  1999     6 auto(l… f        16    26 p    
 6  27   audi         a4         2.8  1999     6 manual… f        18    26 p    
 7  27   audi         a4         3.1  2008     6 auto(a… f        18    27 p    
 8  27   audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p    
 9  25   audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p    
10  30   audi         a4 quat…   2    2008     4 manual… 4        20    28 p    
# … with 224 more rows, and 1 more variable: Class <fct>

In Julia, it is simply a matter of specifying the operation itself to rowwise:

julia> transform(data, [:Cty, :Hwy] => ByRow((x,y)->max(1.5*x,y)) => :maxy) |> 
       x->x[:, Cols([:maxy], :)]
234×12 DataFrame
 Row │ maxy     Manufacturer  Model         Displ    Year   Cyl    Trans       ⋯
     │ Float64  Categorical…  Categorical…  Float64  Int32  Int32  Categorical ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    29.0  audi          a4                1.8   1999      4  auto(l5)    ⋯
   2 │    31.5  audi          a4                1.8   1999      4  manual(m5)
   3 │    31.0  audi          a4                2.0   2008      4  manual(m6)
   4 │    31.5  audi          a4                2.0   2008      4  auto(av)
   5 │    26.0  audi          a4                2.8   1999      6  auto(l5)    ⋯
   6 │    27.0  audi          a4                2.8   1999      6  manual(m5)
   7 │    27.0  audi          a4                3.1   2008      6  auto(av)
   8 │    27.0  audi          a4 quattro        1.8   1999      4  manual(m5)
  ⋮  │    ⋮          ⋮             ⋮           ⋮       ⋮      ⋮         ⋮      ⋱
 228 │    31.5  volkswagen    passat            1.8   1999      4  manual(m5)  ⋯
 229 │    29.0  volkswagen    passat            1.8   1999      4  auto(l5)
 230 │    28.5  volkswagen    passat            2.0   2008      4  auto(s6)
 231 │    31.5  volkswagen    passat            2.0   2008      4  manual(m6)
 232 │    26.0  volkswagen    passat            2.8   1999      6  auto(l5)    ⋯
 233 │    27.0  volkswagen    passat            2.8   1999      6  manual(m5)
 234 │    26.0  volkswagen    passat            3.6   2008      6  auto(s6)
                                                  6 columns and 219 rows omitted

Manipulate Cases #

Filter #

Filtering essentially uses the same principles as applied to any other Julia data structure, and can even use the conventional filter function for this purpose:

julia> filter(:Cty => x->(x>20), data)[:, Cols(:Cty,:)]
45×11 DataFrame
 Row │ Cty    Manufacturer  Model         Displ    Year   Cyl    Trans         ⋯
     │ Int32  Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    21  audi          a4                1.8   1999      4  manual(m5)    ⋯
   2 │    21  audi          a4                2.0   2008      4  auto(av)
   3 │    22  chevrolet     malibu            2.4   2008      4  auto(l4)
   4 │    28  honda         civic             1.6   1999      4  manual(m5)
   5 │    24  honda         civic             1.6   1999      4  auto(l4)      ⋯
   6 │    25  honda         civic             1.6   1999      4  manual(m5)
   7 │    23  honda         civic             1.6   1999      4  manual(m5)
   8 │    24  honda         civic             1.6   1999      4  auto(l4)
  ⋮  │   ⋮         ⋮             ⋮           ⋮       ⋮      ⋮         ⋮        ⋱
  39 │    21  volkswagen    jetta             2.5   2008      5  auto(s6)      ⋯
  40 │    21  volkswagen    jetta             2.5   2008      5  manual(m5)
  41 │    35  volkswagen    new beetle        1.9   1999      4  manual(m5)
  42 │    29  volkswagen    new beetle        1.9   1999      4  auto(l4)
  43 │    21  volkswagen    new beetle        2.0   1999      4  manual(m5)    ⋯
  44 │    21  volkswagen    passat            1.8   1999      4  manual(m5)
  45 │    21  volkswagen    passat            2.0   2008      4  manual(m6)
                                                   4 columns and 30 rows omitted

Julia data collection functions usually take a function as the first argument in order to facilitate the use of do blocks. This doesn’t look nice when using piping. Therefore, my suggestion is to stick with the subset function. Besides, apparently, filter doesn’t work on data groups.

julia> subset(data, :Cty => x->(x.>20))[:, Cols(:Cty,:)]
45×11 DataFrame
 Row │ Cty    Manufacturer  Model         Displ    Year   Cyl    Trans         ⋯
     │ Int32  Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    21  audi          a4                1.8   1999      4  manual(m5)    ⋯
   2 │    21  audi          a4                2.0   2008      4  auto(av)
   3 │    22  chevrolet     malibu            2.4   2008      4  auto(l4)
   4 │    28  honda         civic             1.6   1999      4  manual(m5)
   5 │    24  honda         civic             1.6   1999      4  auto(l4)      ⋯
   6 │    25  honda         civic             1.6   1999      4  manual(m5)
   7 │    23  honda         civic             1.6   1999      4  manual(m5)
   8 │    24  honda         civic             1.6   1999      4  auto(l4)
  ⋮  │   ⋮         ⋮             ⋮           ⋮       ⋮      ⋮         ⋮        ⋱
  39 │    21  volkswagen    jetta             2.5   2008      5  auto(s6)      ⋯
  40 │    21  volkswagen    jetta             2.5   2008      5  manual(m5)
  41 │    35  volkswagen    new beetle        1.9   1999      4  manual(m5)
  42 │    29  volkswagen    new beetle        1.9   1999      4  auto(l4)
  43 │    21  volkswagen    new beetle        2.0   1999      4  manual(m5)    ⋯
  44 │    21  volkswagen    passat            1.8   1999      4  manual(m5)
  45 │    21  volkswagen    passat            2.0   2008      4  manual(m6)
                                                   4 columns and 30 rows omitted

Distinct #

julia> unique(data, [:Cyl, :Drv])
9×11 DataFrame
 Row │ Manufacturer  Model               Displ    Year   Cyl    Trans          ⋯
     │ Categorical…  Categorical…        Float64  Int32  Int32  Categorical…   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi          a4                      1.8   1999      4  auto(l5)       ⋯
   2 │ audi          a4                      2.8   1999      6  auto(l5)
   3 │ audi          a4 quattro              1.8   1999      4  manual(m5)
   4 │ audi          a4 quattro              2.8   1999      6  auto(l5)
   5 │ audi          a6 quattro              4.2   2008      8  auto(s6)       ⋯
   6 │ chevrolet     c1500 suburban 2wd      5.3   2008      8  auto(l4)
   7 │ ford          mustang                 3.8   1999      6  manual(m5)
   8 │ pontiac       grand prix              5.3   2008      8  auto(s4)
   9 │ volkswagen    jetta                   2.5   2008      5  auto(s6)       ⋯
                                                               5 columns omitted

julia> R"""
       distinct(data, Cyl, Drv)
       """
RObject{VecSxp}
# A tibble: 9 x 2
    Cyl Drv  
  <int> <fct>
1     4 f    
2     6 f    
3     4 4    
4     6 4    
5     8 4    
6     8 r    
7     6 r    
8     8 f    
9     5 f

Slice #

The point of a function such as slice is to be used in pipes really. Otherwise one can simply do this:

julia> R"""
       data[4:10,]
       """
RObject{VecSxp}
# A tibble: 7 x 11
  Manufacturer Model    Displ  Year   Cyl Trans   Drv     Cty   Hwy Fl    Class 
  <fct>        <fct>    <dbl> <int> <int> <fct>   <fct> <int> <int> <fct> <fct> 
1 audi         a4         2    2008     4 auto(a… f        21    30 p     compa…
2 audi         a4         2.8  1999     6 auto(l… f        16    26 p     compa…
3 audi         a4         2.8  1999     6 manual… f        18    26 p     compa…
4 audi         a4         3.1  2008     6 auto(a… f        18    27 p     compa…
5 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     compa…
6 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     compa…
7 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     compa…

Same with julia:

julia> data[4:10,:]
7×11 DataFrame
 Row │ Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    ⋯
     │ Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi          a4                2.0   2008      4  auto(av)      f      ⋯
   2 │ audi          a4                2.8   1999      6  auto(l5)      f
   3 │ audi          a4                2.8   1999      6  manual(m5)    f
   4 │ audi          a4                3.1   2008      6  auto(av)      f
   5 │ audi          a4 quattro        1.8   1999      4  manual(m5)    4      ⋯
   6 │ audi          a4 quattro        1.8   1999      4  auto(l5)      4
   7 │ audi          a4 quattro        2.0   2008      4  manual(m6)    4
                                                               5 columns omitted

Slice-sample #

The StatsBase module can be used here.

julia> data[rand(1:nrow(data), 10),:]
10×11 DataFrame
 Row │ Manufacturer  Model              Displ    Year   Cyl    Trans         D ⋯
     │ Categorical…  Categorical…       Float64  Int32  Int32  Categorical…  C ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ ford          explorer 4wd           4.0   1999      6  auto(l5)      4 ⋯
   2 │ nissan        altima                 2.4   1999      4  auto(l4)      f
   3 │ volkswagen    new beetle             1.9   1999      4  manual(m5)    f
   4 │ hyundai       tiburon                2.7   2008      6  manual(m5)    f
   5 │ toyota        toyota tacoma 4wd      2.7   1999      4  manual(m5)    4 ⋯
   6 │ chevrolet     corvette               6.2   2008      8  manual(m6)    r
   7 │ hyundai       sonata                 3.3   2008      6  auto(l5)      f
   8 │ dodge         durango 4wd            5.9   1999      8  auto(l4)      4
   9 │ ford          f150 pickup 4wd        4.6   1999      8  manual(m5)    4 ⋯
  10 │ volkswagen    new beetle             2.5   2008      5  auto(s6)      f
                                                               5 columns omitted

Slice min and max #

This is just a convenience function:

julia> data[data.Cty .>= quantile(data.Cty, 0.75),:]
76×11 DataFrame
 Row │ Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    ⋯
     │ Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi          a4                1.8   1999      4  manual(m5)    f      ⋯
   2 │ audi          a4                2.0   2008      4  manual(m6)    f
   3 │ audi          a4                2.0   2008      4  auto(av)      f
   4 │ audi          a4 quattro        2.0   2008      4  manual(m6)    4
   5 │ audi          a4 quattro        2.0   2008      4  auto(s6)      4      ⋯
   6 │ chevrolet     malibu            2.4   1999      4  auto(l4)      f
   7 │ chevrolet     malibu            2.4   2008      4  auto(l4)      f
   8 │ honda         civic             1.6   1999      4  manual(m5)    f
  ⋮  │      ⋮             ⋮           ⋮       ⋮      ⋮         ⋮             ⋮ ⋱
  70 │ volkswagen    new beetle        2.0   1999      4  manual(m5)    f      ⋯
  71 │ volkswagen    new beetle        2.0   1999      4  auto(l4)      f
  72 │ volkswagen    new beetle        2.5   2008      5  manual(m5)    f
  73 │ volkswagen    new beetle        2.5   2008      5  auto(s6)      f
  74 │ volkswagen    passat            1.8   1999      4  manual(m5)    f      ⋯
  75 │ volkswagen    passat            2.0   2008      4  auto(s6)      f
  76 │ volkswagen    passat            2.0   2008      4  manual(m6)    f
                                                   5 columns and 61 rows omitted

The R code:

julia> R"""
       slice_max(data, Cty, prop = 0.25)
       """
RObject{VecSxp}
# A tibble: 76 x 11
   Manufacturer Model   Displ  Year   Cyl Trans   Drv     Cty   Hwy Fl    Class 
   <fct>        <fct>   <dbl> <int> <int> <fct>   <fct> <int> <int> <fct> <fct> 
 1 volkswagen   new be…   1.9  1999     4 manual… f        35    44 d     subco…
 2 volkswagen   jetta     1.9  1999     4 manual… f        33    44 d     compa…
 3 volkswagen   new be…   1.9  1999     4 auto(l… f        29    41 d     subco…
 4 honda        civic     1.6  1999     4 manual… f        28    33 r     subco…
 5 toyota       corolla   1.8  2008     4 manual… f        28    37 r     compa…
 6 honda        civic     1.8  2008     4 manual… f        26    34 r     subco…
 7 toyota       corolla   1.8  1999     4 manual… f        26    35 r     compa…
 8 toyota       corolla   1.8  2008     4 auto(l… f        26    35 r     compa…
 9 honda        civic     1.6  1999     4 manual… f        25    32 r     subco…
10 honda        civic     1.8  2008     4 auto(l… f        25    36 r     subco…
# … with 66 more rows

Slice head and tail #

again, mostly meaningful for pipes. Nevertheless:

julia> data[end-5:end,:]
6×11 DataFrame
 Row │ Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    ⋯
     │ Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ volkswagen    passat            1.8   1999      4  auto(l5)      f      ⋯
   2 │ volkswagen    passat            2.0   2008      4  auto(s6)      f
   3 │ volkswagen    passat            2.0   2008      4  manual(m6)    f
   4 │ volkswagen    passat            2.8   1999      6  auto(l5)      f
   5 │ volkswagen    passat            2.8   1999      6  manual(m5)    f      ⋯
   6 │ volkswagen    passat            3.6   2008      6  auto(s6)      f
                                                               5 columns omitted

Arrange #

Instead of arrange, sort is used. The syntax is very clear:

julia> sort(data, [order(:Year, rev=true), :Displ])
234×11 DataFrame
 Row │ Manufacturer  Model                Displ    Year   Cyl    Trans         ⋯
     │ Categorical…  Categorical…         Float64  Int32  Int32  Categorical…  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ honda         civic                    1.8   2008      4  manual(m5)    ⋯
   2 │ honda         civic                    1.8   2008      4  auto(l5)
   3 │ honda         civic                    1.8   2008      4  auto(l5)
   4 │ toyota        corolla                  1.8   2008      4  manual(m5)
   5 │ toyota        corolla                  1.8   2008      4  auto(l4)      ⋯
   6 │ audi          a4                       2.0   2008      4  manual(m6)
   7 │ audi          a4                       2.0   2008      4  auto(av)
   8 │ audi          a4 quattro               2.0   2008      4  manual(m6)
  ⋮  │      ⋮                 ⋮              ⋮       ⋮      ⋮         ⋮        ⋱
 228 │ chevrolet     c1500 suburban 2wd       5.7   1999      8  auto(l4)      ⋯
 229 │ chevrolet     corvette                 5.7   1999      8  manual(m6)
 230 │ chevrolet     corvette                 5.7   1999      8  auto(l4)
 231 │ chevrolet     k1500 tahoe 4wd          5.7   1999      8  auto(l4)
 232 │ dodge         durango 4wd              5.9   1999      8  auto(l4)      ⋯
 233 │ dodge         ram 1500 pickup 4wd      5.9   1999      8  auto(l4)
 234 │ chevrolet     k1500 tahoe 4wd          6.5   1999      8  auto(l4)
                                                  5 columns and 219 rows omitted

julia> R"""
       arrange(data, desc(Year), Displ)
       """
RObject{VecSxp}
# A tibble: 234 x 11
   Manufacturer Model   Displ  Year   Cyl Trans   Drv     Cty   Hwy Fl    Class 
   <fct>        <fct>   <dbl> <int> <int> <fct>   <fct> <int> <int> <fct> <fct> 
 1 honda        civic     1.8  2008     4 manual… f        26    34 r     subco…
 2 honda        civic     1.8  2008     4 auto(l… f        25    36 r     subco…
 3 honda        civic     1.8  2008     4 auto(l… f        24    36 c     subco…
 4 toyota       corolla   1.8  2008     4 manual… f        28    37 r     compa…
 5 toyota       corolla   1.8  2008     4 auto(l… f        26    35 r     compa…
 6 audi         a4        2    2008     4 manual… f        20    31 p     compa…
 7 audi         a4        2    2008     4 auto(a… f        21    30 p     compa…
 8 audi         a4 qua…   2    2008     4 manual… 4        20    28 p     compa…
 9 audi         a4 qua…   2    2008     4 auto(s… 4        19    27 p     compa…
10 honda        civic     2    2008     4 manual… f        21    29 p     subco…
# … with 224 more rows

One can in fact use any kind of sorting. See the sorting section in DataFrames.jl.

Adding row at position #

Again, this is just a convenience function. Let’s add a duplicate row for instance:

julia> vcat(data, DataFrame(data[5,:]))
235×11 DataFrame
 Row │ Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    ⋯
     │ Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi          a4                1.8   1999      4  auto(l5)      f      ⋯
   2 │ audi          a4                1.8   1999      4  manual(m5)    f
   3 │ audi          a4                2.0   2008      4  manual(m6)    f
   4 │ audi          a4                2.0   2008      4  auto(av)      f
   5 │ audi          a4                2.8   1999      6  auto(l5)      f      ⋯
   6 │ audi          a4                2.8   1999      6  manual(m5)    f
   7 │ audi          a4                3.1   2008      6  auto(av)      f
   8 │ audi          a4 quattro        1.8   1999      4  manual(m5)    4
  ⋮  │      ⋮             ⋮           ⋮       ⋮      ⋮         ⋮             ⋮ ⋱
 229 │ volkswagen    passat            1.8   1999      4  auto(l5)      f      ⋯
 230 │ volkswagen    passat            2.0   2008      4  auto(s6)      f
 231 │ volkswagen    passat            2.0   2008      4  manual(m6)    f
 232 │ volkswagen    passat            2.8   1999      6  auto(l5)      f
 233 │ volkswagen    passat            2.8   1999      6  manual(m5)    f      ⋯
 234 │ volkswagen    passat            3.6   2008      6  auto(s6)      f
 235 │ audi          a4                2.8   1999      6  auto(l5)      f
                                                  5 columns and 220 rows omitted

The issue is that specifying the position at which to insert the row can take extra work. See also append! and push! with the catch being modification of the original dataframe.

Extract variables #

Conventional indexing can be used:

pull #

One can just use the fact that the dataframe is a collection of vectors in a dictionary-like relation:

julia> data.Hwy
234-element Vector{Int32}:
 29
 29
 31
 30
 26
 26
 27
 26
 25
 28
  ⋮
 28
 29
 29
 29
 28
 29
 26
 26
 26

julia> R"pull(data, Hwy)"
RObject{IntSxp}
  [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17 26 23
 [26] 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21 23 23 19 18
 [51] 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16 17 15 17
 [76] 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25 26 24 21 22 23 22 20 33
[101] 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20
[126] 17 12 19 18 14 15 18 18 15 17 16 18 17 19 19 17 29 27 31 32 27 26 26 25 25
[151] 17 17 20 18 26 26 27 28 25 25 24 27 25 26 23 26 26 26 26 25 27 25 27 20 20
[176] 19 17 20 17 29 27 31 31 26 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18
[201] 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26
[226] 28 29 29 29 28 29 26 26 26

select #

julia> data[:, [:Hwy, :Cty]]
234×2 DataFrame
 Row │ Hwy    Cty
     │ Int32  Int32
─────┼──────────────
   1 │    29     18
   2 │    29     21
   3 │    31     20
   4 │    30     21
   5 │    26     16
   6 │    26     18
   7 │    27     18
   8 │    26     18
  ⋮  │   ⋮      ⋮
 228 │    29     21
 229 │    29     18
 230 │    28     19
 231 │    29     21
 232 │    26     16
 233 │    26     18
 234 │    26     17
    219 rows omitted

Note that when selecting only a single column, it is no different than the dot syntax and the return structure is not dataframe:

julia> data[:, :Hwy]
234-element Vector{Int32}:
 29
 29
 31
 30
 26
 26
 27
 26
 25
 28
  ⋮
 28
 29
 29
 29
 28
 29
 26
 26
 26

Of course, wrapping in vector the column symbol gives the desired result:

julia> data[:, [:Hwy]]
234×1 DataFrame
 Row │ Hwy
     │ Int32
─────┼───────
   1 │    29
   2 │    29
   3 │    31
   4 │    30
   5 │    26
   6 │    26
   7 │    27
   8 │    26
  ⋮  │   ⋮
 228 │    29
 229 │    29
 230 │    28
 231 │    29
 232 │    26
 233 │    26
 234 │    26
219 rows omitted

relocate #

Just use Cols as before. For instance, to put columns starting with m at front:

julia> data[:, Cols(filter(x->occursin(r"^[mM].*", x), names(data)),
           :Year, :Class, :)]
234×11 DataFrame
 Row │ Manufacturer  Model         Year   Class         Displ    Cyl    Trans  ⋯
     │ Categorical…  Categorical…  Int32  Categorical…  Float64  Int32  Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi          a4             1999  compact           1.8      4  auto(l ⋯
   2 │ audi          a4             1999  compact           1.8      4  manual
   3 │ audi          a4             2008  compact           2.0      4  manual
   4 │ audi          a4             2008  compact           2.0      4  auto(a
   5 │ audi          a4             1999  compact           2.8      6  auto(l ⋯
   6 │ audi          a4             1999  compact           2.8      6  manual
   7 │ audi          a4             2008  compact           3.1      6  auto(a
   8 │ audi          a4 quattro     1999  compact           1.8      4  manual
  ⋮  │      ⋮             ⋮          ⋮         ⋮           ⋮       ⋮         ⋮ ⋱
 228 │ volkswagen    passat         1999  midsize           1.8      4  manual ⋯
 229 │ volkswagen    passat         1999  midsize           1.8      4  auto(l
 230 │ volkswagen    passat         2008  midsize           2.0      4  auto(s
 231 │ volkswagen    passat         2008  midsize           2.0      4  manual
 232 │ volkswagen    passat         1999  midsize           2.8      6  auto(l ⋯
 233 │ volkswagen    passat         1999  midsize           2.8      6  manual
 234 │ volkswagen    passat         2008  midsize           3.6      6  auto(s
                                                  5 columns and 219 rows omitted

To put Model after Year:

julia> data[:, Cols( setdiff(propertynames(data), [:Model]) |> x->insert!(x,
           findfirst(x .== :Year) + 1, :Model))]
234×11 DataFrame
 Row │ Manufacturer  Displ    Year   Model         Cyl    Trans         Drv    ⋯
     │ Categorical…  Float64  Int32  Categorical…  Int32  Categorical…  Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi              1.8   1999  a4                4  auto(l5)      f      ⋯
   2 │ audi              1.8   1999  a4                4  manual(m5)    f
   3 │ audi              2.0   2008  a4                4  manual(m6)    f
   4 │ audi              2.0   2008  a4                4  auto(av)      f
   5 │ audi              2.8   1999  a4                6  auto(l5)      f      ⋯
   6 │ audi              2.8   1999  a4                6  manual(m5)    f
   7 │ audi              3.1   2008  a4                6  auto(av)      f
   8 │ audi              1.8   1999  a4 quattro        4  manual(m5)    4
  ⋮  │      ⋮           ⋮       ⋮         ⋮          ⋮         ⋮             ⋮ ⋱
 228 │ volkswagen        1.8   1999  passat            4  manual(m5)    f      ⋯
 229 │ volkswagen        1.8   1999  passat            4  auto(l5)      f
 230 │ volkswagen        2.0   2008  passat            4  auto(s6)      f
 231 │ volkswagen        2.0   2008  passat            4  manual(m6)    f
 232 │ volkswagen        2.8   1999  passat            6  auto(l5)      f      ⋯
 233 │ volkswagen        2.8   1999  passat            6  manual(m5)    f
 234 │ volkswagen        3.6   2008  passat            6  auto(s6)      f
                                                  5 columns and 219 rows omitted

julia> R"""
       relocate(data, Model, .after = Year)
       """
RObject{VecSxp}
# A tibble: 234 x 11
   Manufacturer Displ  Year Model      Cyl Trans   Drv     Cty   Hwy Fl    Class
   <fct>        <dbl> <int> <fct>    <int> <fct>   <fct> <int> <int> <fct> <fct>
 1 audi           1.8  1999 a4           4 auto(l… f        18    29 p     comp…
 2 audi           1.8  1999 a4           4 manual… f        21    29 p     comp…
 3 audi           2    2008 a4           4 manual… f        20    31 p     comp…
 4 audi           2    2008 a4           4 auto(a… f        21    30 p     comp…
 5 audi           2.8  1999 a4           6 auto(l… f        16    26 p     comp…
 6 audi           2.8  1999 a4           6 manual… f        18    26 p     comp…
 7 audi           3.1  2008 a4           6 auto(a… f        18    27 p     comp…
 8 audi           1.8  1999 a4 quat…     4 manual… 4        18    26 p     comp…
 9 audi           1.8  1999 a4 quat…     4 auto(l… 4        16    25 p     comp…
10 audi           2    2008 a4 quat…     4 manual… 4        20    28 p     comp…
# … with 224 more rows

Currently, there aren’t any helper functions for selection, though the general procedure isn’t that verbose in base Julia either.

Across function #

As mentioned in the beginning, Julia provides quite flexible ways of combining data, and there is no need for extra across functions, mainly due to the employment of the broadcasting concept.

Making new variables #

mutate, and transmutate #

The mutate equivalent is transform. The select function is the transmutate equivalent, which only selects the specifided columns, and after applying the specified functions, returns only said columns.

rename #

julia> rename(data, [:Model, :Year] .=> [:type, :date])
234×11 DataFrame
 Row │ Manufacturer  type          Displ    date   Cyl    Trans         Drv    ⋯
     │ Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ audi          a4                1.8   1999      4  auto(l5)      f      ⋯
   2 │ audi          a4                1.8   1999      4  manual(m5)    f
   3 │ audi          a4                2.0   2008      4  manual(m6)    f
   4 │ audi          a4                2.0   2008      4  auto(av)      f
   5 │ audi          a4                2.8   1999      6  auto(l5)      f      ⋯
   6 │ audi          a4                2.8   1999      6  manual(m5)    f
   7 │ audi          a4                3.1   2008      6  auto(av)      f
   8 │ audi          a4 quattro        1.8   1999      4  manual(m5)    4
  ⋮  │      ⋮             ⋮           ⋮       ⋮      ⋮         ⋮             ⋮ ⋱
 228 │ volkswagen    passat            1.8   1999      4  manual(m5)    f      ⋯
 229 │ volkswagen    passat            1.8   1999      4  auto(l5)      f
 230 │ volkswagen    passat            2.0   2008      4  auto(s6)      f
 231 │ volkswagen    passat            2.0   2008      4  manual(m6)    f
 232 │ volkswagen    passat            2.8   1999      6  auto(l5)      f      ⋯
 233 │ volkswagen    passat            2.8   1999      6  manual(m5)    f
 234 │ volkswagen    passat            3.6   2008      6  auto(s6)      f
                                                  5 columns and 219 rows omitted

This concludes the first page of the cheatsheet.