Skip to main content

dplyr Cheatsheet to DataFrames.jl Page 1

·25 mins

Introduction #

With the release of v1.0 of the DataFrames.jl package, it would seem appropriate to introduce a rather comprehensive cheatsheet of it. One that is of special use to people who come from tidyverse (arguably, the best data transformation syntax there is for combined expressiveness and brevity).

The order of topics is the same as dplyr cheatsheet.

using RDatasets, RCall, DataFrames, StatsBase, InteractiveErrors

R"library(dplyr)";

Setting up the data using RCall #

data = dataset("datasets", "iris")

@rput data;

Summarize Cases #

summarize #

julia> R"summarize(data, avSL = mean(SepalLength), avSW = mean(SepalWidth))"
RObject{VecSxp}
      avSL     avSW
1 5.843333 3.057333

In Julia:

julia> combine(data, [:SepalLength, :SepalWidth] .=> mean .=> [:avSL, :avSW])
1×2 DataFrame
 Row  avSL     avSW
      Float64  Float64
─────┼──────────────────
   1  5.84333  3.05733

There are two important things to note. Firstly, broadcasting is mandatory. Secondly, the syntax allows for one factoring of the function in the middle.

Now let us summarize by different functions:

julia> combine(data, [:SepalLength, :SepalWidth] .=> [maximum, minimum] .=>
               [:maxSL, :minSW])
1×2 DataFrame
 Row  maxSL    minSW
      Float64  Float64
─────┼──────────────────
   1      7.9      2.0

And with anonymous functions:

julia> combine(data, [:SepalLength, :SepalWidth] .=> 
               [x->sum(x./100), x->sum(sqrt.(x))] .=> [:x1, :x2])
1×2 DataFrame
 Row  x1       x2
      Float64  Float64
─────┼──────────────────
   1    8.765  261.619

The r equivalent would be:

julia> R"""
       summarize(data, x1=sum(SepalLength/100), x2=sum(sqrt(SepalWidth)))
       """
RObject{VecSxp}
     x1       x2
1 8.765 261.6187

The reason that r is often-times nicer is quotation. With DataFrames.jl on the other hand, the idea is to simply pass the actual data and data identifiers around. This means greater flexibility but slightly more verbose syntax.

For instance, consider summarizing by matching column names:

julia> combine(data, filter(x->occursin.("Sepal", x), names(data)) .=> mean)
1×2 DataFrame
 Row  SepalLength_mean  SepalWidth_mean
      Float64           Float64
─────┼───────────────────────────────────
   1           5.84333          3.05733

In R, one needs to explicitly use other function to escape the quoting procedure:

julia> R"""
       summarize(data, across(contains("Sepal"), mean))
       """
RObject{VecSxp}
  SepalLength SepalWidth
1    5.843333   3.057333

Both have their ups and downs. Basically, the field of nonstandard evaluation will require its own functions, whereas, by avoiding that, one can rely on existing functions or at least only slightly modified functions (based on multiple dispatch for instance) to arrive at the desired results.

count #

data = dataset("ggplot2", "mpg")

@rput data;

#

julia> R"""
       data <- as_tibble(data)
       count(data, Cyl, Drv)
       """
RObject{VecSxp}
# A tibble: 9 x 3
    Cyl Drv       n
  <int> <fct> <int>
1     4 4        23
2     4 f        58
3     5 f         4
4     6 4        32
5     6 f        43
6     6 r         4
7     8 4        48
8     8 f         1
9     8 r        21

Any function that operates on (sub)dataframes can be used immediately (nrow here):

julia> combine(groupby(data, [:Cyl, :Drv], sort = true), nrow)
9×3 DataFrame
 Row  Cyl    Drv   nrow
      Int32  Cat…  Int64
─────┼────────────────────
   1      4  4        23
   2      4  f        58
   3      5  f         4
   4      6  4        32
   5      6  f        43
   6      6  r         4
   7      8  4        48
   8      8  f         1
   9      8  r        21

This was a proper interlude to the next topic, grouping.

Group Cases #

Ungroup #

The general grouping operation in DataFrames.jl is as above, whereby a “sub” dataframe is recognized, perhaps not overly different from views in arrays.

Ungrouping is done rather automatically. The default value of ungroup in the transformation functions is true.

julia> df = groupby(data, [:Cyl, :Drv], sort = true)
GroupedDataFrame with 9 groups based on keys: Cyl, Drv
First Group (23 rows): Cyl = 4, Drv = CategoricalArrays.CategoricalValue{String, UInt8} "4"
 Row  Manufacturer  Model              Displ    Year   Cyl    Trans         D 
      Categorical…  Categorical…       Float64  Int32  Int32  Categorical…  C 
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi          a4 quattro             1.8   1999      4  manual(m5)    4 
   2  audi          a4 quattro             1.8   1999      4  auto(l5)      4
   3  audi          a4 quattro             2.0   2008      4  manual(m6)    4
   4  audi          a4 quattro             2.0   2008      4  auto(s6)      4
   5  subaru        forester awd           2.5   1999      4  manual(m5)    4 
   6  subaru        forester awd           2.5   1999      4  auto(l4)      4
   7  subaru        forester awd           2.5   2008      4  manual(m5)    4
   8  subaru        forester awd           2.5   2008      4  manual(m5)    4
                                                                       
  16  subaru        impreza awd            2.5   2008      4  auto(s4)      4 
  17  subaru        impreza awd            2.5   2008      4  manual(m5)    4
  18  subaru        impreza awd            2.5   2008      4  manual(m5)    4
  19  toyota        4runner 4wd            2.7   1999      4  manual(m5)    4
  20  toyota        4runner 4wd            2.7   1999      4  auto(l4)      4 
  21  toyota        toyota tacoma 4wd      2.7   1999      4  manual(m5)    4
  22  toyota        toyota tacoma 4wd      2.7   1999      4  auto(l4)      4
  23  toyota        toyota tacoma 4wd      2.7   2008      4  manual(m5)    4
                                                    5 columns and 7 rows omitted

Last Group (21 rows): Cyl = 8, Drv = CategoricalArrays.CategoricalValue{String, UInt8} "r"
 Row  Manufacturer  Model               Displ    Year   Cyl    Trans          
      Categorical…  Categorical…        Float64  Int32  Int32  Categorical…   
─────┼──────────────────────────────────────────────────────────────────────────
   1  chevrolet     c1500 suburban 2wd      5.3   2008      8  auto(l4)       
   2  chevrolet     c1500 suburban 2wd      5.3   2008      8  auto(l4)
   3  chevrolet     c1500 suburban 2wd      5.3   2008      8  auto(l4)
   4  chevrolet     c1500 suburban 2wd      5.7   1999      8  auto(l4)
   5  chevrolet     c1500 suburban 2wd      6.0   2008      8  auto(l4)       
   6  chevrolet     corvette                5.7   1999      8  manual(m6)
   7  chevrolet     corvette                5.7   1999      8  auto(l4)
   8  chevrolet     corvette                6.2   2008      8  manual(m6)
                                                                       
  14  ford          mustang                 4.6   1999      8  auto(l4)       
  15  ford          mustang                 4.6   1999      8  manual(m5)
  16  ford          mustang                 4.6   2008      8  manual(m5)
  17  ford          mustang                 4.6   2008      8  auto(l5)
  18  ford          mustang                 5.4   2008      8  manual(m6)     
  19  lincoln       navigator 2wd           5.4   1999      8  auto(l4)
  20  lincoln       navigator 2wd           5.4   1999      8  auto(l4)
  21  lincoln       navigator 2wd           5.4   2008      8  auto(l6)
                                                    5 columns and 5 rows omitted

julia> df2 = combine(df, :)
234×11 DataFrame
 Row  Cyl    Drv   Manufacturer  Model           Displ    Year   Trans        
      Int32  Cat…  Cat…          Cat…            Float64  Int32  Cat…         
─────┼──────────────────────────────────────────────────────────────────────────
   1      4  4     audi          a4 quattro          1.8   1999  manual(m5)   
   2      4  4     audi          a4 quattro          1.8   1999  auto(l5)
   3      4  4     audi          a4 quattro          2.0   2008  manual(m6)
   4      4  4     audi          a4 quattro          2.0   2008  auto(s6)
   5      4  4     subaru        forester awd        2.5   1999  manual(m5)   
   6      4  4     subaru        forester awd        2.5   1999  auto(l4)
   7      4  4     subaru        forester awd        2.5   2008  manual(m5)
   8      4  4     subaru        forester awd        2.5   2008  manual(m5)
                                                                      
 228      8  r     ford          mustang             4.6   1999  manual(m5)   
 229      8  r     ford          mustang             4.6   2008  manual(m5)
 230      8  r     ford          mustang             4.6   2008  auto(l5)
 231      8  r     ford          mustang             5.4   2008  manual(m6)
 232      8  r     lincoln       navigator 2wd       5.4   1999  auto(l4)     
 233      8  r     lincoln       navigator 2wd       5.4   1999  auto(l4)
 234      8  r     lincoln       navigator 2wd       5.4   2008  auto(l6)
                                                  4 columns and 219 rows omitted

The colon indicates that all items should be retained (whether columns or rows, depending on the specific argument it is used in place of). Here, it means to keep all rows and columns. However, since combine does ungrouping by default, that means retaining the original content of the data:

julia> nrow(df2) == nrow(data)
true

It is important to note, however, that presently, grouping changes the row order.

julia> df3 = combine(data, :)
234×11 DataFrame
 Row  Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    
      Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego 
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi          a4                1.8   1999      4  auto(l5)      f      
   2  audi          a4                1.8   1999      4  manual(m5)    f
   3  audi          a4                2.0   2008      4  manual(m6)    f
   4  audi          a4                2.0   2008      4  auto(av)      f
   5  audi          a4                2.8   1999      6  auto(l5)      f      
   6  audi          a4                2.8   1999      6  manual(m5)    f
   7  audi          a4                3.1   2008      6  auto(av)      f
   8  audi          a4 quattro        1.8   1999      4  manual(m5)    4
                                                                      
 228  volkswagen    passat            1.8   1999      4  manual(m5)    f      
 229  volkswagen    passat            1.8   1999      4  auto(l5)      f
 230  volkswagen    passat            2.0   2008      4  auto(s6)      f
 231  volkswagen    passat            2.0   2008      4  manual(m6)    f
 232  volkswagen    passat            2.8   1999      6  auto(l5)      f      
 233  volkswagen    passat            2.8   1999      6  manual(m5)    f
 234  volkswagen    passat            3.6   2008      6  auto(s6)      f
                                                  5 columns and 219 rows omitted

julia> df3 == data
true

whereas:

julia> df2 == data
false

In r, use of the ungroup function is needed.

Row-wise operation #

Row-wise is, conceptually, setting each row as its own group. Nevertheless, oftentimes we do not need to think this way.

Say you want to know the minimum between 1.5 times the city miles per gallon and highway miles per galon for each model. In r, it would require this:

julia> R"""
       data %>% rowwise() %>% mutate(maxy = max(1.5*Cty, Hwy)) %>% 
           select(maxy, everything())
       """
RObject{VecSxp}
# A tibble: 234 x 12
# Rowwise: 
    maxy Manufacturer Model    Displ  Year   Cyl Trans   Drv     Cty   Hwy Fl   
   <dbl> <fct>        <fct>    <dbl> <int> <int> <fct>   <fct> <int> <int> <fct>
 1  29   audi         a4         1.8  1999     4 auto(l… f        18    29 p    
 2  31.5 audi         a4         1.8  1999     4 manual… f        21    29 p    
 3  31   audi         a4         2    2008     4 manual… f        20    31 p    
 4  31.5 audi         a4         2    2008     4 auto(a… f        21    30 p    
 5  26   audi         a4         2.8  1999     6 auto(l… f        16    26 p    
 6  27   audi         a4         2.8  1999     6 manual… f        18    26 p    
 7  27   audi         a4         3.1  2008     6 auto(a… f        18    27 p    
 8  27   audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p    
 9  25   audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p    
10  30   audi         a4 quat…   2    2008     4 manual… 4        20    28 p    
# … with 224 more rows, and 1 more variable: Class <fct>

In Julia, it is simply a matter of specifying the operation itself to rowwise:

julia> transform(data, [:Cty, :Hwy] => ByRow((x,y)->max(1.5*x,y)) => :maxy) |> 
       x->x[:, Cols([:maxy], :)]
234×12 DataFrame
 Row  maxy     Manufacturer  Model         Displ    Year   Cyl    Trans       
      Float64  Categorical…  Categorical…  Float64  Int32  Int32  Categorical 
─────┼──────────────────────────────────────────────────────────────────────────
   1     29.0  audi          a4                1.8   1999      4  auto(l5)    
   2     31.5  audi          a4                1.8   1999      4  manual(m5)
   3     31.0  audi          a4                2.0   2008      4  manual(m6)
   4     31.5  audi          a4                2.0   2008      4  auto(av)
   5     26.0  audi          a4                2.8   1999      6  auto(l5)    
   6     27.0  audi          a4                2.8   1999      6  manual(m5)
   7     27.0  audi          a4                3.1   2008      6  auto(av)
   8     27.0  audi          a4 quattro        1.8   1999      4  manual(m5)
                                                                      
 228     31.5  volkswagen    passat            1.8   1999      4  manual(m5)  
 229     29.0  volkswagen    passat            1.8   1999      4  auto(l5)
 230     28.5  volkswagen    passat            2.0   2008      4  auto(s6)
 231     31.5  volkswagen    passat            2.0   2008      4  manual(m6)
 232     26.0  volkswagen    passat            2.8   1999      6  auto(l5)    
 233     27.0  volkswagen    passat            2.8   1999      6  manual(m5)
 234     26.0  volkswagen    passat            3.6   2008      6  auto(s6)
                                                  6 columns and 219 rows omitted

Manipulate Cases #

Filter #

Filtering essentially uses the same principles as applied to any other Julia data structure, and can even use the conventional filter function for this purpose:

julia> filter(:Cty => x->(x>20), data)[:, Cols(:Cty,:)]
45×11 DataFrame
 Row  Cty    Manufacturer  Model         Displ    Year   Cyl    Trans         
      Int32  Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  
─────┼──────────────────────────────────────────────────────────────────────────
   1     21  audi          a4                1.8   1999      4  manual(m5)    
   2     21  audi          a4                2.0   2008      4  auto(av)
   3     22  chevrolet     malibu            2.4   2008      4  auto(l4)
   4     28  honda         civic             1.6   1999      4  manual(m5)
   5     24  honda         civic             1.6   1999      4  auto(l4)      
   6     25  honda         civic             1.6   1999      4  manual(m5)
   7     23  honda         civic             1.6   1999      4  manual(m5)
   8     24  honda         civic             1.6   1999      4  auto(l4)
                                                                      
  39     21  volkswagen    jetta             2.5   2008      5  auto(s6)      
  40     21  volkswagen    jetta             2.5   2008      5  manual(m5)
  41     35  volkswagen    new beetle        1.9   1999      4  manual(m5)
  42     29  volkswagen    new beetle        1.9   1999      4  auto(l4)
  43     21  volkswagen    new beetle        2.0   1999      4  manual(m5)    
  44     21  volkswagen    passat            1.8   1999      4  manual(m5)
  45     21  volkswagen    passat            2.0   2008      4  manual(m6)
                                                   4 columns and 30 rows omitted

Julia data collection functions usually take a function as the first argument in order to facilitate the use of do blocks. This doesn’t look nice when using piping. Therefore, my suggestion is to stick with the subset function. Besides, apparently, filter doesn’t work on data groups.

julia> subset(data, :Cty => x->(x.>20))[:, Cols(:Cty,:)]
45×11 DataFrame
 Row  Cty    Manufacturer  Model         Displ    Year   Cyl    Trans         
      Int32  Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  
─────┼──────────────────────────────────────────────────────────────────────────
   1     21  audi          a4                1.8   1999      4  manual(m5)    
   2     21  audi          a4                2.0   2008      4  auto(av)
   3     22  chevrolet     malibu            2.4   2008      4  auto(l4)
   4     28  honda         civic             1.6   1999      4  manual(m5)
   5     24  honda         civic             1.6   1999      4  auto(l4)      
   6     25  honda         civic             1.6   1999      4  manual(m5)
   7     23  honda         civic             1.6   1999      4  manual(m5)
   8     24  honda         civic             1.6   1999      4  auto(l4)
                                                                      
  39     21  volkswagen    jetta             2.5   2008      5  auto(s6)      
  40     21  volkswagen    jetta             2.5   2008      5  manual(m5)
  41     35  volkswagen    new beetle        1.9   1999      4  manual(m5)
  42     29  volkswagen    new beetle        1.9   1999      4  auto(l4)
  43     21  volkswagen    new beetle        2.0   1999      4  manual(m5)    
  44     21  volkswagen    passat            1.8   1999      4  manual(m5)
  45     21  volkswagen    passat            2.0   2008      4  manual(m6)
                                                   4 columns and 30 rows omitted

Distinct #

julia> unique(data, [:Cyl, :Drv])
9×11 DataFrame
 Row  Manufacturer  Model               Displ    Year   Cyl    Trans          
      Categorical…  Categorical…        Float64  Int32  Int32  Categorical…   
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi          a4                      1.8   1999      4  auto(l5)       
   2  audi          a4                      2.8   1999      6  auto(l5)
   3  audi          a4 quattro              1.8   1999      4  manual(m5)
   4  audi          a4 quattro              2.8   1999      6  auto(l5)
   5  audi          a6 quattro              4.2   2008      8  auto(s6)       
   6  chevrolet     c1500 suburban 2wd      5.3   2008      8  auto(l4)
   7  ford          mustang                 3.8   1999      6  manual(m5)
   8  pontiac       grand prix              5.3   2008      8  auto(s4)
   9  volkswagen    jetta                   2.5   2008      5  auto(s6)       
                                                               5 columns omitted
julia> R"""
       distinct(data, Cyl, Drv)
       """
RObject{VecSxp}
# A tibble: 9 x 2
    Cyl Drv  
  <int> <fct>
1     4 f    
2     6 f    
3     4 4    
4     6 4    
5     8 4    
6     8 r    
7     6 r    
8     8 f    
9     5 f    

Slice #

The point of a function such as slice is to be used in pipes really. Otherwise one can simply do this:

julia> R"""
       data[4:10,]
       """
RObject{VecSxp}
# A tibble: 7 x 11
  Manufacturer Model    Displ  Year   Cyl Trans   Drv     Cty   Hwy Fl    Class 
  <fct>        <fct>    <dbl> <int> <int> <fct>   <fct> <int> <int> <fct> <fct> 
1 audi         a4         2    2008     4 auto(a… f        21    30 p     compa…
2 audi         a4         2.8  1999     6 auto(l… f        16    26 p     compa…
3 audi         a4         2.8  1999     6 manual… f        18    26 p     compa…
4 audi         a4         3.1  2008     6 auto(a… f        18    27 p     compa…
5 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     compa…
6 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     compa…
7 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     compa…

Same with julia:

julia> data[4:10,:]
7×11 DataFrame
 Row  Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    
      Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego 
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi          a4                2.0   2008      4  auto(av)      f      
   2  audi          a4                2.8   1999      6  auto(l5)      f
   3  audi          a4                2.8   1999      6  manual(m5)    f
   4  audi          a4                3.1   2008      6  auto(av)      f
   5  audi          a4 quattro        1.8   1999      4  manual(m5)    4      
   6  audi          a4 quattro        1.8   1999      4  auto(l5)      4
   7  audi          a4 quattro        2.0   2008      4  manual(m6)    4
                                                               5 columns omitted

Slice-sample #

The StatsBase module can be used here.

julia> data[rand(1:nrow(data), 10),:]
10×11 DataFrame
 Row  Manufacturer  Model              Displ    Year   Cyl    Trans         D 
      Categorical…  Categorical…       Float64  Int32  Int32  Categorical…  C 
─────┼──────────────────────────────────────────────────────────────────────────
   1  ford          explorer 4wd           4.0   1999      6  auto(l5)      4 
   2  nissan        altima                 2.4   1999      4  auto(l4)      f
   3  volkswagen    new beetle             1.9   1999      4  manual(m5)    f
   4  hyundai       tiburon                2.7   2008      6  manual(m5)    f
   5  toyota        toyota tacoma 4wd      2.7   1999      4  manual(m5)    4 
   6  chevrolet     corvette               6.2   2008      8  manual(m6)    r
   7  hyundai       sonata                 3.3   2008      6  auto(l5)      f
   8  dodge         durango 4wd            5.9   1999      8  auto(l4)      4
   9  ford          f150 pickup 4wd        4.6   1999      8  manual(m5)    4 
  10  volkswagen    new beetle             2.5   2008      5  auto(s6)      f
                                                               5 columns omitted

Slice min and max #

This is just a convenience function:

julia> data[data.Cty .>= quantile(data.Cty, 0.75),:]
76×11 DataFrame
 Row  Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    
      Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego 
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi          a4                1.8   1999      4  manual(m5)    f      
   2  audi          a4                2.0   2008      4  manual(m6)    f
   3  audi          a4                2.0   2008      4  auto(av)      f
   4  audi          a4 quattro        2.0   2008      4  manual(m6)    4
   5  audi          a4 quattro        2.0   2008      4  auto(s6)      4      
   6  chevrolet     malibu            2.4   1999      4  auto(l4)      f
   7  chevrolet     malibu            2.4   2008      4  auto(l4)      f
   8  honda         civic             1.6   1999      4  manual(m5)    f
                                                                      
  70  volkswagen    new beetle        2.0   1999      4  manual(m5)    f      
  71  volkswagen    new beetle        2.0   1999      4  auto(l4)      f
  72  volkswagen    new beetle        2.5   2008      5  manual(m5)    f
  73  volkswagen    new beetle        2.5   2008      5  auto(s6)      f
  74  volkswagen    passat            1.8   1999      4  manual(m5)    f      
  75  volkswagen    passat            2.0   2008      4  auto(s6)      f
  76  volkswagen    passat            2.0   2008      4  manual(m6)    f
                                                   5 columns and 61 rows omitted

The R code:

julia> R"""
       slice_max(data, Cty, prop = 0.25)
       """
RObject{VecSxp}
# A tibble: 76 x 11
   Manufacturer Model   Displ  Year   Cyl Trans   Drv     Cty   Hwy Fl    Class 
   <fct>        <fct>   <dbl> <int> <int> <fct>   <fct> <int> <int> <fct> <fct> 
 1 volkswagen   new be…   1.9  1999     4 manual… f        35    44 d     subco…
 2 volkswagen   jetta     1.9  1999     4 manual… f        33    44 d     compa…
 3 volkswagen   new be…   1.9  1999     4 auto(l… f        29    41 d     subco…
 4 honda        civic     1.6  1999     4 manual… f        28    33 r     subco…
 5 toyota       corolla   1.8  2008     4 manual… f        28    37 r     compa…
 6 honda        civic     1.8  2008     4 manual… f        26    34 r     subco…
 7 toyota       corolla   1.8  1999     4 manual… f        26    35 r     compa…
 8 toyota       corolla   1.8  2008     4 auto(l… f        26    35 r     compa…
 9 honda        civic     1.6  1999     4 manual… f        25    32 r     subco…
10 honda        civic     1.8  2008     4 auto(l… f        25    36 r     subco…
# … with 66 more rows

Slice head and tail #

again, mostly meaningful for pipes. Nevertheless:

julia> data[end-5:end,:]
6×11 DataFrame
 Row  Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    
      Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego 
─────┼──────────────────────────────────────────────────────────────────────────
   1  volkswagen    passat            1.8   1999      4  auto(l5)      f      
   2  volkswagen    passat            2.0   2008      4  auto(s6)      f
   3  volkswagen    passat            2.0   2008      4  manual(m6)    f
   4  volkswagen    passat            2.8   1999      6  auto(l5)      f
   5  volkswagen    passat            2.8   1999      6  manual(m5)    f      
   6  volkswagen    passat            3.6   2008      6  auto(s6)      f
                                                               5 columns omitted

Arrange #

Instead of arrange, sort is used. The syntax is very clear:

julia> sort(data, [order(:Year, rev=true), :Displ])
234×11 DataFrame
 Row  Manufacturer  Model                Displ    Year   Cyl    Trans         
      Categorical…  Categorical…         Float64  Int32  Int32  Categorical…  
─────┼──────────────────────────────────────────────────────────────────────────
   1  honda         civic                    1.8   2008      4  manual(m5)    
   2  honda         civic                    1.8   2008      4  auto(l5)
   3  honda         civic                    1.8   2008      4  auto(l5)
   4  toyota        corolla                  1.8   2008      4  manual(m5)
   5  toyota        corolla                  1.8   2008      4  auto(l4)      
   6  audi          a4                       2.0   2008      4  manual(m6)
   7  audi          a4                       2.0   2008      4  auto(av)
   8  audi          a4 quattro               2.0   2008      4  manual(m6)
                                                                       
 228  chevrolet     c1500 suburban 2wd       5.7   1999      8  auto(l4)      
 229  chevrolet     corvette                 5.7   1999      8  manual(m6)
 230  chevrolet     corvette                 5.7   1999      8  auto(l4)
 231  chevrolet     k1500 tahoe 4wd          5.7   1999      8  auto(l4)
 232  dodge         durango 4wd              5.9   1999      8  auto(l4)      
 233  dodge         ram 1500 pickup 4wd      5.9   1999      8  auto(l4)
 234  chevrolet     k1500 tahoe 4wd          6.5   1999      8  auto(l4)
                                                  5 columns and 219 rows omitted
julia> R"""
       arrange(data, desc(Year), Displ)
       """
RObject{VecSxp}
# A tibble: 234 x 11
   Manufacturer Model   Displ  Year   Cyl Trans   Drv     Cty   Hwy Fl    Class 
   <fct>        <fct>   <dbl> <int> <int> <fct>   <fct> <int> <int> <fct> <fct> 
 1 honda        civic     1.8  2008     4 manual… f        26    34 r     subco…
 2 honda        civic     1.8  2008     4 auto(l… f        25    36 r     subco…
 3 honda        civic     1.8  2008     4 auto(l… f        24    36 c     subco…
 4 toyota       corolla   1.8  2008     4 manual… f        28    37 r     compa…
 5 toyota       corolla   1.8  2008     4 auto(l… f        26    35 r     compa…
 6 audi         a4        2    2008     4 manual… f        20    31 p     compa…
 7 audi         a4        2    2008     4 auto(a… f        21    30 p     compa…
 8 audi         a4 qua…   2    2008     4 manual… 4        20    28 p     compa…
 9 audi         a4 qua…   2    2008     4 auto(s… 4        19    27 p     compa…
10 honda        civic     2    2008     4 manual… f        21    29 p     subco…
# … with 224 more rows

One can in fact use any kind of sorting. See the sorting section in DataFrames.jl.

Adding row at position #

Again, this is just a convenience function. Let’s add a duplicate row for instance:

julia> vcat(data, DataFrame(data[5,:]))
235×11 DataFrame
 Row  Manufacturer  Model         Displ    Year   Cyl    Trans         Drv    
      Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego 
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi          a4                1.8   1999      4  auto(l5)      f      
   2  audi          a4                1.8   1999      4  manual(m5)    f
   3  audi          a4                2.0   2008      4  manual(m6)    f
   4  audi          a4                2.0   2008      4  auto(av)      f
   5  audi          a4                2.8   1999      6  auto(l5)      f      
   6  audi          a4                2.8   1999      6  manual(m5)    f
   7  audi          a4                3.1   2008      6  auto(av)      f
   8  audi          a4 quattro        1.8   1999      4  manual(m5)    4
                                                                      
 229  volkswagen    passat            1.8   1999      4  auto(l5)      f      
 230  volkswagen    passat            2.0   2008      4  auto(s6)      f
 231  volkswagen    passat            2.0   2008      4  manual(m6)    f
 232  volkswagen    passat            2.8   1999      6  auto(l5)      f
 233  volkswagen    passat            2.8   1999      6  manual(m5)    f      
 234  volkswagen    passat            3.6   2008      6  auto(s6)      f
 235  audi          a4                2.8   1999      6  auto(l5)      f
                                                  5 columns and 220 rows omitted

The issue is that specifying the position at which to insert the row can take extra work. See also append! and push! with the catch being modification of the original dataframe.

Extract variables #

Conventional indexing can be used:

pull #

One can just use the fact that the dataframe is a collection of vectors in a dictionary-like relation:

julia> data.Hwy
234-element Vector{Int32}:
 29
 29
 31
 30
 26
 26
 27
 26
 25
 28
  
 28
 29
 29
 29
 28
 29
 26
 26
 26
julia> R"pull(data, Hwy)"
RObject{IntSxp}
  [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17 26 23
 [26] 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21 23 23 19 18
 [51] 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16 17 15 17
 [76] 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25 26 24 21 22 23 22 20 33
[101] 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20
[126] 17 12 19 18 14 15 18 18 15 17 16 18 17 19 19 17 29 27 31 32 27 26 26 25 25
[151] 17 17 20 18 26 26 27 28 25 25 24 27 25 26 23 26 26 26 26 25 27 25 27 20 20
[176] 19 17 20 17 29 27 31 31 26 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18
[201] 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26
[226] 28 29 29 29 28 29 26 26 26

select #

julia> data[:, [:Hwy, :Cty]]
234×2 DataFrame
 Row  Hwy    Cty
      Int32  Int32
─────┼──────────────
   1     29     18
   2     29     21
   3     31     20
   4     30     21
   5     26     16
   6     26     18
   7     27     18
   8     26     18
             
 228     29     21
 229     29     18
 230     28     19
 231     29     21
 232     26     16
 233     26     18
 234     26     17
    219 rows omitted

Note that when selecting only a single column, it is no different than the dot syntax and the return structure is not dataframe:

julia> data[:, :Hwy]
234-element Vector{Int32}:
 29
 29
 31
 30
 26
 26
 27
 26
 25
 28
  
 28
 29
 29
 29
 28
 29
 26
 26
 26

Of course, wrapping in vector the column symbol gives the desired result:

julia> data[:, [:Hwy]]
234×1 DataFrame
 Row  Hwy
      Int32
─────┼───────
   1     29
   2     29
   3     31
   4     30
   5     26
   6     26
   7     27
   8     26
       
 228     29
 229     29
 230     28
 231     29
 232     26
 233     26
 234     26
219 rows omitted

relocate #

Just use Cols as before. For instance, to put columns starting with m at front:

julia> data[:, Cols(filter(x->occursin(r"^[mM].*", x), names(data)),
           :Year, :Class, :)]
234×11 DataFrame
 Row  Manufacturer  Model         Year   Class         Displ    Cyl    Trans  
      Categorical…  Categorical…  Int32  Categorical…  Float64  Int32  Catego 
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi          a4             1999  compact           1.8      4  auto(l 
   2  audi          a4             1999  compact           1.8      4  manual
   3  audi          a4             2008  compact           2.0      4  manual
   4  audi          a4             2008  compact           2.0      4  auto(a
   5  audi          a4             1999  compact           2.8      6  auto(l 
   6  audi          a4             1999  compact           2.8      6  manual
   7  audi          a4             2008  compact           3.1      6  auto(a
   8  audi          a4 quattro     1999  compact           1.8      4  manual
                                                                      
 228  volkswagen    passat         1999  midsize           1.8      4  manual 
 229  volkswagen    passat         1999  midsize           1.8      4  auto(l
 230  volkswagen    passat         2008  midsize           2.0      4  auto(s
 231  volkswagen    passat         2008  midsize           2.0      4  manual
 232  volkswagen    passat         1999  midsize           2.8      6  auto(l 
 233  volkswagen    passat         1999  midsize           2.8      6  manual
 234  volkswagen    passat         2008  midsize           3.6      6  auto(s
                                                  5 columns and 219 rows omitted

To put Model after Year:

julia> data[:, Cols( setdiff(propertynames(data), [:Model]) |> x->insert!(x,
           findfirst(x .== :Year) + 1, :Model))]
234×11 DataFrame
 Row  Manufacturer  Displ    Year   Model         Cyl    Trans         Drv    
      Categorical…  Float64  Int32  Categorical…  Int32  Categorical…  Catego 
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi              1.8   1999  a4                4  auto(l5)      f      
   2  audi              1.8   1999  a4                4  manual(m5)    f
   3  audi              2.0   2008  a4                4  manual(m6)    f
   4  audi              2.0   2008  a4                4  auto(av)      f
   5  audi              2.8   1999  a4                6  auto(l5)      f      
   6  audi              2.8   1999  a4                6  manual(m5)    f
   7  audi              3.1   2008  a4                6  auto(av)      f
   8  audi              1.8   1999  a4 quattro        4  manual(m5)    4
                                                                      
 228  volkswagen        1.8   1999  passat            4  manual(m5)    f      
 229  volkswagen        1.8   1999  passat            4  auto(l5)      f
 230  volkswagen        2.0   2008  passat            4  auto(s6)      f
 231  volkswagen        2.0   2008  passat            4  manual(m6)    f
 232  volkswagen        2.8   1999  passat            6  auto(l5)      f      
 233  volkswagen        2.8   1999  passat            6  manual(m5)    f
 234  volkswagen        3.6   2008  passat            6  auto(s6)      f
                                                  5 columns and 219 rows omitted
julia> R"""
       relocate(data, Model, .after = Year)
       """
RObject{VecSxp}
# A tibble: 234 x 11
   Manufacturer Displ  Year Model      Cyl Trans   Drv     Cty   Hwy Fl    Class
   <fct>        <dbl> <int> <fct>    <int> <fct>   <fct> <int> <int> <fct> <fct>
 1 audi           1.8  1999 a4           4 auto(l… f        18    29 p     comp…
 2 audi           1.8  1999 a4           4 manual… f        21    29 p     comp…
 3 audi           2    2008 a4           4 manual… f        20    31 p     comp…
 4 audi           2    2008 a4           4 auto(a… f        21    30 p     comp…
 5 audi           2.8  1999 a4           6 auto(l… f        16    26 p     comp…
 6 audi           2.8  1999 a4           6 manual… f        18    26 p     comp…
 7 audi           3.1  2008 a4           6 auto(a… f        18    27 p     comp…
 8 audi           1.8  1999 a4 quat…     4 manual… 4        18    26 p     comp…
 9 audi           1.8  1999 a4 quat…     4 auto(l… 4        16    25 p     comp…
10 audi           2    2008 a4 quat…     4 manual… 4        20    28 p     comp…
# … with 224 more rows

Currently, there aren’t any helper functions for selection, though the general procedure isn’t that verbose in base Julia either.

Across function #

As mentioned in the beginning, Julia provides quite flexible ways of combining data, and there is no need for extra across functions, mainly due to the employment of the broadcasting concept.

Making new variables #

mutate, and transmutate #

The mutate equivalent is transform. The select function is the transmutate equivalent, which only selects the specifided columns, and after applying the specified functions, returns only said columns.

rename #

julia> rename(data, [:Model, :Year] .=> [:type, :date])
234×11 DataFrame
 Row  Manufacturer  type          Displ    date   Cyl    Trans         Drv    
      Categorical…  Categorical…  Float64  Int32  Int32  Categorical…  Catego 
─────┼──────────────────────────────────────────────────────────────────────────
   1  audi          a4                1.8   1999      4  auto(l5)      f      
   2  audi          a4                1.8   1999      4  manual(m5)    f
   3  audi          a4                2.0   2008      4  manual(m6)    f
   4  audi          a4                2.0   2008      4  auto(av)      f
   5  audi          a4                2.8   1999      6  auto(l5)      f      
   6  audi          a4                2.8   1999      6  manual(m5)    f
   7  audi          a4                3.1   2008      6  auto(av)      f
   8  audi          a4 quattro        1.8   1999      4  manual(m5)    4
                                                                      
 228  volkswagen    passat            1.8   1999      4  manual(m5)    f      
 229  volkswagen    passat            1.8   1999      4  auto(l5)      f
 230  volkswagen    passat            2.0   2008      4  auto(s6)      f
 231  volkswagen    passat            2.0   2008      4  manual(m6)    f
 232  volkswagen    passat            2.8   1999      6  auto(l5)      f      
 233  volkswagen    passat            2.8   1999      6  manual(m5)    f
 234  volkswagen    passat            3.6   2008      6  auto(s6)      f
                                                  5 columns and 219 rows omitted

This concludes the first page of the cheatsheet.