Data Transformation

Clear Workspace, DON’T EDIT

Always start by clearing the workspace. This ensure objects created in other files are not used used here.

rm(list = ls())

List Used Packages, EDIT

List all the packages that will be used in chunk below.

packages = c("nycflights13", "ggplot2", "dplyr")

Load Packages, DON’T EDIT

Install Missing

Any missing package will be installed automatically. This ensure smoother execution when run by others.

Installing Packages on Other People Machine

Be aware the people may not like installing packages into their machine automatically. This might break some of their previous code.

# Do NOT modify
install.packages(setdiff(packages, rownames(installed.packages())))
# Downloading packages -------------------------------------------------------
- Downloading nycflights13 from CRAN ...        OK [4.3 Mb]
Successfully downloaded 1 package in 0.35 seconds.

The following package(s) will be installed:
- nycflights13 [1.0.2]
These packages will be installed into "~/work/notebook/notebook/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu".

# Installing packages --------------------------------------------------------
- Installing nycflights13 ...                   OK [built from source and cached in 8.8s]
Successfully installed 1 package in 8.8 seconds.

Load

Load all packages

# Do NOT modify
lapply(packages, require, character.only = TRUE)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

10.1 Introduction

This page introduces the dplyr package used to transform data such as creating new variables, editing existing variables, filtering out observations, and creating summaries.

10.2 dplyr Functions (Verbs)

10.2.1 Four Groups

dplyr functions (verbs) can be grouped into functions that work on:

  • rows, eg, filter(), arrange(), distinct(), count()
  • columns, eg, mutate(), select(), rename(), relocate()
  • groups, eg, summarize(), slice_max, group_by, ungroup(), .by
  • tables

10.2.2 Common Characteristics

All the functions have the followings in common:

  • their first argument is always a data frame
  • their subsequent arguments typically describe which columns to operate on using variable names without quotes
  • they always output a new data frame, they don’t modify the passed one

10.2.3 Pipe |> Operator

  • The pipe |> operator takes what on its left and pass it to the function on its right so that x |> f(y) (pronounced as x then f(y)) is equivalent to f(x, y) and x |> f(y) |> g(z) (pronounced as x then f(y) then g(z)) is equivalent to g(f(x, y), z)
  • The base R pipe operator |> was introduced in R 4.1.0 in 2021 while the tidyverse magrittr pipe operator %>% was introduced in 2014. Using |> instead of %>% makes our code run when we don’t use tidyverse

10.2.4 Row Functions

  • The following filter() statements are equivalent:
    • filter(ds, var == 1 or var == 2)
    • filter(ds, var == 1 | var == 2)
    • filter(ds, var %in% c(1,2))
  • The following filter() statements are equivalent:
    • filter(ds, var == 1 and var == 2)
    • filter(ds, var == 1, var == 2)
  • The following arrange() statements order data differently
    • arrange(ds, var) ascendant order
    • arrange(ds, desc(var)) decedent order
  • The following distinct() statements return different data frames
    • distinct(ds, var1, var2) only keep columns var1 and var2
    • distinct(ds, var1, var2, .keep_all = TRUE) keep all the columns–find the first observation where var1 and var2 are distinct and discard the rest
  • The following count() statements order the results differently
    • count(ds, var1, var2) arrange results in order they are encountered
    • count(ds, var1, var2, sort = TRUE) arrange results in descending order of number of occurrence

10.2.5 Column Functions

  • mutate()
    • Instead of adding the newly created variable to the right hand side of the data frame, we can instruct mutate() to adding before a variable using the .before attribute or after a variable using the .after attribute
    • To only keep the variables involved in the creation of the new variables, we can instruct mutate() to do so by setting the .keep = "used" attribute
  • select()
    • select range of variables: select(ds, var_x:var_y)
    • select all variables except certain range: select(ds, !var_x:var_y)
    • select character variables only: select(ds, where(is.character))
    • select variables whose name start with something: select(ds, start_with("m"))
    • select variables whose name end with something: select(ds, end_with("m"))
    • select variables whose name contain something: select(ds, contains("m"))
    • select variables whose name follow some range: select(ds, num_range("x", 1:3))
    • select and rename variable: select(ds, var1_new = var1, var2_new = var2)
  • rename()
    • to rename many columns, it is better to use janitor::clean_names() function
  • relocate()
    • By default, bring columns to left hand side of the data frame
    • relocate(ds, var1, .after = var2) puts var1 after var2
    • relocate(ds, var1, .before = var2) puts var1 before var2

10.2.6 Groups Functions

  • group_by()
    • divides the data into groups so that subsequent operations work on these groups
    • it added a class to the dataset to indicate the grouping
  • summarize() or summarise()
    • To prevent summary statistics functions, eg, mean() to give NA due to some groups has NA (missing) values, set their argument: na.rm = TRUE
    • The summary statistics function n() gives the number of observations in the group
    • Each summary peels off the last group. To prevent this behavior, change the default value drop_last of the .groups argument of the summary statistic function to either keep to keep all groups or drop to drop all groups
  • `slice_` functions
    • The functions are: slice_head(), slice_tail(), slice_min(), slice_max(), and slice_sample()
    • To slice a number of rows from each group, use the n arguments, eg, n = 1
    • To slice percentage of rows from each group, use the prop argument, eg, prop = .1 (10%)
    • To prevent ties from showing, use with_ties = FALSE argument
  • `.by` argument
    • New addition to dplyr 1.1.0 (more information at dplyr 1.1.0 blog post)
    • Per-operation grouping–can be used all verbs. The advantage is that we don’t need to use the .groups argument to suppress the warning message raised by summarize() when grouping by multiple variables and we don’t need to use ungroup() when done with our summary.