Data Transformation

Clear Workspace, DON’T EDIT

Always start by clearing the workspace. This ensure objects created in other files are not used used here.

rm(list = ls())

List Used Packages, EDIT

List all the packages that will be used in chunk below.

packages = c("nycflights13", "ggplot2", "dplyr")

Load Packages, DON’T EDIT

Install Missing

Any missing package will be installed automatically. This ensure smoother execution when run by others.

Installing Packages on Other People Machine

Be aware the people may not like installing packages into their machine automatically. This might break some of their previous code.

# Do NOT modify
install.packages(setdiff(packages, rownames(installed.packages())))

# Downloading packages -------------------------------------------------------
- Downloading nycflights13 from CRAN ...        OK [4.3 Mb]
Successfully downloaded 1 package in 0.35 seconds.

The following package(s) will be installed:
- nycflights13 [1.0.2]
These packages will be installed into "~/work/notebook/notebook/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu".

# Installing packages --------------------------------------------------------
- Installing nycflights13 ...                   OK [built from source and cached in 8.8s]
Successfully installed 1 package in 8.8 seconds.

Load

Load all packages

# Do NOT modify
lapply(packages, require, character.only = TRUE)

[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

10.1 Introduction

This page introduces the dplyr package used to transform data such as creating new variables, editing existing variables, filtering out observations, and creating summaries.

10.2 `dplyr` Functions (Verbs)

10.2.1 Four Groups

dplyr functions (verbs) can be grouped into functions that work on:

rows, eg, filter(), arrange(), distinct(), count()
columns, eg, mutate(), select(), rename(), relocate()
groups, eg, summarize(), slice_max, group_by, ungroup(), .by
tables

10.2.2 Common Characteristics

All the functions have the followings in common:

their first argument is always a data frame
their subsequent arguments typically describe which columns to operate on using variable names without quotes
they always output a new data frame, they don’t modify the passed one

10.2.3 Pipe `|>` Operator

The pipe |> operator takes what on its left and pass it to the function on its right so that x |> f(y) (pronounced as x then f(y)) is equivalent to f(x, y) and x |> f(y) |> g(z) (pronounced as x then f(y) then g(z)) is equivalent to g(f(x, y), z)
The base R pipe operator |> was introduced in R 4.1.0 in 2021 while the tidyverse magrittr pipe operator %>% was introduced in 2014. Using |> instead of %>% makes our code run when we don’t use tidyverse

10.2.4 Row Functions

The following filter() statements are equivalent:
- filter(ds, var == 1 or var == 2)
- filter(ds, var == 1 | var == 2)
- filter(ds, var %in% c(1,2))
The following filter() statements are equivalent:
- filter(ds, var == 1 and var == 2)
- filter(ds, var == 1, var == 2)
The following arrange() statements order data differently
- arrange(ds, var) ascendant order
- arrange(ds, desc(var)) decedent order
The following distinct() statements return different data frames
- distinct(ds, var1, var2) only keep columns var1 and var2
- distinct(ds, var1, var2, .keep_all = TRUE) keep all the columns–find the first observation where var1 and var2 are distinct and discard the rest
The following count() statements order the results differently
- count(ds, var1, var2) arrange results in order they are encountered
- count(ds, var1, var2, sort = TRUE) arrange results in descending order of number of occurrence

10.2.5 Column Functions

mutate()
- Instead of adding the newly created variable to the right hand side of the data frame, we can instruct mutate() to adding before a variable using the .before attribute or after a variable using the .after attribute
- To only keep the variables involved in the creation of the new variables, we can instruct mutate() to do so by setting the .keep = "used" attribute
select()
- select range of variables: select(ds, var_x:var_y)
- select all variables except certain range: select(ds, !var_x:var_y)
- select character variables only: select(ds, where(is.character))
- select variables whose name start with something: select(ds, start_with("m"))
- select variables whose name end with something: select(ds, end_with("m"))
- select variables whose name contain something: select(ds, contains("m"))
- select variables whose name follow some range: select(ds, num_range("x", 1:3))
- select and rename variable: select(ds, var1_new = var1, var2_new = var2)
rename()
- to rename many columns, it is better to use janitor::clean_names() function
relocate()
- By default, bring columns to left hand side of the data frame
- relocate(ds, var1, .after = var2) puts var1 after var2
- relocate(ds, var1, .before = var2) puts var1 before var2

10.2.6 Groups Functions

group_by()
- divides the data into groups so that subsequent operations work on these groups
- it added a class to the dataset to indicate the grouping
summarize() or summarise()
- To prevent summary statistics functions, eg, mean() to give NA due to some groups has NA (missing) values, set their argument: na.rm = TRUE
- The summary statistics function n() gives the number of observations in the group
- Each summary peels off the last group. To prevent this behavior, change the default value drop_last of the .groups argument of the summary statistic function to either keep to keep all groups or drop to drop all groups
```
`slice_` functions
```
- The functions are: slice_head(), slice_tail(), slice_min(), slice_max(), and slice_sample()
- To slice a number of rows from each group, use the n arguments, eg, n = 1
- To slice percentage of rows from each group, use the prop argument, eg, prop = .1 (10%)
- To prevent ties from showing, use with_ties = FALSE argument
```
`.by` argument
```
- New addition to dplyr 1.1.0 (more information at dplyr 1.1.0 blog post)
- Per-operation grouping–can be used all verbs. The advantage is that we don’t need to use the .groups argument to suppress the warning message raised by summarize() when grouping by multiple variables and we don’t need to use ungroup() when done with our summary.

--- title: "Data Transformation" number-sections: true --- ## Clear Workspace, DON'T EDIT {-} Always start by clearing the workspace. This ensure objects created in other files are not used used here. ```{r} #| include: true #| results: 'hide' rm(list = ls()) ``` ## List Used Packages, EDIT {-} List all the packages that will be used in chunk below. ```{r} #| include: true #| echo: true #| error: true packages = c("nycflights13", "ggplot2", "dplyr") ``` ## Load Packages, DON'T EDIT {#sec-packages -} ### Install Missing {-} Any missing package will be installed automatically. This ensure smoother execution when run by others. ::: callout-important #### Installing Packages on Other People Machine Be aware the people may not like installing packages into their machine automatically. This might break some of their previous code. ::: ```{r Install Missing Packages} #| include: true #| echo: true #| error: true # Do NOT modify install.packages(setdiff(packages, rownames(installed.packages()))) ``` ### Load {-} Load all packages ```{r Load Packages} #| include: true #| echo: true #| error: true # Do NOT modify lapply(packages, require, character.only = TRUE) ``` ## Introduction This page introduces the `dplyr` package used to transform data such as creating new variables, editing existing variables, filtering out observations, and creating summaries. ## `dplyr` Functions (Verbs) ### Four Groups `dplyr` functions (verbs) can be grouped into functions that work on: - rows, eg, `filter()`, `arrange()`, `distinct()`, `count()` - columns, eg, `mutate()`, `select()`, `rename()`, `relocate()` - groups, eg, `summarize()`, `slice_max`, `group_by`, `ungroup()`, `.by` - tables ### Common Characteristics All the functions have the followings in common: - their first argument is always a data frame - their subsequent arguments typically describe which columns to operate on using variable names *without quotes* - they always output a new data frame, they don't modify the passed one ### Pipe `|>` Operator - The pipe `|>` operator takes what on its left and pass it to the function on its right so that `x |> f(y)` (pronounced as `x` *then* `f(y)`) is equivalent to `f(x, y)` and `x |> f(y) |> g(z)` (pronounced as `x` then `f(y)` then `g(z)`) is equivalent to `g(f(x, y), z)` - The `base` R pipe operator `|>` was introduced in R 4.1.0 in 2021 while the tidyverse `magrittr` pipe operator `%>%` was introduced in 2014. Using `|>` instead of `%>%` makes our code run when we don't use tidyverse ### Row Functions - The following `filter()` statements are equivalent: - `filter(ds, var == 1 or var == 2)` - `filter(ds, var == 1 | var == 2)` - `filter(ds, var %in% c(1,2))` - The following `filter()` statements are equivalent: - `filter(ds, var == 1 and var == 2)` - `filter(ds, var == 1, var == 2)` - The following `arrange()` statements order data differently - `arrange(ds, var)` ascendant order - `arrange(ds, desc(var))` decedent order - The following `distinct()` statements return different data frames - `distinct(ds, var1, var2)` only keep columns `var1` and `var2` - `distinct(ds, var1, var2, .keep_all = TRUE)` keep all the columns--find the *first* observation where `var1` and `var2` are distinct and discard the rest - The following `count()` statements order the results differently - `count(ds, var1, var2)` arrange results in order they are encountered - `count(ds, var1, var2, sort = TRUE)` arrange results in descending order of number of occurrence ### Column Functions - `mutate()` - Instead of adding the newly created variable to the right hand side of the data frame, we can instruct `mutate()` to adding before a variable using the `.before` attribute or after a variable using the `.after` attribute - To only keep the variables involved in the creation of the new variables, we can instruct `mutate()` to do so by setting the `.keep = "used"` attribute - `select()` - select range of variables: `select(ds, var_x:var_y)` - select all variables except certain range: `select(ds, !var_x:var_y)` - select character variables only: `select(ds, where(is.character))` - select variables whose name start with something: `select(ds, start_with("m"))` - select variables whose name end with something: `select(ds, end_with("m"))` - select variables whose name contain something: `select(ds, contains("m"))` - select variables whose name follow some range: `select(ds, num_range("x", 1:3))` - select and rename variable: `select(ds, var1_new = var1, var2_new = var2)` - `rename()` - to rename many columns, it is better to use `janitor::clean_names()` function - `relocate()` - By default, bring columns to left hand side of the data frame - `relocate(ds, var1, .after = var2)` puts `var1` after `var2` - `relocate(ds, var1, .before = var2)` puts `var1` before `var2` ### Groups Functions - `group_by()` - divides the data into groups so that subsequent operations work on these groups - it added a *class* to the dataset to indicate the grouping - `summarize()` or `summarise()` - To prevent summary statistics functions, eg, `mean()` to give `NA` due to some groups has `NA` (missing) values, set their argument: `na.rm = TRUE` - The summary statistics function `n()` gives the number of observations in the group - Each summary peels off the last group. To prevent this behavior, change the default value `drop_last` of the `.groups` argument of the summary statistic function to either `keep` to keep all groups or `drop` to drop all groups - `slice_` functions - The functions are: `slice_head()`, `slice_tail()`, `slice_min()`, `slice_max()`, and `slice_sample()` - To slice a number of rows from each group, use the `n` arguments, eg, `n = 1` - To slice percentage of rows from each group, use the `prop` argument, eg, `prop = .1` (10%) - To prevent ties from showing, use `with_ties = FALSE` argument - `.by` argument - New addition to dplyr 1.1.0 (more information at [dplyr 1.1.0 blog post](https://www.tidyverse.org/blog/2023/02/dplyr-1-1-0-per-operation-grouping/)) - Per-operation grouping--can be used all verbs. The advantage is that we don't need to use the `.groups` argument to suppress the warning message raised by `summarize()` when grouping by multiple variables and we don't need to use `ungroup()` when done with our summary.

Clear Workspace, DON’T EDIT

List Used Packages, EDIT

Load Packages, DON’T EDIT

Install Missing

Load

10.1 Introduction

10.2 dplyr Functions (Verbs)

10.2.1 Four Groups

10.2.2 Common Characteristics

10.2.3 Pipe |> Operator

10.2.4 Row Functions

10.2.5 Column Functions

10.2.6 Groups Functions

10.2 `dplyr` Functions (Verbs)

10.2.3 Pipe `|>` Operator