Data Transformation
Clear Workspace, DON’T EDIT
Always start by clearing the workspace. This ensure objects created in other files are not used used here.
List Used Packages, EDIT
List all the packages that will be used in chunk below.
Load Packages, DON’T EDIT
Install Missing
Any missing package will be installed automatically. This ensure smoother execution when run by others.
Installing Packages on Other People Machine
Be aware the people may not like installing packages into their machine automatically. This might break some of their previous code.
# Downloading packages -------------------------------------------------------
- Downloading nycflights13 from CRAN ... OK [4.3 Mb]
Successfully downloaded 1 package in 0.35 seconds.
The following package(s) will be installed:
- nycflights13 [1.0.2]
These packages will be installed into "~/work/notebook/notebook/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu".
# Installing packages --------------------------------------------------------
- Installing nycflights13 ... OK [built from source and cached in 8.8s]
Successfully installed 1 package in 8.8 seconds.
Load
Load all packages
10.1 Introduction
This page introduces the dplyr
package used to transform data such as creating new variables, editing existing variables, filtering out observations, and creating summaries.
10.2 dplyr
Functions (Verbs)
10.2.1 Four Groups
dplyr
functions (verbs) can be grouped into functions that work on:
- rows, eg,
filter()
,arrange()
,distinct()
,count()
- columns, eg,
mutate()
,select()
,rename()
,relocate()
- groups, eg,
summarize()
,slice_max
,group_by
,ungroup()
,.by
- tables
10.2.2 Common Characteristics
All the functions have the followings in common:
- their first argument is always a data frame
- their subsequent arguments typically describe which columns to operate on using variable names without quotes
- they always output a new data frame, they don’t modify the passed one
10.2.3 Pipe |>
Operator
- The pipe
|>
operator takes what on its left and pass it to the function on its right so thatx |> f(y)
(pronounced asx
thenf(y)
) is equivalent tof(x, y)
andx |> f(y) |> g(z)
(pronounced asx
thenf(y)
theng(z)
) is equivalent tog(f(x, y), z)
- The
base
R pipe operator|>
was introduced in R 4.1.0 in 2021 while the tidyversemagrittr
pipe operator%>%
was introduced in 2014. Using|>
instead of%>%
makes our code run when we don’t use tidyverse
10.2.4 Row Functions
- The following
filter()
statements are equivalent:filter(ds, var == 1 or var == 2)
filter(ds, var == 1 | var == 2)
filter(ds, var %in% c(1,2))
- The following
filter()
statements are equivalent:filter(ds, var == 1 and var == 2)
filter(ds, var == 1, var == 2)
- The following
arrange()
statements order data differentlyarrange(ds, var)
ascendant orderarrange(ds, desc(var))
decedent order
- The following
distinct()
statements return different data framesdistinct(ds, var1, var2)
only keep columnsvar1
andvar2
distinct(ds, var1, var2, .keep_all = TRUE)
keep all the columns–find the first observation wherevar1
andvar2
are distinct and discard the rest
- The following
count()
statements order the results differentlycount(ds, var1, var2)
arrange results in order they are encounteredcount(ds, var1, var2, sort = TRUE)
arrange results in descending order of number of occurrence
10.2.5 Column Functions
mutate()
- Instead of adding the newly created variable to the right hand side of the data frame, we can instruct
mutate()
to adding before a variable using the.before
attribute or after a variable using the.after
attribute - To only keep the variables involved in the creation of the new variables, we can instruct
mutate()
to do so by setting the.keep = "used"
attribute
- Instead of adding the newly created variable to the right hand side of the data frame, we can instruct
select()
- select range of variables:
select(ds, var_x:var_y)
- select all variables except certain range:
select(ds, !var_x:var_y)
- select character variables only:
select(ds, where(is.character))
- select variables whose name start with something:
select(ds, start_with("m"))
- select variables whose name end with something:
select(ds, end_with("m"))
- select variables whose name contain something:
select(ds, contains("m"))
- select variables whose name follow some range:
select(ds, num_range("x", 1:3))
- select and rename variable:
select(ds, var1_new = var1, var2_new = var2)
- select range of variables:
rename()
- to rename many columns, it is better to use
janitor::clean_names()
function
- to rename many columns, it is better to use
relocate()
- By default, bring columns to left hand side of the data frame
relocate(ds, var1, .after = var2)
putsvar1
aftervar2
relocate(ds, var1, .before = var2)
putsvar1
beforevar2
10.2.6 Groups Functions
group_by()
- divides the data into groups so that subsequent operations work on these groups
- it added a class to the dataset to indicate the grouping
summarize()
orsummarise()
- To prevent summary statistics functions, eg,
mean()
to giveNA
due to some groups hasNA
(missing) values, set their argument:na.rm = TRUE
- The summary statistics function
n()
gives the number of observations in the group - Each summary peels off the last group. To prevent this behavior, change the default value
drop_last
of the.groups
argument of the summary statistic function to eitherkeep
to keep all groups ordrop
to drop all groups
- To prevent summary statistics functions, eg,
`slice_` functions
- The functions are:
slice_head()
,slice_tail()
,slice_min()
,slice_max()
, andslice_sample()
- To slice a number of rows from each group, use the
n
arguments, eg,n = 1
- To slice percentage of rows from each group, use the
prop
argument, eg,prop = .1
(10%) - To prevent ties from showing, use
with_ties = FALSE
argument
- The functions are:
`.by` argument
- New addition to dplyr 1.1.0 (more information at dplyr 1.1.0 blog post)
- Per-operation grouping–can be used all verbs. The advantage is that we don’t need to use the
.groups
argument to suppress the warning message raised bysummarize()
when grouping by multiple variables and we don’t need to useungroup()
when done with our summary.