Data Tidying
Clear Workspace, DON’T EDIT
Always start by clearing the workspace. This ensure objects created in other files are not used used here.
List Used Packages, EDIT
List all the packages that will be used in chunk below.
Load Packages, DON’T EDIT
Install Missing
Be aware the people may not like installing packages into their machine automatically. This might break some of their previous code.
Any missing package will be installed automatically. This ensure smoother execution when run by others.
Load
Load all packages
12.1 Introduction
This page explains how use the tidyr
package to put data in tidy form where:
- each row represents an observation
- each column represents a variable
- each cell contain a single value
Putting data in tidy form will make it easy to process using tidyverse packages.
Tidying is 1-to-1 process–the data takes different form but it can be put back into its original form which mean that no values are lost.
12.2 Lengthening Data, pivot_longer
When pivoting longer, the number of the rows in the dataset increases while the number of columns decreases.
12.2.1 One Variable in Column Headers
12.2.1.1 Toy Dataset
The following toy dataset will be used to illustrate the concepts in this section. The book used tidyr::billboard
dataset.
12.2.1.2 Lengthen
I want the values in all the columns that start with
B_
to be placed into a (single) column namedvalue
. To distinguish which value belong to which column, create a new column calledB_type
for this purpuse.
Assume the dimensions of the old dataset are:
- number of cols: co
- number of rows: ro
The dimensions of the new dataset are:
- number of cols: co - number of combined columns - 1
- number of rows: ro * (number of combined columns - 1)
12.2.1.3 Remove NA
Use the argument values_drop_na = TRUE
12.2.1.4 Fix Cell Values
Use the readr::parse_number()
function to extract the first number from var2 variable and ignore all other text.
ds_lengthen <- ds |>
pivot_longer(
cols = starts_with("B_"),
names_to = "B_type",
values_to = "value",
values_drop_na = TRUE
) |>
mutate(
B_type = readr::parse_number(B_type)
)
ds_lengthen
# A tibble: 4 × 4
id A B_type value
<chr> <dbl> <dbl> <dbl>
1 A 1 1 10.1
2 A 1 2 10.2
3 B 2 1 20.1
4 C 3 2 30.2
12.2.2 Multiple Variables in Column Headers
12.2.2.1 Toy Dataset
The following toy dataset will be used to illustrate the concepts in this section. The book used tidyr::who2
dataset.
ds2 <- tribble(
~id, ~A, ~B1_C1, ~B1_C2, ~B2_C1, ~B2_C2,
"A", 1, 10.11, 10.12, 10.21, 10.22,
"B", 2, 20.11, 20.12, NA, 20.22,
"C", 3, 30.11, NA, 30.21, 30.22
)
ds2
# A tibble: 3 × 6
id A B1_C1 B1_C2 B2_C1 B2_C2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 10.1 10.1 10.2 10.2
2 B 2 20.1 20.1 NA 20.2
3 C 3 30.1 NA 30.2 30.2
12.2.2.2 Lengthening w/o Seperating Variables
ds_lengthen <- ds2 |>
pivot_longer(
cols = starts_with("B"),
names_to = "B_C",
values_to = "value"
)
ds_lengthen
# A tibble: 12 × 4
id A B_C value
<chr> <dbl> <chr> <dbl>
1 A 1 B1_C1 10.1
2 A 1 B1_C2 10.1
3 A 1 B2_C1 10.2
4 A 1 B2_C2 10.2
5 B 2 B1_C1 20.1
6 B 2 B1_C2 20.1
7 B 2 B2_C1 NA
8 B 2 B2_C2 20.2
9 C 3 B1_C1 30.1
10 C 3 B1_C2 NA
11 C 3 B2_C1 30.2
12 C 3 B2_C2 30.2
12.2.2.3 Lengthening w/ Seperating Variables
ds_lengthen <- ds2 |>
pivot_longer(
cols = !(id:A),
names_sep = "_",
names_to = c("B", "C"),
values_to = "value"
)
ds_lengthen
# A tibble: 12 × 5
id A B C value
<chr> <dbl> <chr> <chr> <dbl>
1 A 1 B1 C1 10.1
2 A 1 B1 C2 10.1
3 A 1 B2 C1 10.2
4 A 1 B2 C2 10.2
5 B 2 B1 C1 20.1
6 B 2 B1 C2 20.1
7 B 2 B2 C1 NA
8 B 2 B2 C2 20.2
9 C 3 B1 C1 30.1
10 C 3 B1 C2 NA
11 C 3 B2 C1 30.2
12 C 3 B2 C2 30.2
12.2.2.4 Dropping NA
ds_lengthen <- ds2 |>
pivot_longer(
cols = !(id:A),
names_sep = "_",
names_to = c("B", "C"),
values_to = "value",
values_drop_na = TRUE
)
ds_lengthen
# A tibble: 10 × 5
id A B C value
<chr> <dbl> <chr> <chr> <dbl>
1 A 1 B1 C1 10.1
2 A 1 B1 C2 10.1
3 A 1 B2 C1 10.2
4 A 1 B2 C2 10.2
5 B 2 B1 C1 20.1
6 B 2 B1 C2 20.1
7 B 2 B2 C2 20.2
8 C 3 B1 C1 30.1
9 C 3 B2 C1 30.2
10 C 3 B2 C2 30.2
12.2.3 Data and Variable Names in Colmnn Headers
12.2.3.1 Toy Dataset
The following toy dataset will be used to illustrate the concepts in this section. The book used tidyr::household
dataset.
12.2.3.2 Lengthening w/o Removing NA
12.2.3.3 Lengthening w Removing NA
12.2.3.4 Fixing Cell Values
ds_lengthen <- ds3 |>
pivot_longer(
cols = starts_with("child"),
names_sep = "_",
names_to = c("child", ".value"),
values_drop_na = TRUE
) |>
mutate(
child = readr::parse_number(child)
)
ds_lengthen
# A tibble: 4 × 4
id child name age
<chr> <dbl> <chr> <dbl>
1 A 1 A1 11
2 A 2 A2 12
3 B 1 B1 21
4 C 2 C2 32
12.3 Widening Data, pivot-wider
When pivoting wider, the number of the columns in the dataset increases while the number of rows decreases.
12.3.1 No Missing Values
12.3.1.1 Toy Dataset
Notice that each unique value in the id
column has a single value for each of the unique values in the M
column.
12.3.1.2 Widenning
12.3.2 Missing Values
12.3.2.1 Toy Dataset
Notice that NOT each unique value in the id
column has a single value for each of the unique values in the M
column–B
does not have value for the M2
value.
12.3.2.2 Widenning
Notice that B
observation will be assigned NA
as its value in the M2
column.
12.3.3 Duplicate Values
12.3.3.1 Toy Dataset
Notice that NOT each unique value in the id
column has a single value for each of the unique values in the M
column–B has multiple value for M2
.
12.3.3.2 Widenning
Notice that the generated values are list-cols
–see the warning message for details.
12.4 Reference
- Wickham, H. . (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10 (webpage)
- details the history and underlying theory behind tidy data