Data Tidying

Clear Workspace, DON’T EDIT

Always start by clearing the workspace. This ensure objects created in other files are not used used here.

rm(list = ls())

List Used Packages, EDIT

List all the packages that will be used in chunk below.

packages <- c("styler", "dplyr", "tidyr")

Load Packages, DON’T EDIT

Install Missing

Installing Packages on Other People Machine

Be aware the people may not like installing packages into their machine automatically. This might break some of their previous code.

Any missing package will be installed automatically. This ensure smoother execution when run by others.

# Do NOT modify
install.packages(setdiff(packages, rownames(installed.packages())))
- There are no packages to install.

Load

Load all packages

# Do NOT modify
lapply(packages, require, character.only = TRUE)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

12.1 Introduction

This page explains how use the tidyr package to put data in tidy form where:

  • each row represents an observation
  • each column represents a variable
  • each cell contain a single value

Putting data in tidy form will make it easy to process using tidyverse packages.

Data Trasformation Bottom-Line

Tidying is 1-to-1 process–the data takes different form but it can be put back into its original form which mean that no values are lost.

12.2 Lengthening Data, pivot_longer

`pivot_longer’ Idea

When pivoting longer, the number of the rows in the dataset increases while the number of columns decreases.

12.2.1 One Variable in Column Headers

12.2.1.1 Toy Dataset

The following toy dataset will be used to illustrate the concepts in this section. The book used tidyr::billboard dataset.

ds <- tribble(
  ~id, ~A, ~B_1, ~B_2,
  "A", 1, 10.1, 10.2,
  "B", 2, 20.1, NA,
  "C", 3, NA, 30.2
)

ds
# A tibble: 3 × 4
  id        A   B_1   B_2
  <chr> <dbl> <dbl> <dbl>
1 A         1  10.1  10.2
2 B         2  20.1  NA  
3 C         3  NA    30.2

12.2.1.2 Lengthen

I want the values in all the columns that start with B_ to be placed into a (single) column named value. To distinguish which value belong to which column, create a new column called B_type for this purpuse.

Dimensions of New Dataset

Assume the dimensions of the old dataset are:

  • number of cols: co
  • number of rows: ro

The dimensions of the new dataset are:

  • number of cols: co - number of combined columns - 1
  • number of rows: ro * (number of combined columns - 1)
ds_lengthen <- ds |> 
  pivot_longer(
    cols = starts_with("B_"),
    names_to = "B_type",
    values_to = "value"
  )

ds_lengthen
# A tibble: 6 × 4
  id        A B_type value
  <chr> <dbl> <chr>  <dbl>
1 A         1 B_1     10.1
2 A         1 B_2     10.2
3 B         2 B_1     20.1
4 B         2 B_2     NA  
5 C         3 B_1     NA  
6 C         3 B_2     30.2

12.2.1.3 Remove NA

Use the argument values_drop_na = TRUE

ds_lengthen <- ds |> 
  pivot_longer(
    cols = starts_with("B_"),
    names_to = "B_type",
    values_to = "value",
    values_drop_na = TRUE
  )

ds_lengthen
# A tibble: 4 × 4
  id        A B_type value
  <chr> <dbl> <chr>  <dbl>
1 A         1 B_1     10.1
2 A         1 B_2     10.2
3 B         2 B_1     20.1
4 C         3 B_2     30.2

12.2.1.4 Fix Cell Values

Use the readr::parse_number() function to extract the first number from var2 variable and ignore all other text.

ds_lengthen <- ds |> 
  pivot_longer(
    cols = starts_with("B_"),
    names_to = "B_type",
    values_to = "value",
    values_drop_na = TRUE
  ) |> 
  mutate(
    B_type = readr::parse_number(B_type)
  )

ds_lengthen
# A tibble: 4 × 4
  id        A B_type value
  <chr> <dbl>  <dbl> <dbl>
1 A         1      1  10.1
2 A         1      2  10.2
3 B         2      1  20.1
4 C         3      2  30.2

12.2.2 Multiple Variables in Column Headers

12.2.2.1 Toy Dataset

The following toy dataset will be used to illustrate the concepts in this section. The book used tidyr::who2 dataset.

ds2 <- tribble(
  ~id, ~A, ~B1_C1, ~B1_C2, ~B2_C1, ~B2_C2,
  "A", 1, 10.11, 10.12, 10.21, 10.22,
  "B", 2, 20.11, 20.12, NA, 20.22,
  "C", 3, 30.11, NA, 30.21, 30.22
)

ds2
# A tibble: 3 × 6
  id        A B1_C1 B1_C2 B2_C1 B2_C2
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A         1  10.1  10.1  10.2  10.2
2 B         2  20.1  20.1  NA    20.2
3 C         3  30.1  NA    30.2  30.2

12.2.2.2 Lengthening w/o Seperating Variables

ds_lengthen <- ds2 |> 
  pivot_longer(
    cols = starts_with("B"),
    names_to = "B_C",
    values_to = "value"
  )

ds_lengthen
# A tibble: 12 × 4
   id        A B_C   value
   <chr> <dbl> <chr> <dbl>
 1 A         1 B1_C1  10.1
 2 A         1 B1_C2  10.1
 3 A         1 B2_C1  10.2
 4 A         1 B2_C2  10.2
 5 B         2 B1_C1  20.1
 6 B         2 B1_C2  20.1
 7 B         2 B2_C1  NA  
 8 B         2 B2_C2  20.2
 9 C         3 B1_C1  30.1
10 C         3 B1_C2  NA  
11 C         3 B2_C1  30.2
12 C         3 B2_C2  30.2

12.2.2.3 Lengthening w/ Seperating Variables

ds_lengthen <- ds2 |> 
  pivot_longer(
    cols = !(id:A),
    names_sep = "_",
    names_to = c("B", "C"),
    values_to = "value"
  )

ds_lengthen
# A tibble: 12 × 5
   id        A B     C     value
   <chr> <dbl> <chr> <chr> <dbl>
 1 A         1 B1    C1     10.1
 2 A         1 B1    C2     10.1
 3 A         1 B2    C1     10.2
 4 A         1 B2    C2     10.2
 5 B         2 B1    C1     20.1
 6 B         2 B1    C2     20.1
 7 B         2 B2    C1     NA  
 8 B         2 B2    C2     20.2
 9 C         3 B1    C1     30.1
10 C         3 B1    C2     NA  
11 C         3 B2    C1     30.2
12 C         3 B2    C2     30.2

12.2.2.4 Dropping NA

ds_lengthen <- ds2 |> 
  pivot_longer(
    cols = !(id:A),
    names_sep = "_",
    names_to = c("B", "C"),
    values_to = "value",
    values_drop_na = TRUE
  )

ds_lengthen
# A tibble: 10 × 5
   id        A B     C     value
   <chr> <dbl> <chr> <chr> <dbl>
 1 A         1 B1    C1     10.1
 2 A         1 B1    C2     10.1
 3 A         1 B2    C1     10.2
 4 A         1 B2    C2     10.2
 5 B         2 B1    C1     20.1
 6 B         2 B1    C2     20.1
 7 B         2 B2    C2     20.2
 8 C         3 B1    C1     30.1
 9 C         3 B2    C1     30.2
10 C         3 B2    C2     30.2

12.2.3 Data and Variable Names in Colmnn Headers

12.2.3.1 Toy Dataset

The following toy dataset will be used to illustrate the concepts in this section. The book used tidyr::household dataset.

ds3 <- tribble(
  ~id, ~child1_name, ~child1_age, ~child2_name, ~child2_age,
  "A", "A1", 11, "A2", 12,
  "B", "B1", 21, NA, NA,
  "C", NA, NA, "C2", 32
)

ds3
# A tibble: 3 × 5
  id    child1_name child1_age child2_name child2_age
  <chr> <chr>            <dbl> <chr>            <dbl>
1 A     A1                  11 A2                  12
2 B     B1                  21 <NA>                NA
3 C     <NA>                NA C2                  32

12.2.3.2 Lengthening w/o Removing NA

ds_lengthen <- ds3 |> 
  pivot_longer(
    cols = starts_with("child"),
    names_sep = "_",
    names_to = c("child", ".value")
  )

ds_lengthen
# A tibble: 6 × 4
  id    child  name    age
  <chr> <chr>  <chr> <dbl>
1 A     child1 A1       11
2 A     child2 A2       12
3 B     child1 B1       21
4 B     child2 <NA>     NA
5 C     child1 <NA>     NA
6 C     child2 C2       32

12.2.3.3 Lengthening w Removing NA

ds_lengthen <- ds3 |> 
  pivot_longer(
    cols = starts_with("child"),
    names_sep = "_",
    names_to = c("child", ".value"),
    values_drop_na = TRUE
  )

ds_lengthen
# A tibble: 4 × 4
  id    child  name    age
  <chr> <chr>  <chr> <dbl>
1 A     child1 A1       11
2 A     child2 A2       12
3 B     child1 B1       21
4 C     child2 C2       32

12.2.3.4 Fixing Cell Values

ds_lengthen <- ds3 |> 
  pivot_longer(
    cols = starts_with("child"),
    names_sep = "_",
    names_to = c("child", ".value"),
    values_drop_na = TRUE
  ) |> 
  mutate(
    child = readr::parse_number(child)
  )

ds_lengthen
# A tibble: 4 × 4
  id    child name    age
  <chr> <dbl> <chr> <dbl>
1 A         1 A1       11
2 A         2 A2       12
3 B         1 B1       21
4 C         2 C2       32

12.3 Widening Data, pivot-wider

`pivot_wider’ Idea

When pivoting wider, the number of the columns in the dataset increases while the number of rows decreases.

12.3.1 No Missing Values

12.3.1.1 Toy Dataset

Notice that each unique value in the id column has a single value for each of the unique values in the M column.

ds <- tribble(
  ~id, ~M, ~V,
  "A", "M1", 10,
  "A", "M2", 11,
  "B", "M1", 20,
  "B", "M2", 21,
  "C", "M1", 30,
  "C", "M2", 31
)

ds
# A tibble: 6 × 3
  id    M         V
  <chr> <chr> <dbl>
1 A     M1       10
2 A     M2       11
3 B     M1       20
4 B     M2       21
5 C     M1       30
6 C     M2       31

12.3.1.2 Widenning

ds_widen <- ds |> 
  pivot_wider(
    names_from = M,
    values_from = V
  )

ds_widen
# A tibble: 3 × 3
  id       M1    M2
  <chr> <dbl> <dbl>
1 A        10    11
2 B        20    21
3 C        30    31

12.3.2 Missing Values

12.3.2.1 Toy Dataset

Notice that NOT each unique value in the id column has a single value for each of the unique values in the M column–B does not have value for the M2 value.

ds <- tribble(
  ~id, ~M, ~V,
  "A", "M1", 10,
  "A", "M2", 11,
  "B", "M1", 20,
  "C", "M1", 30,
  "C", "M2", 31
)

ds
# A tibble: 5 × 3
  id    M         V
  <chr> <chr> <dbl>
1 A     M1       10
2 A     M2       11
3 B     M1       20
4 C     M1       30
5 C     M2       31

12.3.2.2 Widenning

Notice that B observation will be assigned NA as its value in the M2 column.

ds_widen <- ds |> 
  pivot_wider(
    names_from = M,
    values_from = V
  )

ds_widen
# A tibble: 3 × 3
  id       M1    M2
  <chr> <dbl> <dbl>
1 A        10    11
2 B        20    NA
3 C        30    31

12.3.3 Duplicate Values

12.3.3.1 Toy Dataset

Notice that NOT each unique value in the id column has a single value for each of the unique values in the M column–B has multiple value for M2.

ds <- tribble(
  ~id, ~M, ~V,
  "A", "M1", 10,
  "A", "M2", 11,
  "B", "M1", 20,
  "B", "M2", 21,
  "B", "M2", 22,
  "C", "M1", 30,
  "C", "M2", 31
)

ds
# A tibble: 7 × 3
  id    M         V
  <chr> <chr> <dbl>
1 A     M1       10
2 A     M2       11
3 B     M1       20
4 B     M2       21
5 B     M2       22
6 C     M1       30
7 C     M2       31

12.3.3.2 Widenning

Notice that the generated values are list-cols–see the warning message for details.

ds_widen <- ds |> 
  pivot_wider(
    names_from = M,
    values_from = V
  )

ds_widen
# A tibble: 3 × 3
  id    M1        M2       
  <chr> <list>    <list>   
1 A     <dbl [1]> <dbl [1]>
2 B     <dbl [1]> <dbl [2]>
3 C     <dbl [1]> <dbl [1]>

12.4 Reference

  • Wickham, H. . (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10 (webpage)
    • details the history and underlying theory behind tidy data