14  Mid-semester Review

Learning Goals

Review the basics of wrangling and visualization

Instructions

General

  • Be kind to yourself.
  • Collaborate with and be kind to others. You are expected to work together as a group.
  • Ask questions. Remember that we won’t discuss these exercises as a class.

Activity Specific

Help each other with the following:

  • Create a new Quarto document in the activities folder of your portfolio project and do not forget to include it in the _quarto.yml file. Then click the </> Code link at the top right corner of this page and copy the code into the created Quarto document. This is where you’ll take notes. Remember that the portfolio is yours, so take notes in whatever way is best for you.
  • Don’t start the Exercises section before we get there as a class. First, engage in class discussion and eventually collaborate with your group on the exercises!

14.1 Warm-up

Thus far, we’ve learned how to:

  • use ggplot() to construct data visualizations
  • do some wrangling:
    • arrange() our data in a meaningful order
    • subset the data to only filter() the rows and select() the columns of interest
    • mutate() existing variables and define new variables
    • summarize() various aspects of a variable, both overall and by group (group_by())
  • reshape our data to fit the task at hand (pivot_longer(), pivot_wider())
  • join different datasets into one (e.g. left_join(), inner_join())
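
The examples below focus on plotting and the single-table verbs, so here is a minimal sketch of the reshaping and joining steps from the list above. The `scores_wide` and `majors` tables are hypothetical toy data invented for illustration; they are not part of the course datasets.

```r
library(tidyverse)

# Hypothetical toy data, invented for illustration only
scores_wide <- tibble(
  student = c("A", "B"),
  quiz_1  = c(8, 6),
  quiz_2  = c(9, 7)
)

# pivot_longer(): one row per student-quiz combination
scores_long <- scores_wide |> 
  pivot_longer(
    cols      = c(quiz_1, quiz_2),
    names_to  = "quiz",
    values_to = "score"
  )

# Hypothetical lookup table with one row per student
majors <- tibble(
  student = c("A", "B"),
  major   = c("Biology", "Statistics")
)

# left_join(): keep every row of scores_long and add the matching major
scores_joined <- scores_long |> 
  left_join(majors, by = "student")

# pivot_wider(): reverse the reshape, back to one row per student
scores_wide_again <- scores_long |> 
  pivot_wider(names_from = quiz, values_from = score)
```

Printing `scores_long`, `scores_joined`, and `scores_wide_again` after each step is a quick way to check that the shape of the data matches what you intended.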

Let’s review some basics, emphasizing some themes in Homework 3 feedback!

Along the way, pay special attention to formatting your code: code is communication.
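
To make "code is communication" concrete, the sketch below writes the same pipeline twice. The `trails` data is a hypothetical toy table invented for illustration. Both versions run and give the same result; only the second is easy for a reader to follow.

```r
library(tidyverse)

# Hypothetical toy data, invented for illustration only
trails <- tibble(
  name  = c("Loop", "Ridge", "Falls"),
  miles = c(2.5, 7.0, 4.2),
  hours = c(1.0, 4.0, 2.0)
)

# Harder to read: everything on one line, no spaces around operators
hard_to_read <- trails|>mutate(pace=hours/miles)|>filter(miles>3)|>arrange(desc(pace))

# Easier to read: one step per line, indented, with spaces around operators
easy_to_read <- trails |> 
  mutate(pace = hours / miles) |> 
  filter(miles > 3) |> 
  arrange(desc(pace))
```

The same idea applies to ggplot() code: put each layer on its own line after the `+`.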





EXAMPLE 1: Make a plot

Recall our data on hiking the “high peaks” in the Adirondack Mountains of northern New York state. This includes data on the hike’s highest elevation (feet), vertical ascent (feet), length (miles), time in hours that it takes to complete, and difficulty rating.

library(tidyverse)

hikes <- read.csv("https://mac-stat.github.io/data/high_peaks.csv")

Construct a plot that allows us to examine how vertical ascent varies from hike to hike.





EXAMPLE 2: What’s wrong?

Critique the following interpretation of the above plot:

“The typical ascent is around 3000 feet.”





EXAMPLE 3: Captions, axis labels, and titles

Critique the use of the axis labels, caption, and title here. Then make a better version.

ggplot(hikes, aes(x = ascent)) + 
  geom_density() + 
  labs(x = "the vertical ascent of a hike in feet",
       title = "Density plot of hike vertical ascent")

A density plot of the vertical ascent of a hike, in feet





EXAMPLE 4: Wrangling practice – one verb

# How many hikes are in the dataset?


# What's the maximum elevation among the hikes?


# How many hikes are there of each rating?


# What hikes have elevations above 5000 ft?





EXAMPLE 5: Wrangling practice – multiple verbs

# What's the average hike length for each rating category?


# What's the average length of *only* the easy hikes?


# What 6 hikes take the longest time to complete?


# What 6 hikes take the longest time per mile?

14.2 Solutions


EXAMPLE 1: Make a plot

# A boxplot or histogram could also work!
ggplot(hikes, aes(x = ascent)) + 
  geom_density()





EXAMPLE 2: What’s wrong?

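That interpretation only describes the typical value. It doesn’t say anything about the variability in ascent (how spread out the values are) or other important features, such as the shape of the distribution.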
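
One way to back up a fuller interpretation is to compute numerical summaries of both center and spread. This is a minimal sketch: `hikes_subset` holds only a few rows copied from the output printed elsewhere in these solutions, so its numbers do not describe the full dataset. Swap in `hikes` to get the real values.

```r
library(tidyverse)

# A few rows copied from the printed output in these solutions.
# This is NOT the full dataset; replace hikes_subset with hikes in practice.
hikes_subset <- tibble(
  peak   = c("Mt. Marcy", "Algonquin Peak", "Mt. Skylight", "Nye Mtn.", "Cascade Mtn."),
  ascent = c(3166, 2936, 4265, 1844, 1940)
)

# Summarize center (median) and spread (min, max, standard deviation)
ascent_summary <- hikes_subset |> 
  summarize(
    median_ascent = median(ascent),
    min_ascent    = min(ascent),
    max_ascent    = max(ascent),
    sd_ascent     = sd(ascent)
  )

ascent_summary
```

An interpretation can then report a typical value together with the range of observed values, alongside what the plot shows about shape.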





EXAMPLE 3: Captions, axis labels, and titles

The axis label is too long, and the caption and title are redundant.

# Better
ggplot(hikes, aes(x = ascent)) + 
  geom_density() + 
  labs(x = "vertical ascent (feet)")

A density plot of the vertical ascent of a hike, in feet





EXAMPLE 4: Wrangling practice – one verb

# How many hikes are in the dataset?
hikes |> 
  nrow()
[1] 46
# What's the maximum elevation among the hikes?
hikes |> 
  summarize(max(elevation))
  max(elevation)
1           5344
# How many hikes are there of each rating?
hikes |> 
  count(rating)
     rating  n
1 difficult  8
2      easy 11
3  moderate 27
# What hikes have elevations above 5000 ft?
hikes |> 
  filter(elevation > 5000)
             peak elevation difficulty ascent length time   rating
1     Mt. Marcy        5344          5   3166   14.8   10 moderate
2 Algonquin Peak       5114          5   2936    9.6    9 moderate
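
A small readability variation on the answers above, offered as a sketch rather than the expected solution: name the summary column inside summarize(), sort the counts, and keep only the columns needed to answer the question. `hikes_subset` contains just four rows copied from the printed output in these solutions, so its counts differ from the full data.

```r
library(tidyverse)

# Four rows copied from the printed output in these solutions.
# This is NOT the full dataset; replace hikes_subset with hikes in practice.
hikes_subset <- tibble(
  peak      = c("Mt. Marcy", "Algonquin Peak", "Giant Mtn.", "Nye Mtn."),
  elevation = c(5344, 5114, 4627, 3895),
  rating    = c("moderate", "moderate", "easy", "moderate")
)

# Name the summary column so the output reads max_elevation, not `max(elevation)`
max_elev <- hikes_subset |> 
  summarize(max_elevation = max(elevation))

# sort = TRUE lists the most common rating first
rating_counts <- hikes_subset |> 
  count(rating, sort = TRUE)

# Keep only the columns needed to answer "which hikes are above 5000 ft?"
high_hikes <- hikes_subset |> 
  filter(elevation > 5000) |> 
  select(peak, elevation)
```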





EXAMPLE 5: Wrangling practice – multiple verbs

# What's the average hike length for each rating category?
hikes |> 
  group_by(rating) |> 
  summarize(mean(length))
# A tibble: 3 × 2
  rating    `mean(length)`
  <chr>              <dbl>
1 difficult          17.0 
2 easy                9.05
3 moderate           12.7 
# What's the average length of *only* the easy hikes?
hikes |> 
  filter(rating == "easy") |> 
  summarize(mean(length))
  mean(length)
1     9.045455
# What 6 hikes take the longest time to complete?
hikes |> 
  arrange(desc(time)) |> 
  head()
            peak elevation difficulty ascent length time    rating
1    Mt. Emmons       4040          7   3490   18.0   18 difficult
2   Seward Mtn.       4361          7   3490   16.0   17 difficult
3 Mt. Donaldson       4140          7   3490   17.0   17 difficult
4  Mt. Skylight       4926          7   4265   17.9   15 difficult
5     Gray Peak       4840          7   4178   16.0   14 difficult
6  Mt. Redfield       4606          7   3225   17.5   14 difficult
# What 6 hikes take the longest time per mile?
hikes |> 
  mutate(time_per_mile = time / length) |> 
  arrange(desc(time_per_mile)) |> 
  head()
           peak elevation difficulty ascent length time    rating time_per_mile
1   Giant Mtn.       4627          4   3050    6.0  7.5      easy      1.250000
2     Nye Mtn.       3895          6   1844    7.5  8.5  moderate      1.133333
3  Street Mtn.       4166          6   2115    8.8  9.5  moderate      1.079545
4  Seward Mtn.       4361          7   3490   16.0 17.0 difficult      1.062500
5    South Dix       4060          6   3050   11.5 12.0  moderate      1.043478
6 Cascade Mtn.       4098          2   1940    4.8  5.0      easy      1.041667
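
As an alternative to arrange(desc(...)) followed by head(), dplyr's slice_max() picks the top rows in one step. This is a sketch on `hikes_subset`, five rows copied from the printed output above, not the full dataset. Note that slice_max() keeps ties by default and can then return more than n rows; `with_ties = FALSE` caps the result at exactly n.

```r
library(tidyverse)

# Five rows copied from the printed output in these solutions.
# This is NOT the full dataset; replace hikes_subset with hikes in practice.
hikes_subset <- tibble(
  peak   = c("Giant Mtn.", "Nye Mtn.", "Street Mtn.", "Seward Mtn.", "Mt. Marcy"),
  length = c(6.0, 7.5, 8.8, 16.0, 14.8),
  time   = c(7.5, 8.5, 9.5, 17.0, 10.0)
)

# The 3 hikes in this subset with the longest time per mile
slowest <- hikes_subset |> 
  mutate(time_per_mile = time / length) |> 
  slice_max(time_per_mile, n = 3, with_ties = FALSE) |> 
  select(peak, time_per_mile)

slowest
```

With the full data, `n = 6` would mirror the question asked in the exercise.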