Data Visualization
Clear Workspace, DON’T EDIT
Always start by clearing the workspace. This ensures that objects created in other files are not used here.
List Used Packages, EDIT
List all the packages that will be used in the chunk below.
Load Packages, DON’T EDIT
Install Missing
Any missing package will be installed automatically. This ensures smoother execution when the file is run by others.
Be aware that people may not like having packages installed on their machine automatically, because this might break some of their existing code.
# Downloading packages -------------------------------------------------------
- Downloading palmerpenguins from CRAN ... OK [2.9 Mb in 0.94s]
- Downloading ggthemes from CRAN ... OK [482.4 Kb in 0.19s]
Successfully downloaded 2 packages in 1.6 seconds.
The following package(s) will be installed:
- ggthemes [5.1.0]
- palmerpenguins [0.1.1]
These packages will be installed into "~/work/notebook/notebook/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu".
# Installing packages --------------------------------------------------------
- Installing palmerpenguins ... OK [built from source and cached in 1.1s]
- Installing ggthemes ... OK [built from source and cached in 4.6s]
Successfully installed 2 packages in 5.7 seconds.
Load
Load all packages
8.1 Introduction
This page introduces the plot creation vocabulary step by step.
8.2 Plot Creation Process
8.2.1 Load Dataset
The penguins dataset from the palmerpenguins package will be used for plotting. Typically, the package is loaded using the library() function as shown in the code chunk below. However, a better approach is the one outlined in ?sec-packages.
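A minimal sketch of the typical loading approach, assuming the palmerpenguins package is already installed:

```r
# Attach the package so that the penguins dataset is available by name.
library(palmerpenguins)

# Print the first few rows to confirm that the dataset is available.
head(penguins)
```

The dataset can also be reached without attaching the package by writing palmerpenguins::penguins.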
Explore the dataset
Explore the dataset differently
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
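The column-wise overview above has the shape of glimpse() output. The sketch below assumes the dplyr package is installed (an assumption, since it is not named on this page) and adds two base R alternatives:

```r
library(palmerpenguins)
library(dplyr)  # assumption: dplyr is installed; it provides glimpse()

# One line per variable: name, type, and the first few values.
glimpse(penguins)

# Base R alternatives that need no extra package.
str(penguins)      # compact structure of the data frame
summary(penguins)  # per-column summaries, including NA counts
```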
8.2.2 Load Plotting Package
The ggplot2 package will be used for plotting. The package is typically loaded using the library() function as shown in the code chunk below. However, a better approach is the one outlined in ?sec-packages.
8.2.3 Create ggplot object
Create an empty canvas by instantiating a ggplot object using the ggplot() function.
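A minimal sketch of this step, assuming ggplot2 is installed. The variable name p is only illustrative:

```r
library(ggplot2)

# No data and no mappings yet, so this draws an empty panel.
p <- ggplot()
p
```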
8.2.4 Link Dataset
Link the dataset with the instantiated ggplot object using the data parameter.
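As a sketch, assuming ggplot2 and palmerpenguins are installed (p is an illustrative name). The canvas is still empty because no variable has been mapped yet:

```r
library(ggplot2)
library(palmerpenguins)

# The plot now knows its dataset, but nothing is drawn yet.
p <- ggplot(data = penguins)
p
```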
8.2.5 Map Two Variables
Specify which variables in the dataset are mapped to the plot aesthetics (visual properties) by passing a call to the aes() function to the mapping argument.
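A sketch of this step, using the same two variables as the examples later on this page (p is an illustrative name). The axes now appear, but no observations are drawn yet:

```r
library(ggplot2)
library(palmerpenguins)

# Map flipper length to the x aesthetic and body mass to the y aesthetic.
p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
p
```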
8.2.6 Display Data
Specify how the data (observations) will be represented geometrically on the plot, eg, as bars, points, or lines. The functions starting with geom_ are used for this purpose. Each of these functions adds a layer of the selected geometric object to the plot.
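A sketch of this step that represents each observation as a point (p is an illustrative name). Rows with missing values are dropped with a warning:

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()  # adds a layer of points, one per observation
p
```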
8.2.7 Map Third Variable
Other variables in the dataset can be linked to further plot aesthetics (visual properties), such as color, by adding them to the aes() call passed to the mapping argument.
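A sketch of this step, assuming species is the third variable and color is the chosen aesthetic (p is an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  # species is mapped to color, so each species gets its own point color.
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()
p
```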
8.2.8 Display Three Trendlines
More geometric representations of the data can be specified using other functions starting with geom_, each of which adds another layer of the selected geometric object to the plot.
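A sketch of this step, assuming a linear fit (method = "lm") as in the later examples on this page. Because color is mapped in ggplot(), the smoothing layer is also split by species and draws three trendlines:

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")  # one fitted line per species
p
```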
8.2.9 Display One Trendline
The aesthetic mapping defined in the ggplot() function is global, meaning that all the geom_*() functions inherit it. However, an aesthetic mapping defined in a geom_*() function is local, ie, it is not shared with the other geom_*() functions.
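A sketch of the difference: moving the color mapping from ggplot() into geom_point() makes it local to the points, so geom_smooth() no longer sees species and fits a single trendline (p is an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)  # global: x and y
) +
  geom_point(mapping = aes(color = species)) +  # local: color only here
  geom_smooth(method = "lm")                    # inherits x and y only
p
```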
8.2.10 Map One Variable Twice
We can link the same variable to multiple plot aesthetics (visual properties) by naming it more than once in the aes() call passed to the mapping parameter.
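A sketch of this step, mapping species to both color and shape so that the groups remain distinguishable even without color (p is an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  # The same variable drives two aesthetics of the points.
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm")
p
```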
8.2.11 Fix Labels
The labs() function can be used to make the plot more accessible. The function is added to the plot with the + operator, and the following items can be set using the corresponding parameters:
- a title using the title parameter
- a subtitle, if necessary, using the subtitle parameter
- x-axis title using the x parameter
- y-axis title using the y parameter
- data-series label or legend title using the color and/or shape parameters
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = 'Palmer Three Species Penguins',
    subtitle = 'The flipper length has a moderately strong positive linear relationship with the body mass',
    x = 'Flipper length (mm)',
    y = 'Body mass (g)',
    shape = 'Species',
    color = 'Species'
  )
Other types of text can be added using other functions:
- x-axis label
- y-axis label
- data labels, if necessary
- annotations for interesting or important data, if any
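As one possible sketch for the last item, the annotate() function from ggplot2 places a free-standing piece of text on the plot. The coordinates and the wording of the label below are arbitrary choices made for illustration:

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  # Illustrative annotation; x, y, and label are assumptions for this sketch.
  annotate(
    "text",
    x = 185, y = 6000,
    label = "Longer flippers go with heavier penguins"
  )
p
```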
8.2.12 Ensure Color-blind Safe
Make the plot more color-blind safe by using the scale_color_colorblind() function from the ggthemes package, which replaces the default color scale of the plot.
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = 'Palmer Three Species Penguins',
    subtitle = 'The flipper length has a moderately strong positive linear relationship with the body mass',
    x = 'Flipper length (mm)',
    y = 'Body mass (g)',
    shape = 'Species',
    color = 'Species'
  ) +
  scale_color_colorblind()
8.2.13 Can Call Implicitly
The first one or two arguments of a function are so important that scientists should know them by heart. Hence, to save some typing, the names of these arguments are usually omitted and only the values passed to them are kept, ie, the arguments are matched by position instead of by name. The above call can therefore be written as follows, with the argument names data and mapping omitted.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = 'Palmer Three Species Penguins',
    subtitle = 'The flipper length has a moderately strong positive linear relationship with the body mass',
    x = 'Flipper length (mm)',
    y = 'Body mass (g)',
    shape = 'Species',
    color = 'Species'
  ) +
  scale_color_colorblind()
8.2.14 Use Pipe Operator
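The pipe operator |> (RStudio shortcut: Ctrl+Shift+M, which inserts |> when the native pipe option is enabled) can be used to make the code tidy. The above code can be rewritten as follows; notice that the dataset now comes before the call to the ggplot() function and is piped into it as its first argument.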
penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = 'Palmer Three Species Penguins',
    subtitle = 'The flipper length has a moderately strong positive linear relationship with the body mass',
    x = 'Flipper length (mm)',
    y = 'Body mass (g)',
    shape = 'Species',
    color = 'Species'
  ) +
  scale_color_colorblind()
8.3 Visualizing Distribution
8.3.1 Categorical Variables
Plot options to visualize how a categorical variable is distributed:
- bar chart, if the counts are not yet computed, using the geom_bar() function
- column chart, if the counts are already computed, using the geom_col() function
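The two options can be sketched side by side. The names p_bar, p_col, and species_counts, and the count column n, are illustrative. Both plots should show the same bar heights:

```r
library(ggplot2)
library(palmerpenguins)

# Raw observations: geom_bar() counts the rows in each category itself.
p_bar <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Pre-computed counts: build a small summary table first ...
species_counts <- as.data.frame(
  table(species = penguins$species),
  responseName = "n"
)

# ... then geom_col() takes the bar heights directly from the n column.
p_col <- ggplot(species_counts, aes(x = species, y = n)) +
  geom_col()

p_bar
p_col
```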
8.3.2 Numerical Variables
Plot options to visualize how a numerical (discrete or continuous) variable is distributed:
- histogram, using the geom_histogram() function
The bin width of the histogram is in the units of the variable mapped to the x (or y) aesthetic (visual property).
- density plot, using the geom_density() function
- boxplot, using the geom_boxplot() function
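The three options can be sketched for body mass as follows. The bin width of 200 is an arbitrary choice for illustration and, because body mass is measured in grams, it means bins that are 200 g wide:

```r
library(ggplot2)
library(palmerpenguins)

# Histogram: binwidth is expressed in the units of body_mass_g (grams).
p_hist <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

# Density plot: a smoothed version of the histogram.
p_density <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()

# Boxplot: a five-number style summary plus individually drawn outliers.
p_box <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_boxplot()

p_hist
p_density
p_box
```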
As described beautifully in R4DS, a boxplot consists of:
- A box that describes the range of the middle half of the data, a distance known as the interquartile range (IQR), stretching from the 25th percentile of the distribution to the 75th percentile.
- A line in the middle of the box displaying the median, ie, the 50th percentile, of the distribution.
The box and the line give a sense of the spread of the distribution and whether the distribution is symmetric about the median or skewed to one side.
- Visual points that display the observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points (hence called outliers) are unusual, so they are plotted individually.
- A whisker that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
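The components listed above can be computed by hand in base R. This is a sketch using body mass, with illustrative variable names, and it relies on the default quantile definition of quantile():

```r
library(palmerpenguins)

# Drop missing values before computing the summary.
mass <- penguins$body_mass_g[!is.na(penguins$body_mass_g)]

# The box: 25th and 75th percentiles, with the median in between.
q1  <- quantile(mass, 0.25, names = FALSE)
med <- median(mass)
q3  <- quantile(mass, 0.75, names = FALSE)
iqr <- q3 - q1

# Observations beyond 1.5 times the IQR from either edge of the box are outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- mass[mass < lower_fence | mass > upper_fence]

# Each whisker ends at the farthest observation that is not an outlier.
whisker_low  <- min(mass[mass >= lower_fence])
whisker_high <- max(mass[mass <= upper_fence])
```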
Below is the diagram from R4DS showing the above components and how the boxplot is created.

8.4 Visualizing Relationships
8.4.1 One Categorical + One Numerical
For each category of the categorical variable, we can use any of the plot options mentioned above for numerical variables.
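As a sketch, body mass can be compared across species with a boxplot per category or with overlaid density curves (the plot names are illustrative):

```r
library(ggplot2)
library(palmerpenguins)

# One boxplot per species, placed side by side on the x-axis.
p_box <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# One density curve per species, distinguished by color.
p_density <- ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density()

p_box
p_density
```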
8.4.2 Two Categoricals
Map one of the categorical variables to the x (or y) aesthetic (visual property) of geom_bar() so that each of its categories is placed along the x-axis (or the y-axis), and map the other categorical variable to the fill aesthetic (visual property) to show how its categories are distributed within each bar. The second variable can be shown as:
- pure counts (stacked bar chart), or
- percentages (percent stacked bar chart) by setting the position argument of geom_bar() to "fill".
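Both variants can be sketched with island on the x-axis and species as the fill variable. The choice of these two variables and the plot names are illustrative:

```r
library(ggplot2)
library(palmerpenguins)

# Stacked bar chart: bar height is the number of penguins on each island.
p_count <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

# Percent stacked bar chart: every bar is rescaled to a total height of 1.
p_percent <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

p_count
p_percent
```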
8.4.3 Two Numerical
Plot options to show the relationship between two numerical variables are:
- scatter plot using the geom_point() function
- trend line using the geom_smooth() function
- line graph using the geom_line() function if one of the variables is monotonic, eg, time or date.
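A sketch of the three options. For the line graph, the mean body mass per year is computed first so that the x variable is monotonic. The summary table yearly and the plot names are illustrative:

```r
library(ggplot2)
library(palmerpenguins)

# Scatter plot with a linear trend line on top.
p_scatter <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")

# Mean body mass per year; the formula interface drops rows with NA.
yearly <- aggregate(body_mass_g ~ year, data = penguins, FUN = mean)

# Line graph over the monotonic year variable.
p_line <- ggplot(yearly, aes(x = year, y = body_mass_g)) +
  geom_line()

p_scatter
p_line
```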
8.4.4 Three or More Variables
To visualize 3+ variables, we can either
- map variables to other aesthetics of the plot, eg, color, size, and shape
- split the plot into facets, ie, subplots that each display one subset of the data, based on a categorical variable using the facet_wrap() function, whose first argument is a formula created using ~ followed by a (categorical) variable name.
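A sketch that combines both approaches: species is mapped to color and shape, and the plot is split into one facet per island. The choice of island as the faceting variable is illustrative:

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)  # one subplot per island
p
```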