Learning Goals
The goal of this course is to build confidence in carrying out the entire data science pipeline, which consists of the following processes:
- formulating a research question,
- collecting/scraping data,
- wrangling,
- modeling (covered in STAT 155, STAT 253, STAT 4xx courses),
- visualization,
- presentation and communication
Below are general skills that will be targeted and the specific topics that will be covered.
General Skills
Data Communication
In written and oral formats:
- Explain and justify the data cleaning and analysis process and the resulting conclusions with clear, organized, logical, and compelling details, adapted to the background, values, and motivations of the audience and to the context in which the communication occurs.
Collaborative Learning
- Understand and demonstrate characteristics of effective collaboration (team roles, interpersonal communication, self-reflection, awareness of social dynamics, advocating for yourself and others).
- Develop a common purpose and agreement on goals.
- Be able to contribute questions or concerns in a respectful way.
- Share and contribute to the group’s learning in an equitable manner.
- Develop familiarity and comfort with collaboration tools such as Git and GitHub.
Course Topics
Specific learning objectives for our course topics are listed below. Use these to guide your synthesis of course material for specific topics. Note that the topics are listed in the order of the data science pipeline, not the order in which we will cover them in class.
Foundation
Intro to R, RStudio, and R Markdown
- Download and install the necessary tools (R, RStudio)
- Develop comfort in navigating the tools in RStudio
- Develop comfort in writing and rendering Quarto documents
- Identify the characteristics of tidy data
- Use R code as a calculator and to explore tidy data
Data Visualization
The learning goals may be adjusted before we start the material of this section.
Introduction to Data Visualization
- Convince ourselves of the importance of data viz.
- Understand the “grammar of graphics”
- Use ggplot2 functions to create data viz
- Understand the different basic univariate visualizations for categorical and quantitative variables
Bivariate
- Identify appropriate types of bivariate visualizations to visualize relationships between 2 variables, depending on the type of variables (categorical, quantitative)
- Create basic bivariate visualizations based on real data with ggplot2 functions
Multivariate
- Understand how to visualize relationships between more than 2 variables.
- Add aesthetics such as color and size to incorporate a third variable (or more) into a bivariate plot with ggplot2 functions
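As a rough illustration of the multivariate goal above, the sketch below starts from a bivariate scatterplot and maps two more variables to color and size. It uses R's built-in mtcars data, which is an assumption made for this example and not necessarily a data set used in the course.

```r
library(ggplot2)

# mtcars ships with R; it stands in here for whatever data the course uses
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # color and size each carry one extra variable on top of the x/y relationship
  geom_point(aes(color = factor(cyl), size = hp)) +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders",
    size = "Horsepower"
  )

p
```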
Spatial
- Plot data points on top of a map using ggplot()
- Create choropleth maps using geom_map()
- Understand the basics of creating a map using leaflet, including adding points and choropleths to a base map.
Effective Visualization
- Understand and apply the guiding principles of effective visualizations
Data Wrangling
Wrangling Verbs
- Understand and be able to use the following verbs appropriately: select, mutate, filter, arrange, summarize, group_by
- Develop an understanding of what code will do conceptually without running it
- Develop an understanding of working with dates and lubridate functions
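One possible way to see the six verbs and a couple of lubridate functions working together is the short pipeline below. The visits table and its column names are hypothetical, invented only for this sketch. Try predicting the result of each step before running it.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data, invented for illustration only
visits <- tibble(
  clinic = c("A", "A", "B", "B", "B"),
  date   = c("2024-01-15", "2024-02-03", "2024-01-20", "2024-02-11", "2024-02-25"),
  cost   = c(120, 80, 200, 150, 50),
  notes  = c("ok", "ok", "late", "ok", "ok")
)

monthly <- visits %>%
  select(clinic, date, cost) %>%            # keep only the columns we need
  mutate(
    date = ymd(date),                       # parse text into Date values
    visit_month = month(date)               # extract the month number
  ) %>%
  filter(cost >= 80) %>%                    # drop rows with cost below 80
  group_by(clinic, visit_month) %>%         # define the groups to summarize
  summarize(total_cost = sum(cost), .groups = "drop") %>%
  arrange(desc(total_cost))                 # sort from largest to smallest total

monthly
```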
Reshaping Data
- Understand the difference between wide and long data formats and identify the cases (units of observation) for a given data set
- Be able to use pivot_wider and pivot_longer from the tidyr package
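A minimal sketch of the two reshaping functions, assuming a small made-up table in which each row is a country and each year has its own column. In the long version, the unit of observation becomes one country in one year.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per country, one column per year
wide <- tibble(
  country = c("X", "Y"),
  `2020`  = c(10, 20),
  `2021`  = c(11, 21)
)

# Wide to long: one row per country-year combination
long <- wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "value")

# Long back to wide: one column per distinct year
back <- long %>%
  pivot_wider(names_from = year, values_from = value)

long
back
```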
Joining Data
- Understand the concept of variables that uniquely identify rows (aka cases or units of observation)
- Understand the different types of joins, i.e., ways of combining two data frames
- Be able to use mutating joins: left_join, inner_join, and full_join from the dplyr package
- Be able to use filtering joins: semi_join and anti_join from the dplyr package
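The difference between the join types is easiest to see on two tiny tables that only partly overlap. The students and grades tables below are hypothetical, and id is assumed to uniquely identify the rows of each.

```r
library(dplyr)

# Hypothetical tables; id 1 has no grade and id 4 has no student record
students <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
grades   <- tibble(id = c(2, 3, 4), grade = c("A", "B", "C"))

# Mutating joins add columns from the second table
left  <- left_join(students, grades, by = "id")   # all students; grade is NA for id 1
inner <- inner_join(students, grades, by = "id")  # only ids present in both tables
full  <- full_join(students, grades, by = "id")   # every id from either table

# Filtering joins keep or drop rows of the first table without adding columns
semi <- semi_join(students, grades, by = "id")    # students that have a grade
anti <- anti_join(students, grades, by = "id")    # students without a grade
```

Comparing the row counts of the five results is a quick check that a join did what you expected.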
Working with Character Data as Factors
- Understand the difference between a variable stored as a character vs. a factor
- Be able to convert a character variable to a factor
- Be able to manipulate the order and values of a factor with the forcats package to improve summaries and visualizations.
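A small sketch of these factor skills, using a made-up vector of size labels. By default, factor() orders levels alphabetically, which is rarely the order you want in a table or plot.

```r
library(forcats)

# Hypothetical character data
sizes <- c("small", "large", "medium", "small", "large", "small")

# Convert character to factor; the default levels are alphabetical
sizes_fct <- factor(sizes)
levels(sizes_fct)

# Put the levels in a meaningful order
ordered_fct <- fct_relevel(sizes_fct, "small", "medium", "large")

# Order the levels by how often each value occurs
by_count <- fct_infreq(sizes_fct)

# Change the level labels (new name = "old name")
recoded <- fct_recode(sizes_fct, S = "small", M = "medium", L = "large")
```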
Working with Character Data as Strings
- Be able to work with strings of text data
- Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package.
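Each of these string tasks maps onto a stringr function. The sketch below applies them to a few made-up course labels; the strings and patterns are illustrative assumptions only.

```r
library(stringr)

# Hypothetical strings
courses <- c("STAT 155: Intro", "STAT 253: ML", "COMP 123")

detected  <- str_detect(courses, "^STAT")                # does each string start with STAT?
extracted <- str_extract(courses, "\\d{3}")              # first run of three digits
replaced  <- str_replace(courses, "STAT", "Statistics")  # search and replace
located   <- str_locate(courses, "\\d+")                 # start and end position of the digits
parts     <- str_split_fixed(courses, " ", n = 2)        # separate at the first space
```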
Starting a Data Project
The learning goals may be adjusted before we start the material of this section.
Data Import
- Find existing data sets
- Save data sets locally
- Load data into RStudio
- Do some preliminary data checking and cleaning steps before further wrangling and visualization:
  - Make sure variables are properly formatted
  - Deal with missing values
EDA
- Understand the first steps that should be taken when you encounter a new data set
- Develop comfort in knowing how to explore data to understand it
- Develop comfort in formulating research questions