# A tibble: 5 × 6
# Groups:   species, island [5]
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie    Biscoe              37.8          18.3               174        3400
2 Adelie    Dream               39.5          16.7               178        3250
3 Adelie    Torgersen           39.1          18.7               181        3750
4 Chinstrap Dream               46.5          17.9               192        3500
5 Gentoo    Biscoe              46.1          13.2               211        4500
1 Welcome
- Sit in groups of 3-4 students that include:
  - at least two students you don’t know well
  - at least one student who has taken STAT 155
- Introduce yourself to your group and be ready to introduce each other to the class. Here are some suggestions for the introductions:
  - names, pronunciation tips, and pronouns
  - aspects of who you are and have been, eg, geographical identity, cultural identity, hobbies/passions
  - aspects of who you would like to be, eg, personal, professional, and/or academic goals
  - how you are feeling about the new semester
  - a high point of the break
1.1 Background
Data
The word data often brings spreadsheets to mind, like the penguin data shown at the top of this page.
This data is called tidy because:
- each row = a unit of observation (a penguin in this data).
- each column = a measure on some variable of interest: quantitative (numbers with units) or categorical (discrete possibilities or categories).
- each entry contains a single data value, ie, no analysis, summaries, footnotes, comments, etc. Only one value per cell.
Considering the above data, discuss the following question in your group and be ready to share your answers with the class when called.
- How many observations are there?
- How many measures are there? What are they?
- What is the bill length in mm of the Adelie penguin living on Dream island?
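Once a table like this is in R, the same questions can be checked with code. The sketch below is a hand-typed copy of the five rows shown above. The object names `penguins_sample` and `adelie_dream` are made up for illustration, and plain text and number columns stand in for the factor and integer columns of the original. Try answering the questions yourself before running it.

```r
# Hand-typed copy of the five rows shown above (illustrative object name)
penguins_sample <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo"),
  island = c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"),
  bill_length_mm = c(37.8, 39.5, 39.1, 46.5, 46.1),
  bill_depth_mm = c(18.3, 16.7, 18.7, 17.9, 13.2),
  flipper_length_mm = c(174, 178, 181, 192, 211),
  body_mass_g = c(3400, 3250, 3750, 3500, 4500)
)

# Number of observations (rows) and measures (columns)
nrow(penguins_sample)
ncol(penguins_sample)
names(penguins_sample)

# Bill length of the Adelie penguin on Dream island
adelie_dream <- penguins_sample$species == "Adelie" &
  penguins_sample$island == "Dream"
penguins_sample$bill_length_mm[adelie_dream]
```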
However, the definition of data is much more expansive, like this one from Google:
The following are data, too!
- emails in your inbox which contain text and more
- social media posts which contain text and more
- images
- videos
- audio files
For each data example above, discuss the following questions in your group and be ready to share your answers with the class when called. We will do the first one together.
- How can the data be converted into a tidy table?
- What are the units of observation, ie, what do the rows represent?
- What are four of the possible variables we might track for each observation, ie, what goes into the columns?
- Indicate what people, groups, organizations, etc might use data like this.
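As one possible answer for the first example, emails can be laid out as a tidy table with one row per email. Everything in the sketch below (the object name `emails`, the column names, and all values) is invented for illustration; many other choices of variables would work just as well.

```r
# One row per email (the unit of observation); all values are invented
emails <- data.frame(
  sender = c("prof@example.edu", "friend@example.com", "store@example.com"),
  date_received = as.Date(c("2024-01-18", "2024-01-19", "2024-01-19")),
  word_count = c(120, 35, 410),
  has_attachment = c(TRUE, FALSE, FALSE)
)

emails
nrow(emails)   # number of emails (observations)
names(emails)  # the variables tracked for each email
```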
Data Science
Data Science extracts knowledge from data within a particular domain of inquiry, and particular contexts. Below are examples from just within Macalester:
- Robin Shields-Cutler (Biology) uses big data to study microbiomes.
- Xavier Haro-Carrión (Geography) uses satellite & remote sensing data to study biodiversity conservation.
- Lisa Mueller (Political Science) uses data to analyze political protest outcomes.
- John Kim + Futures North (Media & Cultural Studies) use data-driven art to explore & illustrate migration patterns.
- Bethany Miller + Macalester’s Institutional Research team use data to better understand and shape everything from student outcomes to peer school comparisons.
- Macalester Athletics uses data to study everything from training outcomes to team performance to sleep patterns.
Data Science Workflow
Though the examples above vary dramatically in their domain, context, and methodology, they share the general data science workflow shown below. We will touch on each of these elements this semester:
Step | In 112 | Beyond 112 (or as part of your project) |
---|---|---|
data collection | the basics of getting data into RStudio | web scraping, databases, APIs |
data preparation | essential wrangling skills | advanced wrangling skills, natural language processing |
data visualization | essential univariate, multivariate, and mapping viz | interactive viz, animations |
data analysis | exploratory data analysis | prediction, statistical modeling, machine learning, AI |
data storytelling | yes! | yes! |
Software
Working with modern data, and hence doing Data Science, requires statistical software; calculators and spreadsheets don’t cut it. We’ll exclusively use R and RStudio. Below are illustrative pictures of both.
Why R?
- it’s free
- it’s open source (the code is free & anybody can contribute to it)
- it has a huge online community (which is helpful for when you get stuck)
- it’s an industry standard (along with Python)
- it can be used to create reproducible and lovely documents (including this online book!)
- Fun fact: RStudio was started by Mac alum JJ Allaire and beta-tested at Mac!
1.2 Exercises
The goals of these exercises are to:
- Familiarize yourself with the RStudio layout.
- Play around in the RStudio console to gain familiarity with the basic structure of R code.
Be Kind to Yourself
We will all make so many mistakes in RStudio! That’s part of learning any new language. In fact, mistakes are important to learning any new language.
Collaborate
We are and will be sitting in groups for a reason. Collaboration improves higher-level thinking, confidence, communication, community, & more. You are expected to:
- Actively contribute to discussion. Don’t work on your own.
- Actively include all other group members in discussion.
- Create a space where others feel comfortable making mistakes & sharing ideas.
- Stay in sync while respecting that everybody has different learning strategies, work styles, note taking strategies, etc. If some people are working on exercise 10 and others on exercise 2, that’s not a good collaboration.
- Don’t rush. You won’t hand anything in and can finish up outside of class.
Grow
This 100-level course assumes you have NO R experience, but welcomes all. Growth is expected of every student.
- If you are new to R, I hope you leave class today simply feeling positive about opening RStudio.
- If you are familiar with R, I hope you think more deeply about concepts you might have taken for granted in the past and support those new to R in your group. Explaining ideas to others deepens your own understanding and retention of these ideas.
Ask Questions
We will not discuss these exercises as a class. Your group should ask me questions as I walk around the room.
Exercise 1: Open RStudio
- If you already downloaded and installed a desktop version of RStudio on your laptop, open that.
- Otherwise, log on to Mac’s RStudio server using your usual Mac credentials: https://rstudio.macalester.edu/. For the purposes of everybody being in the same place, use this for now. You’ll be prompted to install the software later.
Notice that there are four panes, each serving a different purpose. Today, we’ll work solely within the console and will not save any work.

Exercise 2: Use R as a calculator
We can do simple calculations in RStudio! Type the following lines in the console, one by one. After each line, hit your Return/Enter button and simply take note of what you get. In some cases you might even get an error! This error is important to learning how R code does and doesn’t work.
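The exact lines from the original activity are not reproduced here. The following are illustrative stand-ins (assumptions, not necessarily the course's own examples) that show the basic arithmetic operators:

```r
4 + 2          # addition
4 - 2          # subtraction
4 * 2          # multiplication
4 / 2          # division
4^2            # exponent
(4 + 2) * 3    # parentheses control the order of operations

# Remove the leading "# " from the next line to see an error:
# R needs * for multiplication, so 4(2) is not valid
# 4(2)
```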
Make sure you’re still in sync with your group.
Exercise 3: Functions and arguments
Having a calculator is nice, but we’ll typically use built-in functions to perform common (repetitive) and specific tasks. These functions have names and require information about arguments in order to run:
function(argument)
Try out the following functions in your console. Note each function’s name, the argument or information it needs to run, and what output it produces (i.e. what the function does):
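The original function calls are not reproduced here. As illustrative stand-ins (assumptions, chosen because Exercise 5 later asks you to count letters), you might try a square root and a character count:

```r
sqrt(9)        # name: sqrt; argument: a number; output: its square root
sqrt(25)
nchar("snow")  # name: nchar; argument: text in quotes; output: number of characters
```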
Some functions require more than 1 argument, separated by commas. To keep these straight, we often specify the arguments by name:
function(argument1 = ___, argument2 = ___)
Try out the following functions in your console, one by one. Note each function’s name, the arguments it needs to run, when it’s necessary to specify these arguments by name, and what output it produces.
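The calls below are reconstructed from the outputs and comments in the Solutions section, so treat them as a best guess at the original lines rather than an exact copy:

```r
rep(x = 2, times = 5)
rep(times = 5, x = 2)   # named arguments, different order
rep(2, 5)               # unnamed arguments
rep(5, 2)               # unnamed arguments, different order

seq(from = 2, to = 10, by = 2)
seq(2, 10, 2)
seq(from = 2, to = 10, length = 3)
seq(2, 10, 3)
```

Compare each pair of outputs: when do two calls give the same result, and when do they differ?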
Finally, note that R is case sensitive. Try the following code which uses Seq() instead of seq(). Take time to read the error message. You will experience this type of error message a lot! It will happen any time you misspell a function (among other reasons we’ll experience later).
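A minimal sketch of the case-sensitivity error, with argument values borrowed from the seq() examples in the Solutions section. The call is wrapped in try() only so that the whole block can run from top to bottom; in the console you can type the bare Seq(...) call and read the error directly.

```r
seq(from = 2, to = 10, by = 2)        # works: seq is a built-in function

# Errors: R cannot find a function named Seq (capital S)
try(Seq(from = 2, to = 10, by = 2))
```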
Exercise 4: Grammar
We’ll learn lots and lots of functions this semester. Nobody has every function memorized. That said, it does help to connect function names with their purpose. Do that for each function you used above.
Make sure you’re still in sync with your group.
Exercise 5: Your turn
Use the functions you learned above to do the following:
- Count the number of letters in “data”.
- Create the sequence 3, 6, 9, 12. You might do this 2 ways.
- Create a sequence of 4 numbers that start at 1 and end at 10. You might do this 2 ways.
- Repeat the number “5” 8 times.
- CHALLENGE: Combine 2 functions to produce the sequence 3, 6, 9, 12, 3, 6, 9, 12
Exercise 6: Save it for later
For reasons that will quickly become clear, we’ll often want to store some R output for later using the assignment operator as follows:
name <- output
In the above assignment statement:
- name is the name under which to store a result
- output is the result we wish to store
- <- is the assignment operator; you can think of this as an arrow pointing the output into the name.
Try out each line one at a time. Some lines won’t show any output. Why?
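The original lines are not reproduced here. The sketch below is an assumption built from the Solutions section, which mentions an object named degrees_c and stores the value -13:

```r
degrees_c <- -13   # stores the value under the name degrees_c
degrees_c          # typing the name prints what is stored
-13                # prints a value, but stores nothing
```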
Let’s now use what you stored! Again, do this one by one.
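As an illustration (the conversion and the name degrees_f are assumptions, not taken from the original activity), a stored Celsius temperature can be reused in a calculation:

```r
degrees_c <- -13                        # stored again so this block runs on its own

degrees_c * (9 / 5) + 32                # use the stored value: convert to Fahrenheit
degrees_f <- degrees_c * (9 / 5) + 32   # store the result under a new name
degrees_f
```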
Finally, try to print degrees_tomorrow. Take time to read the error message. You will experience this type of error message a lot! It will happen when you either haven’t yet defined the object you’re trying to use, or you’ve misspelled its name (among other reasons we’ll experience later).
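A minimal sketch of this error, assuming nothing named degrees_tomorrow has been stored. The try() wrapper is only there so the block can run from top to bottom; in the console, type the bare name instead.

```r
degrees_c <- -13
degrees_c               # works: degrees_c has been defined

# Errors: no object named degrees_tomorrow has been defined
try(degrees_tomorrow)
```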
Make sure you’re still in sync with your group.
Exercise 7: Practice
- Name and store your current age in years.
- Confirm that your age is stored correctly by typing the name and pressing Return/Enter.
- Use your stored age to calculate how old you’ll be in 17 years.
Exercise 8: Code = communication
It’s important to recognize from day 1 that code is a form of communication, both to yourself and others. Code structure and details are important to readability and clarity just as grammar, punctuation, spelling, paragraphs, and line spacing are important in written essays. All of the code below works, but has bad structure. With your group, discuss what is unfortunate about each line.
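The lines to discuss, reconstructed from the Solutions section with the commentary removed:

```r
seq(from=1,to=9,by=2)

seq(from = 1, to=9,by=2)

my_output <- -13

thisisthetemperaturetodayincelsius <- -13

this_is_the_temperature_today_in_celsius <- -13
```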
When naming your variables:
- use meaningful names
- make them short if possible
- split up multiple-word names using snake_case or camelCase
It’s also impossible, not just ill-advised, to start names with numbers or symbols, or to use certain symbols in names. Try the following:
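The original lines are not reproduced here; the names below are made up for illustration. The invalid names are commented out so that the block runs as a whole. Remove the leading "# " from one of them at a time to see what R does.

```r
temp_today <- -13     # snake_case: fine
tempToday <- -13      # camelCase: fine
temp2 <- -13          # numbers are fine after the first character

# Each of these fails:
# 2temp <- -13        # starts with a number
# _temp <- -13        # starts with a symbol
# temp-today <- -13   # contains a symbol that R reads as subtraction
```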
Exercise 9: You will make so many mistakes!
Mistakes are common when learning any new language. You’ll get better and better at interpreting error messages, finding help, and fixing errors. These are all important skills in computer programming in general.
With your cursor at the next prompt in the console (>), press the up arrow multiple times. What does this do?! This shortcut will be very handy when you make mistakes and want to modify your code without having to start over.
You’ll often forget how functions are used. Luckily, there’s typically built-in documentation for built-in functions that can be invoked using the ? operator.
Let’s practice:
- In the console, type ?rep and press Return/Enter.
- Check out the documentation file that pops up in the Help tab (lower right).
- Quickly scroll through, noting the type of information provided.
- Stop at the “Examples” at the bottom. Perhaps the most useful section, this is where a function’s functionality is demonstrated! Try out a couple of the provided examples in your console.
Exercise 10: History and environment
Finally, let’s leave the console.
Check out the “Environment” tab in the top right pane of RStudio. What do you observe there and when might this be helpful?
Similarly, check out the “History” tab in the top right pane of RStudio. What do you observe there and when might this be helpful?
Optional Exercise: Watch Explanation
If you’ve finished the above exercises, you can watch Dr. Alicia Johnson talk through the concepts learned today (YouTube).
1.3 Wrapping Up
- If you didn’t finish the activity, make sure to complete it outside of class
- Review the solutions at the bottom of the page
- Check your understanding of the content by solving checkpoint 1 in Moodle
1.4 Solutions
Exercise 2: Use R as a calculator
Exercise 3: Functions and arguments
[1] 3
[1] 5
[1] 3
[1] 10
# rep repeats the value "x" the number of "times" indicated
rep(x = 2, times = 5)
[1] 2 2 2 2 2
# When the arguments are labeled, their order doesn't matter
rep(times = 5, x = 2)
[1] 2 2 2 2 2
# We don't need to label the arguments
rep(2, 5)
[1] 2 2 2 2 2
# But then the order matters! R assumes an order of "x" then "times"
rep(5, 2)
[1] 5 5
# Create a sequence of numbers
seq(from = 2, to = 10, by = 2)
[1] 2 4 6 8 10
# Removing the argument labels gives the same result
seq(2, 10, 2)
[1] 2 4 6 8 10
# We can also define a sequence by its length, not increments
seq(from = 2, to = 10, length = 3)
[1] 2 6 10
# But here we can't remove the argument labels (R assumes the 3rd argument is "by")
seq(2, 10, 3)
[1] 2 5 8
Exercise 4: Grammar
Exercise 5: Your turn
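One possible set of answers, sketched with the functions from Exercise 3 (other approaches work too):

```r
# Count the number of letters in "data"
nchar("data")

# The sequence 3, 6, 9, 12, two ways
seq(from = 3, to = 12, by = 3)
seq(from = 3, to = 12, length = 4)

# 4 numbers that start at 1 and end at 10, two ways
seq(from = 1, to = 10, by = 3)
seq(from = 1, to = 10, length = 4)

# Repeat the number 5 eight times
rep(x = 5, times = 8)

# CHALLENGE: put a seq() call inside a rep() call
rep(x = seq(from = 3, to = 12, by = 3), times = 2)
```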
Exercise 6: Save it for later
Exercise 7: Practice
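A possible answer, using a made-up age of 19 and the illustrative name my_age (substitute your own):

```r
my_age <- 19    # store your age (hypothetical value)
my_age          # confirm it is stored correctly
my_age + 17     # your age in 17 years
```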
Exercise 8: Code = communication
# This is too smooshy and hard to read
seq(from=1,to=9,by=2)
# The use of spacing is inconsistent, hence hard to read
seq(from = 1, to=9,by=2)
# Too vague
my_output <- -13
# Too smooshy
thisisthetemperaturetodayincelsius <- -13
# Easier to read, but too long
this_is_the_temperature_today_in_celsius <- -13
Exercise 10: History and environment
- Environment: shows what objects you’ve stored (eg: degrees_c)
- History: shows what R code you’ve typed