Lab 4: Data Wrangling I
Package(s)
Schedule
- 08.00 - 08.15: Recap of Lab 3
- 08.15 - 08.30: Assignment 2 walk-through
- 08.30 - 09.00: Lecture
- 09.00 - 09.15: Break
- 09.00 - 12.00: Exercises
Learning Materials
Please prepare the following materials:
- R4DS2e book: Chapter 4, chapter 5, chapter 8, chapter 9
- Web: What is data wrangling? Intro, Motivation, Outline, Setup – Pt. 1 Data Wrangling Introduction
- Web: (NB! STOP at 7:45, i.e. skip tidyr) Tidy Data and tidyr – Pt 2 Intro to Data Wrangling with R and the Tidyverse
- Web: Data Manipulation Tools: dplyr – Pt 3 Intro to the Grammar of Data Manipulation with R
Learning Objectives
A student who has met the objectives of the session will be able to:
- Understand and apply the 6 basic dplyr verbs:
filter()
,arrange()
,select()
,mutate()
,summarise()
andgroup_by()
- Construct and apply logical statements in context with
dplyr
pipelines - Understand and apply the additional verbs
count()
,drop_na()
,View()
- Combine dplyr verbs to form a data manipulation pipeline using the pipe
|>
operator - Decipher the components and functions hereof in a
dplyr
pipeline
Exercises
Important: As we progress, avoid using black-box do-everything-with-one-command R-packages like e.g. ggpubr
- I want you to learn the underlying technical bio data science! …and why is that? Because if you use these type of packages, you will be limited to their functionality, whereas if you truly understand ggplot - The sky is the limit! Basically, it the cliché that “If you give a man a fish, you feed him for a day. If you teach a man to fish, you feed him for a lifetime”
Getting Started
Use this link to go to the R for Bio Data Science RStudio Cloud Server
First things first:
- Create a new Quarto Document for todays exercises, e.g.
lab04_exercises.qmd
Note that: - The Quarto document MUST be placed together with your .Rproj
file (defining, the project root - look in your Files
-pane) and also there, the data
-folder should be placed!
Understanding how paths and projects work is important. In case this is no entirely clear, a new chapter has been added on Paths and Projects.
Brief refresh of Quarto so far
Recall the syntax for a new code chunk, where all your R
code goes any text and notes must be outside the chunk tags:
```{r}
# Here the code goes
1 + 1
<- c(1, 2, 3)
x mean(x)
```
[1] 2
[1] 2
Outside the code-chunks, go our markdown, e.g.:
# Header level 1
## Header level 2
### Header level 3
*A text in italics*
**A text in bold**
Normal text describing and explaining
Now, in your new Quarto document, create a new Header level 2
, e.g.:
## Load Libraries
and under your new header, add a new code chunk, like so
```{r}
#| message: false
library("tidyverse")
```
And run the chunk. This will load our data science toolbox, including dplyr
(and ggplot
). But wait, what does the #| message: false
-part do? 🤷️ 🤔
Try to structure your Rmarkdown document as your course notes, i.e. add notes while solving the exercises aiming at creating a revisitable reference for your final project work
Bonus info: We are engineers, so of course, we love equations, we can include standard \(\LaTeX\) syntax, e.g.:
$E(x) = \frac{1}{n} \cdot \sum_{i=1}^{n} x_{i}$
Try it!
A few handy short cuts
Insert new code chunk:
- Mac: CTRL + OPTION + i
- Windows: CTRL + OPTION + i
Render my Quarto document
- Mac: CMD + SHIFT + k
- Win: CTRL + SHIFT + k
Run line in chunk
- Mac: CMD + ENTER
- Win: CTRL + ENTER
Run entire chunk
- Mac: CMD + SHIFT + ENTER
- Win: CTRL + SHIFT + ENTER
Insert the pipe symbol |>
- Mac: CMD + SHIFT + m
- Win: CTRL + SHIFT + m
Note, if you’re trying this out and you see %>%
instead of |>
, then go to Tools
, Global Options...
, Code
and check the box Use native pipeoperator
A few initial questions
First, if you don’t feel completely comfortable with the group_by |> summarise
-workflow, then no worries - Feel free to visit this short R Tutorial: Grouping and summarizing
Then, in your groups, discuss the following primer questions. Note, when asked for “what is the output”, do not run the code in the console, instead try to talk and think about it and write your answers and notes in your Quarto document for the day:
First, in a new chunk, run tibble(x = c(4, 3, 5, 1, 2))
, so you understand what it does, then - Discuss in your group, what is the output of, remember first talk, then check understanding by running code:
Q1:
tibble(x = c(4, 3, 5, 1, 2)) |> filter(x > 2)
?Q2:
tibble(x = c(4, 3, 5, 1, 2)) |> arrange(x)
?Q3:
tibble(x = c(4, 3, 5, 1, 2)) |> arrange(desc(x))
?Q4:
tibble(x = c(4, 3, 5, 1, 2)) |> arrange(desc(desc(x)))
?Q5:
tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(x)
?Q6:
tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(y)
?Q7:
tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(-x)
?Q8:
tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(-x, -y)
?Q9:
tibble(x = c(4, 3, 5, 1, 2)) |> mutate(x_dbl = 2*x)
?Q10:
tibble(x = c(4, 3, 5, 1, 2)) |> mutate(x_dbl = 2 * x, x_qdr = 2*x_dbl)
?Q11:
tibble(x = c(4, 3, 5, 1, 2)) |> summarise(x_mu = mean(x))
?Q12:
tibble(x = c(4, 3, 5, 1, 2)) |> summarise(x_max = max(x))
?Q13:
tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, NA, 5, 1, 2)) |> group_by(lbl) |> summarise(x_mu = mean(x), x_max = max(x))
?Q14:
tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) |> group_by(lbl) |> summarise(n = n())
?Q15:
tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) |> count(lbl)
?
In the following, return to these questions and your answers for reference on the dplyr
verbs!
Load data
Again, add a new header to your Quarto document, e.g. ## Load Data
, then:
- Go to the Vanderbilt Biostatistics Datasets site
- Find Diabetes data and download the
diabetes.csv
file - You should have a
data
-folder, if not, then in theFiles
pane, click theNew Folder
button, enter folder namedata
and clickok
- Now, click on the folder you created
- Click the
Upload
-button and navigate to thediabetes.csv
file you downloaded - Clicking the two dots
..
above the file you uploaded, look for..
, will take you one level up in your project path - Insert a new code chunk in your Quarto document
- Add and then run the following code
<- read_csv(file = "data/diabetes.csv")
diabetes_data diabetes_data
Then realise that we could simply have run the following code to do the exact same thing (Yes, readr
is pretty nifty):
# Create the data directory programmatically
dir_create(x = "data")
# Retrieve the data directly
<- read_csv(file = "https://hbiostat.org/data/repo/diabetes.csv")
diabetes_data
# Write the data to disk
write_csv(x = diabetes_data,
file = "data/diabetes.csv")
Just remember the echo/eval trick from last session to avoid retrieving online data each time you render your Quarto document
Work with the diabetes data set
Use the pipe |>
to use the View()
-function to inspect the data set. Note, if you click the -button, you will get a spreadsheet-like view of the data, allowing you to get an overview.
- Q1: How many observations and how many variables?
- Q2: Is this a tidy data set? Which three rules must be satisfied?
- Q3: When you run the chunk, then underneath each column name is stated
<chr>
and<dbl>
what is that?
Before we continue
- T1: Change the
height
,weight
,waist
andhip
from inches/pounds to the metric system (cm/kg), rounding to 1 decimal
Let us try to take a closer look at the data by various subsetting (How many… is equal to the number of rows in the subset of the data you created):
Q4: How many weigh less than 100kg?
Q5: How many weigh more than 100kg?
Q6: How many weigh more than 100kg and are less than 1.6m tall?
Q7: How many women are taller than 1.8m?
Q8: How many men are taller than 1.8m?
Q9: How many women in Louisa are older than 30?
Q10: How many men in Buckingham are younger than 30 and taller than 1.9m?
T2: Make a scatter plot of
weight
versusheight
and colour by sex for inhabitants of Louisa above the age of 40T3: Make a boxplot of height versus location stratified on sex for people above the age of 50
Sorting columns can aid in getting an overview of variable ranges (don’t use the summary function yet for this one)
- Q11: How old is the youngest person?
- Q12: How old is the oldest person?
- Q13: Of all the 20-year olds, what is the height of the tallest?
- Q14: Of all the 20-year olds, what is the height of the shortest?
Choosing specific columns can be used to work with a subset of the data for a specific purpose
- Q15: How many columns (variables)
starts_with
a “b”? - Q16: How many columns (variables)
contains
the word “eight”?
Creating new variables is an integral part of data manipulation
- T4: Create a new variable, where you calculate the BMI
- T5: Create a
BMI_class
variable
Take a look at the following code snippet to get you started:
tibble(x = rnorm(10)) |>
mutate(trichotomised = case_when(
< -1 ~ "Less than -1",
x -1 <= x & x < 1 ~ "larger than or equal to -1 and smaller than 1",
1 <= x ~ "Larger than or equal to 1"))
and then go read about BMI classification here and discuss in your group how to extract classifications from the Definition/Introduction section
Note, the cut()
-function could be used here, but you should try to use case_when()
as illustrated in the example chunk above.
Once you have created the variable, you will need to convert it to a categorical variable, in R
, these are called a factor
and you can set the levels like so:
<- diabetes_data |>
diabetes_data mutate(BMI_class = factor(BMI_class,
levels = c("my 1st category", "my 2nd category",
"my 3rd category", "my nth category")))
This is very important for plotting, as this will determine the order in which the categories appear on the plot!
- T6: Create a boxplot of
hdl
versus BMI_class - Q17: What do you see?
- T7: Create a
BFP
(Body fat percentage) variable
Click here for hint
BFP can be calculated usin the below equation source:
\[BFP = 1.39 \cdot BMI + 0.16 \cdot age - 10.34 \cdot sex - 9\]
Where \(sex\) is defined as being \(0\) for female and \(1\) for male.- T8: Create a
WHR
(waist-to-hip ratio) variable - Q18: Which correlate better with
BMI
,WHR
orBFP
? GROUP ASSIGNMENT
Click here for hint
Is there a certain plot-type, which can visualise if the relationship between two variables and give insights to if they are correlated? Can you perhaps use anR
-function to compute the “correlation coefficient”?. Do not use e.g. ggpubr
, use only tidyverse
and base
)
Now, with this augmented data set, let us create some summary statistics
- Q19: How many women and men are there in the data set?
- Q20: How many women and men are there from Buckingham and Louisa respectively in the data set?
- Q21: How many are in each of the
BMI_class
groups? - Q22: Given the code below, explain the difference between A and B?
# A
|>
diabetes_data ggplot(aes(x = BMI_class)) +
geom_bar()
# B
|>
diabetes_data count(BMI_class) |>
ggplot(aes(x = BMI_class, y = n)) +
geom_col()
- T9: For each
BMI_class
group, calculate the average weight and associated standard deviation - Q23: What was the average age of the women living in Buckingham in the study?
Finally, if you reach this point and there is still time left. Take some time to do some exploratory plots of the data set and see if you can find something interesting.