Lab 4: Data Wrangling I

Package(s)

Schedule

Learning Materials

Please prepare the following materials:

Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book

Learning Objectives

A student who has met the objectives of the session will be able to:

  • Understand and apply the 6 basic dplyr verbs: filter(), arrange(), select(), mutate(), summarise() and group_by()
  • Construct and apply logical statements in context with dplyr pipelines
  • Understand and apply the additional verbs count(), drop_na(), View()
  • Combine dplyr verbs to form a data manipulation pipeline using the pipe |> operator
  • Decipher the components and functions hereof in a dplyr pipeline

Exercises

Important: As we progress, avoid using black-box do-everything-with-one-command R-packages like e.g. ggpubr - I want you to learn the underlying technical bio data science! …and why is that? Because if you use these types and-then-magic-happens packages, you will be limited to their functionality and what if you want to create something custom? Something beyond stock? Some awesome idea that you got? If you truly understand ggplot - the sky is the limit! Basically, it’s the cliché: “If you give a man a fish, you feed him for a day. If you teach a man to fish, you feed him for a lifetime”

Getting Started

First, make sure to read and discuss the feedback you got from last week’s assignment!

Then, once again go to the R for Bio Data Science RStudio Cloud Server

First things first:

  • Let’s create a new Quarto Document for today’s exercises, e.g. lab04_exercises.qmd

Brief refresh of Quarto so far

Recall the syntax for a new code chunk, where all your R code goes. Any text and notes must be outside the chunk tags:

```{r}
# Here the code goes
1 + 1
x <- c(1, 2, 3)
mean(x)
```
[1] 2
[1] 2

Our markdown goes outside the code-chunks, e.g.:

  # Header level 1
  ## Header level 2
  ### Header level 3
  *A text in italics*
  **A text in bold**
  Normal text describing and explaining

Now, in your new Quarto document, create a new Header level 2, e.g.:

  ## Load Libraries

and under your new header, add a new code chunk, like so

```{r}
#| message: false
library("tidyverse")
```

And run the chunk. This will load our data science toolbox, including dplyr (and ggplot). But wait, what does #| message: false do? 🤷️ 🤔

Try to structure your Quarto document as your course notes, i.e. add notes while solving the exercises aiming to create a reference that you can revisit for your final project work… And… For future reference after this course, where hopefully you’ll really reap what you sow

Bonus info: We are engineers, so of course, we love equations, we can include standard \(\LaTeX\) syntax, e.g.:

  $E(x) = \frac{1}{n} \cdot \sum_{i=1}^{n} x_{i}$

Try it!

A few handy short cuts

Insert new code chunk:

  • Mac: CMD + OPTION + i
  • Windows: CTRL + ALT + i

Render my Quarto document:

  • Mac: CMD + SHIFT + k
  • Win: CTRL + SHIFT + k

Run line in chunk:

  • Mac: CMD + ENTER
  • Win: CTRL + ENTER

Run entire chunk:

  • Mac: CMD + SHIFT + ENTER
  • Win: CTRL + SHIFT + ENTER

Insert the pipe symbol |>:

  • Mac: CMD + SHIFT + m
  • Win: CTRL + SHIFT + m

Note, if you’re trying this out and you see %>% instead of |>, then go to Tools, Global Options..., Code and check the box Use native pipeoperator, |>

A few initial questions

First, if you don’t feel completely comfortable with the group_by |> summarise workflow, then no worries - Feel free to visit this short R Tutorial: Grouping and summarizing

Then, in your groups, discuss the following primer questions. Note, when asked for “what is the output”, do not run the code in the console, instead try to talk and think about it and write your answers and notes in your Quarto document for the day:

First, in a new chunk, run tibble(x = c(4, 3, 5, 1, 2)), so you understand what it does, then - Discuss in your group, what is the output of, remember first talk, then check understanding by running code:

  • Q1: tibble(x = c(4, 3, 5, 1, 2)) |> filter(x > 2)?

  • Q2: tibble(x = c(4, 3, 5, 1, 2)) |> arrange(x)?

  • Q3: tibble(x = c(4, 3, 5, 1, 2)) |> arrange(desc(x))?

  • Q4: tibble(x = c(4, 3, 5, 1, 2)) |> arrange(desc(desc(x)))?

  • Q5: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(x)?

  • Q6: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(y)?

  • Q7: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(-x)?

  • Q8: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(-x, -y)?

  • Q9: tibble(x = c(4, 3, 5, 1, 2)) |> mutate(x_dbl = 2*x)?

  • Q10: tibble(x = c(4, 3, 5, 1, 2)) |> mutate(x_dbl = 2 * x, x_qdr = 2*x_dbl)?

  • Q11: tibble(x = c(4, 3, 5, 1, 2)) |> summarise(x_mu = mean(x))?

  • Q12: tibble(x = c(4, 3, 5, 1, 2)) |> summarise(x_max = max(x))?

  • Q13: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, NA, 5, 1, 2)) |> group_by(lbl) |> summarise(x_mu = mean(x), x_max = max(x))?

  • Q14: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) |> group_by(lbl) |> summarise(n = n())?

  • Q15: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) |> count(lbl)?

In the following, return to these questions and your answers for reference on the dplyr verbs!

Load data

Wait… If your write some R-code, which for some reason does not run, R will let you know what went wrong. Make sure to read these error messages. During this course the teaching team is available, but after the course, you’ll have to be comfortable with debugging, i.e. finding and fixing errors in your code. By-the-way, did you know the term debugging is attested to Admiral Grace Hopper?

Let’s continue - Again, add a new header to your Quarto document, e.g. ## Load Data, then:

  1. Go to the Vanderbilt Biostatistics Datasets site
  2. Find Diabetes data and download the diabetes.csv file
  3. You should have a data folder, if not, then in the Files pane, click the New Folder button, enter folder name data and click ok
  4. Now, click on the folder you created
  5. Click the Upload button and navigate to the diabetes.csv file you downloaded
  6. Clicking the two dots .. above the file you uploaded, look for .., will take you one level up in your project path
  7. Insert a new code chunk in your Quarto document
  8. Add and then run the following code
diabetes_data <- read_csv(file = data/diabetes.csv)
diabetes_data

Then realise that we could simply have run the following code to do the exact same thing (Yes, readr is pretty nifty):

# Create the data directory programmatically
dir.create(path = "data")

# Retrieve the data directly
diabetes_data <- read_csv(file = "https://hbiostat.org/data/repo/diabetes.csv")

# Write the data to disk
write_csv(x = diabetes_data,
          file = "data/diabetes.csv")

Just remember the echo/eval trick from last session to avoid retrieving online data each time you render your Quarto document

Work with the diabetes data set

First, take a few minutes to read about this dataset. Go to the Vanderbilt Department of Biostatistics site and find the Diabetes data header and directly below that, click the link diabetes.html. This will contain some meta data on the data.

Now, use the pipe |> to use the View() function to inspect the data set. Note, if you click the -button, you will get a spreadsheet-like view of the data, allowing you to get an overview (Psst… No need to use Excel…).

  • Q1: How many observations and how many variables?
  • Q2: Is this a tidy data set? Which three rules must be satisfied?
  • Q3: When you run the chunk, then underneath each column name is stated <chr> and <dbl> what is that?

Before we continue

  • T1: Change the height, weight, waist and hip from the imperial system (inches/pounds) to the metric system (cm/kg), rounding to 1 decimal

Let us try to take a closer look at various subsets of the data. For the following questions, “How many …” refers to the number of rows in the subset of the data you create:

  • Q4: How many weigh less than 100kg?

  • Q5: How many weigh more than 100kg?

  • Q6: How many weigh more than 100kg and are less than 1.6m tall?

  • Q7: How many women are taller than 1.8m?

  • Q8: How many men are taller than 1.8m?

  • Q9: How many women in Louisa are older than 30?

  • Q10: How many men in Buckingham are younger than 30 and taller than 1.9m?

  • T2: Make a scatter plot of weight versus height and colour by sex for inhabitants of Louisa above the age of 40

  • T3: Make a box plot of height versus location stratified on sex for people above the age of 50

Sorting columns can aid in getting an overview of variable ranges (don’t use the summary() function yet for this one)

  • Q11: How old is the youngest person?
  • Q12: How old is the oldest person?
  • Q13: Of all the 20-year-olds, what is the height of the tallest?
  • Q14: Of all the 20-year-olds, what is the height of the shortest?

Choosing specific columns can be used to work with a subset of the data for a specific purpose

  • Q15: How many columns (variables) starts_with a “b”?
  • Q16: How many columns (variables) contains the word “eight”?

Creating new variables is an integral part of data manipulation

  • T4: Create a new variable, where you calculate the BMI
  • T5: Create a BMI_class variable

Take a look at the following code snippet to get you started:

tibble(x = rnorm(10)) |> 
  mutate(trichotomised = case_when(
    x < -1 ~ "Less than -1",
    -1 <= x & x < 1 ~ "larger than or equal to -1 and smaller than 1",
    1 <= x ~ "Larger than or equal to 1"))

and then go read about BMI classification here and discuss in your group how to extract classifications from the Definition/Introduction section

Note, the cut() function could be used here, but you should try to use case_when() as illustrated in the example chunk above.

Once you have created the variable, you will need to convert it to a categorical variable, in R, these are called a factor and you can set the levels like so:

diabetes_data <- diabetes_data |>
  mutate(BMI_class = factor(BMI_class,
                            levels =  c("my 1st category", "my 2nd category",
                                        "my 3rd category", "my nth category")))

This is very important for plotting, as this will determine the order in which the categories appear on the plot!

  • T6: Create a box plot of hdl versus BMI_class
  • Q17: What do you see?
  • T7: Create a BFP (Body fat percentage) variable

Click here for hint

BFP can be calculated using the following equation from Jackson et al. (2002):

\[BFP = 1.39 \cdot BMI + 0.16 \cdot age - 10.34 \cdot sex - 9\]

where \(sex\) is defined as being \(0\) for female and \(1\) for male.
  • T8: Create a WHR (waist-to-hip ratio) variable
  • Q18: What correlates better with BMI: WHR or BFP? GROUP ASSIGNMENT (Important, see: how to)

Click here for hint

Is there a certain plot-type, which can visualise if the relationship between two variables and give insights to if they are correlated? Can you perhaps use an R function to compute the “correlation coefficient”?. Do not use e.g. ggpubr or any other and-then-magic-happens package, use the knowledge you’ve gained so far)

Now, with this augmented data set, let us create some summary statistics

  • Q19: How many women and men are there in the data set?
  • Q20: How many women and men are there from Buckingham and Louisa respectively in the data set?
  • Q21: How many are in each of the BMI_class groups?
  • Q22: Given the code below, explain the difference between A and B?
# A
diabetes_data |>
  ggplot(aes(x = BMI_class)) +
  geom_bar()

# B
diabetes_data |>
  count(BMI_class) |>
  ggplot(aes(x = BMI_class, y = n)) +
  geom_col()
  • T9: For each BMI_class group, calculate the average weight and associated standard deviation
  • Q23: What was the average age of the women living in Buckingham in the study?

Finally, if you reach this point and there is still time left. Take some time to do some exploratory plots of the data set and see if you can find something interesting.