Lab 4: Data Wrangling I

Package(s)

Schedule

Learning Materials

Please prepare the following materials:

Learning Objectives

A student who has met the objectives of the session will be able to:

  • Understand and apply the 6 basic dplyr verbs: filter(), arrange(), select(), mutate(), summarise() and group_by()
  • Construct and apply logical statements in context with dplyr pipelines
  • Understand and apply the additional verbs count(), drop_na(), View()
  • Combine dplyr verbs to form a data manipulation pipeline using the pipe |> operator
  • Decipher the components and functions hereof in a dplyr pipeline

Exercises

Important: As we progress, avoid using black-box do-everything-with-one-command R-packages like e.g. ggpubr - I want you to learn the underlying technical bio data science! …and why is that? Because if you use these type of packages, you will be limited to their functionality, whereas if you truly understand ggplot - The sky is the limit! Basically, it the cliché that “If you give a man a fish, you feed him for a day. If you teach a man to fish, you feed him for a lifetime”

Getting Started

Use this link to go to the R for Bio Data Science RStudio Cloud Server

First things first:

  • Create a new Quarto Document for todays exercises, e.g. lab04_exercises.qmd

Note that: - The Quarto document MUST be placed together with your .Rproj file (defining, the project root - look in your Files-pane) and also there, the data-folder should be placed!

Understanding how paths and projects work is important. In case this is no entirely clear, a new chapter has been added on Paths and Projects.

Brief refresh of Quarto so far

Recall the syntax for a new code chunk, where all your R code goes any text and notes must be outside the chunk tags:

```{r}
# Here the code goes
1 + 1
x <- c(1, 2, 3)
mean(x)
```
[1] 2
[1] 2

Outside the code-chunks, go our markdown, e.g.:

  # Header level 1
  ## Header level 2
  ### Header level 3
  *A text in italics*
  **A text in bold**
  Normal text describing and explaining

Now, in your new Quarto document, create a new Header level 2, e.g.:

  ## Load Libraries

and under your new header, add a new code chunk, like so

```{r}
#| message: false
library("tidyverse")
```

And run the chunk. This will load our data science toolbox, including dplyr (and ggplot). But wait, what does the #| message: false-part do? 🤷️ 🤔

Try to structure your Rmarkdown document as your course notes, i.e. add notes while solving the exercises aiming at creating a revisitable reference for your final project work

Bonus info: We are engineers, so of course, we love equations, we can include standard \(\LaTeX\) syntax, e.g.:

  $E(x) = \frac{1}{n} \cdot \sum_{i=1}^{n} x_{i}$

Try it!

A few handy short cuts

Insert new code chunk:

  • Mac: CTRL + OPTION + i
  • Windows: CTRL + OPTION + i

Render my Quarto document

  • Mac: CMD + SHIFT + k
  • Win: CTRL + SHIFT + k

Run line in chunk

  • Mac: CMD + ENTER
  • Win: CTRL + ENTER

Run entire chunk

  • Mac: CMD + SHIFT + ENTER
  • Win: CTRL + SHIFT + ENTER

Insert the pipe symbol |>

  • Mac: CMD + SHIFT + m
  • Win: CTRL + SHIFT + m

Note, if you’re trying this out and you see %>% instead of |>, then go to Tools, Global Options..., Code and check the box Use native pipeoperator

A few initial questions

First, if you don’t feel completely comfortable with the group_by |> summarise-workflow, then no worries - Feel free to visit this short R Tutorial: Grouping and summarizing

Then, in your groups, discuss the following primer questions. Note, when asked for “what is the output”, do not run the code in the console, instead try to talk and think about it and write your answers and notes in your Quarto document for the day:

First, in a new chunk, run tibble(x = c(4, 3, 5, 1, 2)), so you understand what it does, then - Discuss in your group, what is the output of, remember first talk, then check understanding by running code:

  • Q1: tibble(x = c(4, 3, 5, 1, 2)) |> filter(x > 2)?

  • Q2: tibble(x = c(4, 3, 5, 1, 2)) |> arrange(x)?

  • Q3: tibble(x = c(4, 3, 5, 1, 2)) |> arrange(desc(x))?

  • Q4: tibble(x = c(4, 3, 5, 1, 2)) |> arrange(desc(desc(x)))?

  • Q5: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(x)?

  • Q6: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(y)?

  • Q7: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(-x)?

  • Q8: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(-x, -y)?

  • Q9: tibble(x = c(4, 3, 5, 1, 2)) |> mutate(x_dbl = 2*x)?

  • Q10: tibble(x = c(4, 3, 5, 1, 2)) |> mutate(x_dbl = 2 * x, x_qdr = 2*x_dbl)?

  • Q11: tibble(x = c(4, 3, 5, 1, 2)) |> summarise(x_mu = mean(x))?

  • Q12: tibble(x = c(4, 3, 5, 1, 2)) |> summarise(x_max = max(x))?

  • Q13: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, NA, 5, 1, 2)) |> group_by(lbl) |> summarise(x_mu = mean(x), x_max = max(x))?

  • Q14: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) |> group_by(lbl) |> summarise(n = n())?

  • Q15: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) |> count(lbl)?

In the following, return to these questions and your answers for reference on the dplyr verbs!

Load data

Again, add a new header to your Quarto document, e.g. ## Load Data, then:

  1. Go to the Vanderbilt Biostatistics Datasets site
  2. Find Diabetes data and download the diabetes.csv file
  3. You should have a data-folder, if not, then in the Files pane, click the New Folder button, enter folder name data and click ok
  4. Now, click on the folder you created
  5. Click the Upload-button and navigate to the diabetes.csv file you downloaded
  6. Clicking the two dots .. above the file you uploaded, look for .., will take you one level up in your project path
  7. Insert a new code chunk in your Quarto document
  8. Add and then run the following code
diabetes_data <- read_csv(file = "data/diabetes.csv")
diabetes_data

Then realise that we could simply have run the following code to do the exact same thing (Yes, readr is pretty nifty):

# Create the data directory programmatically
dir_create(x = "data")

# Retrieve the data directly
diabetes_data <- read_csv(file = "https://hbiostat.org/data/repo/diabetes.csv")

# Write the data to disk
write_csv(x = diabetes_data,
          file = "data/diabetes.csv")

Just remember the echo/eval trick from last session to avoid retrieving online data each time you render your Quarto document

Work with the diabetes data set

Use the pipe |> to use the View()-function to inspect the data set. Note, if you click the -button, you will get a spreadsheet-like view of the data, allowing you to get an overview.

  • Q1: How many observations and how many variables?
  • Q2: Is this a tidy data set? Which three rules must be satisfied?
  • Q3: When you run the chunk, then underneath each column name is stated <chr> and <dbl> what is that?

Before we continue

  • T1: Change the height, weight, waist and hip from inches/pounds to the metric system (cm/kg), rounding to 1 decimal

Let us try to take a closer look at the data by various subsetting (How many… is equal to the number of rows in the subset of the data you created):

  • Q4: How many weigh less than 100kg?

  • Q5: How many weigh more than 100kg?

  • Q6: How many weigh more than 100kg and are less than 1.6m tall?

  • Q7: How many women are taller than 1.8m?

  • Q8: How many men are taller than 1.8m?

  • Q9: How many women in Louisa are older than 30?

  • Q10: How many men in Buckingham are younger than 30 and taller than 1.9m?

  • T2: Make a scatter plot of weight versus height and colour by sex for inhabitants of Louisa above the age of 40

  • T3: Make a boxplot of height versus location stratified on sex for people above the age of 50

Sorting columns can aid in getting an overview of variable ranges (don’t use the summary function yet for this one)

  • Q11: How old is the youngest person?
  • Q12: How old is the oldest person?
  • Q13: Of all the 20-year olds, what is the height of the tallest?
  • Q14: Of all the 20-year olds, what is the height of the shortest?

Choosing specific columns can be used to work with a subset of the data for a specific purpose

  • Q15: How many columns (variables) starts_with a “b”?
  • Q16: How many columns (variables) contains the word “eight”?

Creating new variables is an integral part of data manipulation

  • T4: Create a new variable, where you calculate the BMI
  • T5: Create a BMI_class variable

Take a look at the following code snippet to get you started:

tibble(x = rnorm(10)) |> 
  mutate(trichotomised = case_when(
    x < -1 ~ "Less than -1",
    -1 <= x & x < 1 ~ "larger than or equal to -1 and smaller than 1",
    1 <= x ~ "Larger than or equal to 1"))

and then go read about BMI classification here and discuss in your group how to extract classifications from the Definition/Introduction section

Note, the cut()-function could be used here, but you should try to use case_when() as illustrated in the example chunk above.

Once you have created the variable, you will need to convert it to a categorical variable, in R, these are called a factor and you can set the levels like so:

diabetes_data <- diabetes_data |>
  mutate(BMI_class = factor(BMI_class,
                            levels =  c("my 1st category", "my 2nd category",
                                        "my 3rd category", "my nth category")))

This is very important for plotting, as this will determine the order in which the categories appear on the plot!

  • T6: Create a boxplot of hdl versus BMI_class
  • Q17: What do you see?
  • T7: Create a BFP (Body fat percentage) variable

Click here for hint

BFP can be calculated usin the below equation source:

\[BFP = 1.39 \cdot BMI + 0.16 \cdot age - 10.34 \cdot sex - 9\]

Where \(sex\) is defined as being \(0\) for female and \(1\) for male.
  • T8: Create a WHR (waist-to-hip ratio) variable
  • Q18: Which correlate better with BMI, WHR or BFP? GROUP ASSIGNMENT

Click here for hint

Is there a certain plot-type, which can visualise if the relationship between two variables and give insights to if they are correlated? Can you perhaps use an R-function to compute the “correlation coefficient”?. Do not use e.g. ggpubr, use only tidyverse and base)

Now, with this augmented data set, let us create some summary statistics

  • Q19: How many women and men are there in the data set?
  • Q20: How many women and men are there from Buckingham and Louisa respectively in the data set?
  • Q21: How many are in each of the BMI_class groups?
  • Q22: Given the code below, explain the difference between A and B?
# A
diabetes_data |>
  ggplot(aes(x = BMI_class)) +
  geom_bar()

# B
diabetes_data |>
  count(BMI_class) |>
  ggplot(aes(x = BMI_class, y = n)) +
  geom_col()
  • T9: For each BMI_class group, calculate the average weight and associated standard deviation
  • Q23: What was the average age of the women living in Buckingham in the study?

Finally, if you reach this point and there is still time left. Take some time to do some exploratory plots of the data set and see if you can find something interesting.