<- seq(from = -3,
xy to = 3,
by = 0.01)
expand.grid(x = xy,
y = xy) |>
ggplot(
mapping = aes(
x = (1 - x - sin(y^2)),
y = (1 + y - cos(x^2)))) +
geom_point(alpha = 0.05,
shape = 20,
size = 0) +
theme_void() +
coord_polar()
Lab 6: Applying Functional Programming with Purrr to Models
Package(s)
Schedule
- 08.00 - 08.45: Recap of Lab 5
- 08.45 - 09.00: Lecture
- 09.00 - 09.15: Break
- 09.00 - 12.00: Exercises
Learning Materials
Please prepare the following materials:
- Important: Questionnaire (brief, 5-10 min): Course Midway Evaluation
- R4DS Book (Note, this is intentionally 1.ed.): Chapter 22: Introduction, Chapter 23: Model Basics, Chapter 24: Model Building, Chapter 25: Many models
- Video: Broom: Converting Statistical Models to Tidy Data Frames
- Video: Alex Hayes | Solving the model representation problem with broom | RStudio (2019)
- Video: “The Joy of Functional Programming (for Data Science)” with Hadley Wickham
- Optional: If you are completely new to statistical modelling, then click here for a primer
Learning Objectives
A student who has met the objectives of the session will be able to:
- Fit a simple linear model and interpret model parameters
- Understand and apply simple purrr-functions for element-wise function application
- Understand and apply grouped supervised models to form nested model objects
- Understand and apply the broom-functions for tidying various model objects
- Optional LO: Perform a basic principal component analysis for dimension reduction of high dimensional data
- Optional LO: Perform a basic unsupervised k-means clustering of high dimensional data
Exercises
Throwback…
Using the tibble()
function, re-create the data from this visualisation and then create your version of how this data could be visualised in a more informative manner:
Also… If you still need to be convinced of the flexibility of ggplot, try running this code:
If you are curious about what is going on here, try googling “Generative art”… Anyhoo… Let us move on…
Prologue
So far, we have worked on
- Lab 2: Genomis, the
gravier
-data from A prognostic DNA signature for T1T2 node-negative breast cancer patients - Lab 3: Metagenomics, the
pitlatrine
-data from “Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines” - Lab 4: Clinical data, the
diabetes
-data from Prevalence of coronary heart disease risk factors among rural blacks: a community-based study and A trial of church-based smoking cessation interventions for rural African Americans - Lab 5: High throughput immunoinformatics data, the
SARS-CoV-2
-data A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2
Getting Started
- Once again, please go to the course cloud RStudio server and login
- Create a new Quarto document and remember to save it
Today, we will re-examine the data acquired in Lab 2 to further explore the research question:
- What genes are significantly up-/down-regulated between the patients with- and without early metastasis?“
Load Data
Recall, so far we have adjusted the chunk setting to #| eval: true
first time we get some data from an online repository and then subsequently set #| eval: false
.
Let’s do that in a bit nicer way.
- T1: Create a new “Load Data” header in your document and add the below chunk:
<- "data/_raw/"
raw_dir <- "gravier.RData"
data_file <- "https://github.com/ramhiser/datamicroarray/raw/master/data/"
data_loc
if( !dir.exists(raw_dir) ){
dir.create(path = raw_dir)
}if( !file.exists(str_c(raw_dir, data_file)) ){
download.file(
url = str_c(data_loc, data_file),
destfile = str_c(raw_dir, data_file))
}load(file = str_c(raw_dir, data_file))
- Q1: In your group, discuss, what is going on here? Make sure you follow the difference between the first time you run this and the second!
Click here for hint
We have used the str_c()
-function before, try running e.g.:
<- "a"
x <- "b"
y str_c(x, y)
For the functions dir.exists()
and file.exists()
, the hint is in the title, they return logicals. To get a better understanding of this, try running e.g.:
<- 2
x == 2
x if( x == 2 ){
print("Yes!")
}
Then change x == 2
to x != 2
, re-run the code and see what happens
Clean Data
The next step is to clean up the data:
- Use the
ls()
-function to see what objects you have in your environment - Use the
str()
-function on the gravier data you retrieved to answer:
Q2: Discuss in your group if this is tidy data?
T2: Create a new “Clean Data” header in your document and add the below chunk:
<- gravier |>
gravier_clean bind_cols() |>
as_tibble()
Q3: Discuss in your group if this is tidy data?
Q4: In fact, specifically state what are the “rules” for tidy data?
Q5: In your group, discuss why
bind_cols
can by very very dangerous to use?
Now, moving on, let’s write the clean data to disk:
- T3: In your “Clean Data”-section, add a new chunk, where you write a tab-separated-values gzipped (i.e. compressed) file called “02_gravier_clean” (with the correct filetype specification) into your “data”-folder
Click here for hint
Just as you canread
a file, you can of course also write
a file. Note the filetype we want to write here is tab-separated-values. If you in the console type e.g. readr::wr
and then hit the tab
-button, you will see the different functions for writing different filetypes. We previously did a trick to automatically gzip (compress) files?
Augment Data
- T4: Create a new “Augment Data” header in your document and add the below chunk:
<- gravier_clean |>
gravier_clean_aug mutate(y = case_when(y == "poor" ~ 1,
== "good" ~ 0)) |>
y relocate(early_metastasis = y)
Q6: In your group, discuss, what each step of the above work flow does, i.e. what are the specifics of the
dplyr
-pipeline?T5: In your “Augment Data”-section, add a new chunk, where you write a tab-separated-values gzipped (i.e. compressed) file called “03_gravier_clean_aug” (with the correct filetype specification) into your “data”-folder
Analysis
One Gene, one model
- T6: Create a new “Analysis” header in your document
Recall, in the second lab, we were looking at “our favourite gene”. In the following either look back to what was your favourite gene or choose a new 🤷️
Let’s fit our first model! If the concept of models and linear regression is unfamiliar, consider checking out the primer on linear models in R before proceeding.
- T7: Use the
lm
-function to create your first model and save it to a new variable e.g. “my_first_model”
Click here for hint
Use the formulamy_favourite_gene ~ early_metastasis
and remember when you pipe into the lm
-function, you have to specify data = _
- Q7: What are your coefficients? Mine are:
(Intercept) early_metastasis
-0.01616011 -0.03426164
T8: Use the
group_by()
\(\rightarrow\)summarise()
-workflow to calculate the mean values of the gene expression for your favourite gene stratified onearly_metastasis
Q8: What are your mean values? Mine are:
# A tibble: 2 × 2
early_metastasis mu
<dbl> <dbl>
1 0 -0.0162
2 1 -0.0504
- Q9: Discuss in your group: How are your coefficients related to your mean expression values?
Click here for hint
Recall, we have two terms here,intercept
and slope
. The intercept is the y
-value at x = 0
and the slope
is the change in y
-value, for one unit change in x
- Q10: Discuss in your group: Is your gene up- or down-regulated from
early_metastasis = 0
toearly_metastasis = 1
and use thesummary()
-function to see if is it statistically significant at a level of significance of \(\alpha = 0.05\)?
Click here for hint
Try to runmy_first_model |> summary()
and look at the estimate for early_metastasis
, i.e. the slope and also see if you can find the p-value
somewhere in this summary…
Excellent! Now we have a feeling for working with the lm()
-functions and a basic understanding of the meaning of the coefficients, when we are using linear regression to model a binary outcome. The reason that we use a linear regression model in this case is, that for these exercises, we want to investigate the relationship between variables rather than obtaining probability predictions, i.e.
- What genes are significantly up-/down-regulated between the patients with- and without early metastasis?
So, without further ado, let’s dive in!
All the Genes, all the models
First, the recent couple of years have seen an immense development in unifying the modelling interface in R
, which is notoriously inconsistent. You may be familiar with the caret
-package, the developer of which has created tidymodels, (which I really wish we had time to explore in details). In the following we will work with some of the principles for tidying model object using broom
, having object nested in tibbles and working with these using purrr
.
Now, you saw above how we could fit one model for one gene. So, we could repeat the procedure you worked through for each gene, but first consider:
- Q11: How many genes are there in the gravier data set?
Now, we just have to make one model for each gene!
Honestly, let’s not! Also, recall: “We don’t loop, we func!”, so let’s see how that would work.
Models, models everywhere…
In principle, you could overflow your environment with individual model objects, but that would require a lot of code-lines and a lot of hard-coding. But let us instead see if we can come up with something just a tad more clever.
Preparing the Data
First:
- Q12: Discuss in your group, if the
gravier_clean_aug
is a “wide” or a “long” dataset?
Once you have agreed upon and understood why, this is a wide data set, proceed and:
- T9: Create this long version of your
gravier_clean_aug
data and save it ingravier_clean_aug_long
# A tibble: 488,040 × 3
early_metastasis gene log2_expr_level
<dbl> <chr> <dbl>
1 0 g2E09 -0.00144
2 0 g7F07 -0.00144
3 0 g1A01 -0.0831
4 0 g3C09 -0.0475
5 0 g3H08 0.0158
6 0 g1A08 -0.0336
7 0 g1B01 -0.136
8 0 g1int1 0.0180
9 0 g1E11 0.0257
10 0 g8G02 0.00720
# ℹ 488,030 more rows
Click here for hint
Recall which function took a dataset from “wide”- to “long”-format and also the three parameters of interest for that function iscols
, names_to
and values_to
. If you look at my example, you will see that the columns we have pivotted have types <chr>
and <dbl>
, these will map to the names_to
- and values_to
-parameters respectively. The cols
can be used with the so-called “helper-functions”, that we have previously talked about. If you look at the gene-columns in the data, all the genes starts with a “g”, so we can use the helper functions starts_with()
and then give the argument “g” to the match
-parameter
The reason we want the “long” version of the data set is, that now we have ALL the genes defined in ONE gene
-column. This means that we can use the group_by()
-function to work per gene without looping (Recall: Don’t loop, only func!).
Now:
- T10: Create a
dplyr
-pipeline, use thegroup_by()
-function to group yourgravier_clean_aug_long
-dataset bygene
and then add thenest()
- andungroup()
-functions to your pipeline
# A tibble: 2,905 × 2
gene data
<chr> <list>
1 g2E09 <tibble [168 × 2]>
2 g7F07 <tibble [168 × 2]>
3 g1A01 <tibble [168 × 2]>
4 g3C09 <tibble [168 × 2]>
5 g3H08 <tibble [168 × 2]>
6 g1A08 <tibble [168 × 2]>
7 g1B01 <tibble [168 × 2]>
8 g1int1 <tibble [168 × 2]>
9 g1E11 <tibble [168 × 2]>
10 g8G02 <tibble [168 × 2]>
# ℹ 2,895 more rows
Note, this is conceptually a super-tricky data structure!
- Q13: Discuss in your group, what happened to the data?
Click here for hint
Try to run each of the following code chunks and examine what happens at each step:
gravier_clean_aug_long_nested
|>
gravier_clean_aug_long_nested filter(gene == "g2E09") # Replace "g2E09" with whatever was YOUR favourite gene!
|>
gravier_clean_aug_long_nested filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
pull(data)
- Q14: Moreover, discuss in your group, what does
<tibble [168 × 2]>
mean?
Now, if you experiencing, that the server seems slow, consider if you want to proceed now with ALL the genes or just a subset. If you just want a subset, then use sample_n()
to randomly select e.g. 100 genes for further analysis. Remember you may want to use the set.seed()
function to create a reproducible work flow even when sampling.
Fitting Models
Now, recall our research question:
- What genes are significantly up-/down-regulated between the patients with- and without early metastasis?
To investigate this, we want to fit a linear model to each gene, i.e. as you did initially for your favourite gene, we want to do in a clever way per gene for ALL genes.
- T11: Use the
group_by()
-function to letR
know, that we want to work per gene
# A tibble: 2,905 × 2
# Groups: gene [2,905]
gene data
<chr> <list>
1 g2E09 <tibble [168 × 2]>
2 g7F07 <tibble [168 × 2]>
3 g1A01 <tibble [168 × 2]>
4 g3C09 <tibble [168 × 2]>
5 g3H08 <tibble [168 × 2]>
6 g1A08 <tibble [168 × 2]>
7 g1B01 <tibble [168 × 2]>
8 g1int1 <tibble [168 × 2]>
9 g1E11 <tibble [168 × 2]>
10 g8G02 <tibble [168 × 2]>
# ℹ 2,895 more rows
- T12: Then add a new line to your pipeline, where you add a new variable
model_object
to yourgravier_clean_aug_long_nested
-dataset, whichR
will compute per gene
To do this you will need the following syntax and then wrap that inside the relevant tidyverse-function for creating a new variable:
= map(.x = data,
model_object .f = ~lm(formula = log2_expr_level ~ early_metastasis,
data = .x))
Make sure to understand the map()
-function here, it is completely central to functional programming with purrr
:
- We need the
group_by()
to define which variable holds the elements to each of which we want to map model_object
is a new variable, we are creating, which will contain the result of our call to themap()
-function.x
to what exising (nested) variable are we mapping?.f
which function do we want to map to each element in the existing (nested) variable?- Note that
log2_expr_level
andearly_metastasis
are variables “inside” the nesteddata
-variable
Note, once again, this is conceptually a super-tricky data structure, not only do we have a per gene nested tibble, but now we also have a per gene nested model object - So please do make sure to discuss in your group, what is going on here, e.g. try running this and discuss what you see:
|>
gravier_clean_aug_long_nested filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
pull(model_object)
Tidying Models
Excellent! Now we have a per gene model. Let us use the broom
-package to extract some information from each of the models. First, to get a better understanding of what is going on when calling the tidy()
-function, try running this:
|>
gravier_clean_aug_long_nested
# Here, you should replace "g2E09" with whatever was YOUR favourite gene!
filter(gene == "g2E09") |>
# Pull() on tibbles: This pulls out the model_object variable.
# Note! This is a list, because we nested!
pull(model_object) |>
# Pluck() on lists: From the list we got from the last step,
# we "pluck" the first element
pluck(1) |>
# The result of pluck, is a model object,
# upon which we can call the tidy function
tidy(conf.int = TRUE,
conf.level = 0.95)
Now, we want to apply this tidy()
-function per model_object
:
- T13: Scroll a bit back to where we created the
model_object
and see if you can translate that into mapping thetidy()
-function to themodel_object
-variable, thereby creating a new variablemodel_object_tidy
- This is tricky, so do make sure to discuss in your group how this can be done!
Click here for hint
Remember the parameters to thetidy()
-functions, which gives us the confidence intervals and defines corresponding level. Also, everything you need is in the previous layout of the map()
-function. Note, that the calls to the pull()
- and pluck()
-functions above, pertains to extracting from a tibble, which we do not want, on the contrary, we want all objects to be contained in our gravier_clean_aug_long_nested
data
# A tibble: 2,905 × 4
# Groups: gene [2,905]
gene data model_object model_object_tidy
<chr> <list> <list> <list>
1 g2E09 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
2 g7F07 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
3 g1A01 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
4 g3C09 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
5 g3H08 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
6 g1A08 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
7 g1B01 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
8 g1int1 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
9 g1E11 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
10 g8G02 <tibble [168 × 2]> <lm> <tibble [2 × 7]>
# ℹ 2,895 more rows
Note, once again, this is conceptually a super-tricky data structure, not only do we have a per gene nested tibble, but now we also have a per gene nested model object and now also a nested tibble of tidyed objects - So please again do make sure to discuss in your group, what is going on here, e.g. try running this and discuss what you see:
|>
gravier_clean_aug_long_nested filter(gene == "g2E09") |> # Replace "g2E09" with whatever was YOUR favourite gene!
pull(model_object_tidy)
Wrangling
We’re almost there - Just a bit of wrangling to go!
Just as you saw that we could nest()
on a variable (recall we did that for the gene
-variable), you can do the opposite and lo and behold, that is the unnest()
-function. Before we continue:
- T14: Create a
dplyr
-pipeline and save the result in a new variable calledgravier_estimates
: Use theunnest()
-function to unpack themodel_object_tidy
# A tibble: 5,810 × 10
# Groups: gene [2,905]
gene data model_object term estimate std.error statistic p.value
<chr> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 g2E09 <tibble> <lm> (Intercept) -0.0162 0.0117 -1.38 1.69e-1
2 g2E09 <tibble> <lm> early_metas… -0.0343 0.0201 -1.71 8.99e-2
3 g7F07 <tibble> <lm> (Intercept) 0.0604 0.0139 4.35 2.36e-5
4 g7F07 <tibble> <lm> early_metas… -0.0185 0.0238 -0.778 4.38e-1
5 g1A01 <tibble> <lm> (Intercept) -0.0290 0.0123 -2.35 1.99e-2
6 g1A01 <tibble> <lm> early_metas… -0.0367 0.0211 -1.73 8.47e-2
7 g3C09 <tibble> <lm> (Intercept) 0.0518 0.0145 3.58 4.55e-4
8 g3C09 <tibble> <lm> early_metas… -0.0148 0.0249 -0.595 5.53e-1
9 g3H08 <tibble> <lm> (Intercept) 0.0142 0.0128 1.11 2.69e-1
10 g3H08 <tibble> <lm> early_metas… 0.00247 0.0220 0.112 9.11e-1
# ℹ 5,800 more rows
# ℹ 2 more variables: conf.low <dbl>, conf.high <dbl>
- T15: The again, create a
dplyr
-pipeline and save the result in a the samegravier_estimates
-variable: Subset the rows to only get the slope term and then choose variables as displayed below, finally end with un-grouping your data, as we no longer need the groups
# A tibble: 2,905 × 5
gene p.value estimate conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 g2E09 0.0899 -0.0343 -0.0739 0.00539
2 g7F07 0.438 -0.0185 -0.0655 0.0285
3 g1A01 0.0847 -0.0367 -0.0784 0.00508
4 g3C09 0.553 -0.0148 -0.0639 0.0343
5 g3H08 0.911 0.00247 -0.0410 0.0459
6 g1A08 0.859 -0.00363 -0.0438 0.0366
7 g1B01 0.279 -0.0218 -0.0615 0.0178
8 g1int1 0.666 -0.0113 -0.0627 0.0402
9 g1E11 0.106 -0.0329 -0.0728 0.00703
10 g8G02 0.633 -0.0108 -0.0555 0.0338
# ℹ 2,895 more rows
- T16: To your
gravier_estimates
-dataset, add a variableq.value
, which is the result of calling thep.adjust()
-function on yourp.value
-variable and also add an indicator variable denoting if a given gene is significant or not
# A tibble: 2,905 × 7
gene p.value estimate conf.low conf.high q.value is_significant
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 g2E09 0.0899 -0.0343 -0.0739 0.00539 1 no
2 g7F07 0.438 -0.0185 -0.0655 0.0285 1 no
3 g1A01 0.0847 -0.0367 -0.0784 0.00508 1 no
4 g3C09 0.553 -0.0148 -0.0639 0.0343 1 no
5 g3H08 0.911 0.00247 -0.0410 0.0459 1 no
6 g1A08 0.859 -0.00363 -0.0438 0.0366 1 no
7 g1B01 0.279 -0.0218 -0.0615 0.0178 1 no
8 g1int1 0.666 -0.0113 -0.0627 0.0402 1 no
9 g1E11 0.106 -0.0329 -0.0728 0.00703 1 no
10 g8G02 0.633 -0.0108 -0.0555 0.0338 1 no
# ℹ 2,895 more rows
But… q.value
??? What are you on about??? Here is a nice primer on “How does multiple testing correction work?”
Recap
If you understandbly by now, have lost a bit of overview of what is going on, let’s just re-iterate.
- We have the
gravier
-dataset, with the log2-expression levels for 2,905 genes of 168 patients of whom 111 did not have early metastasis and 57 who did - We are interested in investigating what genes are significantly up-/down-regulated between the patients with- and without early metastasis
- First we retrieved the data from the data repository, cleaned and augmented it and saved it to disk
- Then pivotted the data, so we could work per gene (The
gene
-variable) - Next, we grouped per gene and nested the data (The
data
-variable) - Then, we fitted a linear model to each gene (The
model_object
-variable) - Next, we used the
broom
-package to tidy the fitted model incl. getting confidence intervals (Themodel_object_tidy
-variable) - Lastly, we extracted the model parameters, corrected for multiple testing and added and indicator for significant findings
Now, we actually have everything we need to answer:
- What genes are significantly up-/down-regulated between the patients with- and without early metastasis?“
In the following, we will use a level of significance of \(\alpha = 0.05\) to provide this answer.
Visualise
Right, let’s get to it!
- T17: Re-create this
forest-plot
to finally reveal the results of your analysis GROUP ASSIGNMENT part I
Click here for hint
Here, you will have to use your indicator variable to identify significant genes before plotting and then it would probably be prudent to take a look at thefct_reorder()
-function and geom_errorbarh()
-representation
- T18: Re-create this
volcano-plot
to finally reveal the results of your analysis GROUP ASSIGNMENT part II
Click here for hint
Here,before plotting you will have to create a new label-variable which takes valuegene
for significant genes and otherwise is simply an empty string, which we denote by ""
. Also, perhaps the ggrepel
-package would be relevant for somehow adding text/labels
GROUP ASSIGNMENT
For the group assignment this time, you will use T17 and T18 to again create a reproducible micro-report and make sure to:
Optional
The following is only if you can find the time! But, perhaps this would be something interesting to revisit during the project period
PCA
- Go and visit this blog post by Claus O. Wilke, Professor of Integrative Biology.
- Your task is to work through the blog post using the
gravier
-dataset to crate a PCA-analysis
K-means
- Go and visit this K-means clustering with tidy data principles post
- Your task is to work through the blog post using the
gravier
-dataset to crate a kmeans-analysis