Lab 5: Data Wrangling II
Package(s)
Schedule
- 08.00 - 08.30: Recap of Lab 4
- 08.30 - 08.35: Lecture
- 08.35 - 08.45: Break
- 08.45 - 12.00: Exercises
Learning Materials
Please prepare the following materials
- R4DS2e book: Chapter 6: Data Tidying, Chapter 15: Strings, Chapter 17: Factors, Chapter 20: Joins
- Video: Tidy Data and tidyr - NB! Start at 7:45 and please note: gather() is now pivot_longer() and spread() is now pivot_wider()
- Video: Working with Two Datasets: Binds, Set Operations, and Joins
- Video: stringr (Playlist with 7 short videos)
Learning Objectives
A student who has met the objectives of the session will be able to:
- Understand and apply the various str_*()-functions for string manipulation
- Understand and apply the family of *_join()-functions for combining data sets
- Understand and apply pivot_wider() and pivot_longer()
- Use factors in context with plotting categorical data using ggplot
Exercises
Prologue
Today will not be easy! But please try to remember Hadley’s words of advice:
- “The bad news is, whenever you’re learning a new tool, for a long time, you’re going to suck! It’s gonna be very frustrating! But the good news is that that is typical and something that happens to everyone and it’s only temporary! Unfortunately, there is no way to going from knowing nothing about the subject to knowing something about a subject and being an expert in it without going through a period of great frustration and much suckiness! Keep pushing through!” - H. Wickham (dplyr tutorial at useR 2014, 4:10 - 4:48)
Intro
We are upping the game here, so expect to get stuck at some of the questions. Remember - Discuss with your group how to solve the task, revisit the materials you prepared for today and naturally, the TAs and I are happy to nudge you in the right direction. Finally, remember… Have fun!
Remember what you have worked on so far:
- RStudio
- Quarto
- ggplot
- filter
- arrange
- select
- mutate
- group_by
- summarise
- The pipe and creating pipelines
- stringr
- joining data
- pivoting data
That’s quite a lot! Well done - You’ve come quite far already! Remember to think about the above tools in the following as we will synthesise your learnings so far into an analysis!
Background
In the early 2020s, the world was hit by the coronavirus disease 2019 (COVID-19) pandemic. The pandemic was caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In Denmark, the first case was confirmed on 27 February 2020.
While initially very little was known about the SARS-CoV-2 virus, we did know the general pathology of viruses. Briefly, the virus invades the cells and hijacks the intra-cellular machinery. Using the hijacked machinery, components for new virus particles are produced, eventually being packed into the viral envelope and released from the infected cell. Some of these components, viral proteins, are broken down into smaller fragments called peptides by the proteasome. These peptides are transported into the endoplasmic reticulum by the Transporter associated with Antigen Processing (TAP) protein complex. Here, aided by chaperones, they are bound to the Major Histocompatibility Complex class I (MHCI) and then, via the Golgi apparatus, they are finally displayed on the surface of the cells. Note, in humans, MHC is also called Human Leukocyte Antigen (HLA) and the HLA genes are among the most diverse genes in the human genome. Each of us has a total of 6 HLA-alleles, 3 from the maternal and 3 from the paternal side. These alleles belong to 3 genes, HLA-A, HLA-B and HLA-C, and the combination of these constitutes the HLA-haplotype for an individual. Once the peptide is bound to the MHC Class I at the cell surface and exposed, the MHCI-peptide complex can be recognised by CD8+ Cytotoxic T-Lymphocytes (CTLs) via the T-cell Receptor (TCR). If a cell displays peptides of viral origin, the CTL gets activated and via a cascade induces apoptosis (programmed cell death) of the infected cell. The process is summarised in the figure below.
Image source: 10.3389/fmicb.2015.00021
The data we will be working with today contains data on sequenced T-cell receptors, viral antigens, HLA-haplotypes and clinical meta data for a cohort:
Your Task Today
Today, we will emulate the situation, where you are working as a Bioinformatician / Bio Data Scientist and you have been given the data and the task of answering these two burning questions:
- What characterises the peptides binding to the HLAs?
- What characterises T-cell Receptors binding to the pMHC-complexes?
GROUP ASSIGNMENT: Today, your assignment will be to create a micro-report on these 2 questions!
Getting Started
- Click here to go to the course RStudio cloud server and login
- Make sure you are in your r_for_bio_data_science-project, you can verify this in the upper right corner
- In the same place as your r_for_bio_data_science.Rproj-file and existing data-folder, create a new folder and name it doc
- Go to the aforementioned manuscript. Download the PDF and upload it to your new doc-folder
- Open the PDF and find the link to the data
- Go to the data site (Note, you may have to create an account to download, it shouldn’t take too long). Find and download the file ImmuneCODE-MIRA-Release002.1.zip (CAREFUL, do not download the superseded files)
- Unpack the downloaded file
- Find the files peptide-detail-ci.csv and subject-metadata.csv and compress each of them to a .zip-file
- Upload the compressed peptide-detail-ci.csv.zip- and subject-metadata.csv.zip-files to your data-folder in your RStudio Cloud session
- Finally, once again, create a new Quarto document for today’s exercises, containing the sections:
- Background
- Aim
- Load Libraries
- Load Data
- Data Description
- Analysis
Creating the Micro-Report
Background
Feel free to copy paste the one stated in the background-section above
Aim
State the aim of the micro-report, i.e. what are the questions you are addressing?
Load Libraries
Load the libraries needed
Load Data
Read the two data sets into variables peptide_data and meta_data.
Click here for hint
Think about which Tidyverse package deals with reading data and what are the file types we want to read here?
Data Description
It is customary to include a description of the data, helping the reader of the report, i.e. your stakeholder, to get an easy overview.
The Subject Meta Data
Let’s take a look at the meta data:
meta_data |>
  sample_n(10)
# A tibble: 10 × 30
Experiment Subject `Cell Type` `Target Type` Cohort Age Gender Race
<chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 eJL154 83 PBMC C19_cI COVID-19-Exp… 35 F Nati…
2 eJL164 466 PBMC C19_cI COVID-19-B-N… 33 M White
3 ePD79 2162 PBMC C19_cI COVID-19-Con… 79 F <NA>
4 eLH49 7517 PBMC C19_cI COVID-19-Con… 76 M <NA>
5 eLH44 6501 PBMC C19_cI COVID-19-Con… 61 F <NA>
6 eHO132 26 PBMC C19_cI COVID-19-Con… 65 F White
7 eHH173 19829 naive_CD8 C19_cI Healthy (No … 50 M White
8 eAV100 1995 PBMC C19_cII COVID-19-Con… 29 F <NA>
9 eQD113 7477 PBMC C19_cI COVID-19-Con… 36 M <NA>
10 eEE226 19617 naive_CD8 C19_cI Healthy (No … 21 F White
# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,
# `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,
# `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
# DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
# DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
# DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>,
# DRB5...30 <chr>
Q1: How many observations of how many variables are in the data?
Q2: Are there groupings in the variables, i.e. do certain variables “go together” somehow?
T1: Re-create this plot
Read this first:
- Think about: What is on the x-axis? What is on the y-axis? And also, it looks like we need to do some counting. Recall, that we can stick together a dplyr-pipeline with a call to ggplot, so here we will have to count Cohort and Gender before plotting
Does your plot look different somehow? Consider peeking at the hint…
Click here for hint
Perhaps not everyone agrees on how to denote NAs in data. I have seen -99, -11, _ and so on… Perhaps this can be dealt with at the moment we read the data from the file? I.e. in the actual function call to your read-function. Recall, how can we get information on the parameters of a function? Try ?function
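As a small illustration of the idea in the hint, the sketch below reads a made-up CSV string in which missing values are coded as -99 and missing. Both codes and the toy columns are assumptions for the example and are not taken from the lab data. The na-argument of read_csv() tells readr which strings to treat as NA.

```r
library(readr)

# Toy CSV text; "-99" and "missing" are made-up missing-value codes
csv_text <- "id,age,gender\n1,34,F\n2,-99,missing\n3,51,M\n"

# I() marks the string as literal data rather than a file path
toy <- read_csv(file = I(csv_text),
                na = c("", "NA", "-99", "missing"),
                show_col_types = FALSE)
toy
```

Inspect your own file first to see which strings are used for missing values before deciding what to pass to na.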
- T2: Re-create this plot
Click here for hint
Perhaps there is a function, which can cut continuous observations into a set of bins?
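To see what binning does, here is a minimal sketch on a made-up vector of ages using base R's cut(). The break points are an arbitrary choice for the example. ggplot2 also offers helpers such as cut_width(), which may be more convenient inside a pipeline.

```r
# Made-up ages, binned into intervals of width 20
ages <- c(21, 35, 42, 58, 79)
age_group <- cut(ages, breaks = seq(from = 20, to = 80, by = 20))

# Each age is now a factor level describing the interval it falls in
table(age_group)
```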
STOP! Make sure you handled how NAs are denoted in the data before proceeding, see hint below T1
- T3: Look at the data and create yet another plot as you see fit. Also skip the redundant variables Subject, Cell Type and Target Type
meta_data |>
  sample_n(10)
# A tibble: 10 × 27
Experiment Cohort Age Gender Race `HLA-A...9` `HLA-A...10` `HLA-B...11`
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 eXL32 Healthy … 37 F White "A*01:01" "A*02:01" "B*15:01"
2 eJL161 COVID-19… 31 F White "A*01:01:0… "A*02:01:01" "B*08:01:01"
3 eJL164 COVID-19… 33 M White "A*02:01:0… "A*24:02:01" "B*15:01:01"
4 eHO141 COVID-19… NA <NA> <NA> "" "" ""
5 eHH173 Healthy … 50 M White "A*02:01" "A*03:01" "B*35:01"
6 eLH53 COVID-19… 42 M White "A*01:01:0… "A*11:01:01" "B*55:01:01"
7 eQD135 COVID-19… 74 M <NA> "A*02:01:0… "A*24:02:01" "B*07:02:01"
8 eLH49 COVID-19… 76 M <NA> "A*03:01:0… "A*29:02:01" "B*07:02:01"
9 eHO131 COVID-19… 58 F <NA> "A*02:01:0… "A*02:01:01" "B*15:01:01"
10 eHH174 Healthy … 31 F White "A*01:01" "A*02:01" "B*08:01"
# ℹ 19 more variables: `HLA-B...12` <chr>, `HLA-C...13` <chr>,
# `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
# DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
# DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
# DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>,
# DRB5...30 <chr>
Now, a classic way of describing a cohort, i.e. the group of subjects used for the study, is the so-called table1 and while we could build this ourselves, this one time, in the interest of exercise focus and time, we are going to “cheat” and use an R-package, like so:
NB!: This may look a bit odd initially, but if you render your document, you should be all good!
library("table1") # <= Yes, this should normally go at the beginning!
meta_data |>
  mutate(Gender = factor(Gender),
         Cohort = factor(Cohort)) |>
  table1(x = formula(~ Gender + Age + Race | Cohort),
         data = _)
| | COVID-19-Acute (N=4) | COVID-19-B-Non-Acute (N=8) | COVID-19-Convalescent (N=90) | COVID-19-Exposed (N=3) | Healthy (No known exposure) (N=39) | Overall (N=144) |
|---|---|---|---|---|---|---|
| Gender | | | | | | |
| F | 1 (25.0%) | 4 (50.0%) | 33 (36.7%) | 1 (33.3%) | 17 (43.6%) | 56 (38.9%) |
| M | 2 (50.0%) | 3 (37.5%) | 36 (40.0%) | 0 (0%) | 21 (53.8%) | 62 (43.1%) |
| Missing | 1 (25.0%) | 1 (12.5%) | 21 (23.3%) | 2 (66.7%) | 1 (2.6%) | 26 (18.1%) |
| Age | | | | | | |
| Mean (SD) | 50.7 (17.0) | 43.7 (7.74) | 51.5 (15.3) | 35.0 (NA) | 33.3 (9.93) | 44.9 (15.7) |
| Median [Min, Max] | 52.0 [33.0, 67.0] | 42.0 [33.0, 53.0] | 53.0 [21.0, 79.0] | 35.0 [35.0, 35.0] | 31.0 [21.0, 62.0] | 42.0 [21.0, 79.0] |
| Missing | 1 (25.0%) | 1 (12.5%) | 21 (23.3%) | 2 (66.7%) | 0 (0%) | 25 (17.4%) |
| Race | | | | | | |
| African American | 1 (25.0%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (2.6%) | 2 (1.4%) |
| White | 2 (50.0%) | 7 (87.5%) | 13 (14.4%) | 0 (0%) | 28 (71.8%) | 50 (34.7%) |
| Asian | 0 (0%) | 0 (0%) | 3 (3.3%) | 0 (0%) | 2 (5.1%) | 5 (3.5%) |
| Hispanic or Latino/a | 0 (0%) | 0 (0%) | 1 (1.1%) | 0 (0%) | 0 (0%) | 1 (0.7%) |
| Native Hawaiian or Other Pacific Islander | 0 (0%) | 0 (0%) | 0 (0%) | 1 (33.3%) | 0 (0%) | 1 (0.7%) |
| Black or African American | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 3 (7.7%) | 3 (2.1%) |
| Mixed Race | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (2.6%) | 1 (0.7%) |
| Missing | 1 (25.0%) | 1 (12.5%) | 73 (81.1%) | 2 (66.7%) | 4 (10.3%) | 81 (56.3%) |
Note how good this looks! If you have ever done a “Table 1” before, you know how painful they can be and especially if something changes in your cohort - Dynamic reporting to the rescue!
Lastly, before we proceed, the meta_data contains HLA data for both class I and class II (see background), but here we are only interested in class I. Recall these are denoted HLA-A, HLA-B and HLA-C, so make sure to remove any non-class I variables, i.e. the ones after, denoted D-something.
- T4: Create a new version of the meta_data, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9 becomes A1 and HLA-A...10 becomes A2 and so on for B1, B2, C1 and C2 (Think: How can we rename variables? And here, just do it “manually” per variable). Remember to assign this new data to the same meta_data-variable
Click here for hint
Which tidyverse function subsets variables? Perhaps there is a function, which somehow matches a set of variables? And perhaps for the initiated this is compatible with regular expressions (If you don’t know what this means - No worries! If you do, see if you can utilise this to simplify your variable selection)
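To illustrate the idea of matching variable names, here is a minimal sketch on a made-up tibble. The column names are assumptions for the example and unrelated to the lab data. Inside select(), the helper matches() takes a regular expression and keeps the columns whose names match it.

```r
library(dplyr)

# Made-up data with two groups of similarly named columns
toy <- tibble(id      = 1:2,
              score_a = c(0.1, 0.2),
              score_b = c(0.3, 0.4),
              note    = c("x", "y"))

# Keep id and every column whose name starts with "score_"
toy_scores <- toy |>
  select(id, matches("^score_"))
toy_scores
```

For a simple prefix like this one, starts_with("score_") would select the same columns without a regular expression.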
Before we proceed, this is the data we will carry on with:
meta_data |>
  sample_n(10)
# A tibble: 10 × 11
Experiment Cohort Age Gender Race A1 A2 B1 B2 C1 C2
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 eJL162 COVID-19-C… 61 M <NA> "A*0… "A*0… "B*4… "B*5… "C*0… "C*0…
2 eQD135 COVID-19-C… 74 M <NA> "A*0… "A*2… "B*0… "B*0… "C*0… "C*0…
3 ePD86 COVID-19-C… 58 M White "A*0… "A*2… "B*4… "B*5… "C*0… "C*1…
4 ePD100 COVID-19-C… 66 M <NA> "" "" "" "" "" ""
5 eHO125 COVID-19-C… 52 M <NA> "A*0… "A*0… "B*3… "B*4… "C*0… "C*0…
6 eMR25 COVID-19-C… 21 F <NA> "" "" "" "" "" ""
7 eMR26 COVID-19-C… 62 M <NA> "A*0… "A*0… "B*0… "B*5… "C*0… "C*0…
8 eQD116 COVID-19-C… 66 F <NA> "A*0… "A*1… "B*3… "B*3… "C*0… "C*0…
9 eEE243 Healthy (N… 32 F <NA> "A*0… "A*0… "B*2… "B*4… "C*0… "C*0…
10 eQD136 COVID-19-C… NA <NA> <NA> "A*0… "A*6… "B*0… "B*1… "C*0… "C*0…
Now, we have a beautiful tidy-dataset, recall that this entails, that each row is an observation, each column is a variable and each cell holds one value.
The Peptide Details Data
Let’s start with simply having a look see:
peptide_data |>
  sample_n(10)
# A tibble: 10 × 7
`TCR BioIdentity` TCR Nucleotide Seque…¹ Experiment `ORF Coverage`
<chr> <chr> <chr> <chr>
1 CASSLVQGALGDTQYF+TCRBV11-02… CCTGCAAAGCTTGAGGACTCG… eEE243 ORF1ab
2 CASSDRDRVGNEQFF+TCRBV10-02+… GAGTCAGCTACCCGCTCCCAG… eEE226 surface glyco…
3 CASSTGTGHSAEAFF+TCRBV19-01+… ACATCGGCCCAAAAGAACCCG… ePD83 ORF3a
4 CASSLSWGGPGYTF+TCRBV27-01+T… CTGGAGTCGCCCAGCCCCAAC… eAV91 surface glyco…
5 CSGTGGVQETQYF+TCRBV20-X+TCR… ACAGTGACCAGTGCCCATCCT… eXL27 ORF7b
6 CAWSVLVSEQYF+TCRBV30-01+TCR… CTGAGTTCTAAGAAGCTCCTT… eLH47 ORF7b
7 CSARDSTTEYEQYF+TCRBV20-X+TC… GTGACCAGTGCCCATCCTGAA… eOX52 ORF8
8 CASSQGAPVDTQYF+TCRBV27-01+T… CTGGAGTCGCCCAGCCCCAAC… eHO134 ORF1ab
9 CASSQVGGAEQFF+TCRBV04-02+TC… CACCTACACACCCTGCAGCCA… eOX52 ORF7a
10 CASSQDWSGTAYNEQFF+TCRBV04-0… CTGCAGCCAGAAGACTCAGCC… eOX43 ORF8
# ℹ abbreviated name: ¹`TCR Nucleotide Sequence`
# ℹ 3 more variables: `Amino Acids` <chr>, `Start Index in Genome` <dbl>,
# `End Index in Genome` <dbl>
- Q3: How many observations of how many variables are in the data?
This is a rather big data set, so let us start with two “tricks” to handle this, first:
- Write the data back into your data-folder, using the filename peptide-detail-ci.csv.gz, note the appending of .gz, which is automatically recognised and results in gz-compression
- Now, check in your data folder, that you have two files peptide-detail-ci.csv and peptide-detail-ci.csv.gz, delete the former
- Adjust your reading-the-data-code in the “Load Data”-section, to now read in the peptide-detail-ci.csv.gz-file
Click here for hint
Just as you can read a file, you can of course also write a file. Note the filetype we want to write here is csv. If you in the console type e.g. readr::wr and then hit the tab-button, you will see the different functions for writing different filetypes
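The round trip can be tried out safely on a toy data set first. The sketch below writes a made-up tibble to a temporary folder (the file name toy.csv.gz is hypothetical) and reads it back. readr decides on compression from the file extension, so no extra argument is needed.

```r
library(readr)

# Made-up data written to a temporary location, not the lab's data-folder
toy <- tibble::tibble(id = 1:3, value = c("a", "b", "c"))
path <- file.path(tempdir(), "toy.csv.gz")

# The .gz extension makes readr compress on write and decompress on read
write_csv(x = toy, file = path)
toy_back <- read_csv(file = path, show_col_types = FALSE)
toy_back
```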
Then:
- T5: As before, let’s immediately subset the peptide_data to the variables of interest: TCR BioIdentity, Experiment and Amino Acids. Remember to assign this new data to the same peptide_data-variable to avoid cluttering your environment with redundant variables. Bonus: Did you know you can click the Environment pane and see which variables you have?
Once again, before we proceed, this is the data we will carry on with:
peptide_data |>
  sample_n(10)
# A tibble: 10 × 3
Experiment `TCR BioIdentity` `Amino Acids`
<chr> <chr> <chr>
1 eOX49 CASRERGLNTEAFF+TCRBV02-01+TCRBJ01-01 ITEEVGHTDLMAAY
2 eOX52 CAWSVDRPRNEKLFF+TCRBV30-01+TCRBJ01-04 DFLEYHDVR,EDFLEYHD…
3 eJL158 CASSLTDGVGQPQHF+TCRBV27-01+TCRBJ01-05 FLQSINFVR,FLQSINFV…
4 eOX43 CAWSGMVSQVYNSPLHF+TCRBV30-01+TCRBJ01-06 DFLEYHDVR,EDFLEYHD…
5 eJL161 CASSQEFGAGLQLETQYF+TCRBV03-01/03-02+TCRBJ02-05 HTTDPSFLGRY
6 eXL32 CASSSRAPLLNSPLHF+TCRBV12-X+TCRBJ01-06 TLIGDCATV
7 eOX52 CASSFLAGTNEQFF+TCRBV05-04+TCRBJ02-01 ELYSPIFLI,LYSPIFLI…
8 eEE224 CASSLRAGGTDTQYF+TCRBV05-01+TCRBJ02-03 KLSYGIATV
9 eOX52 CASSGGTGGNF+TCRBV12-03/12-04+TCRBJ02-04 LLDDFVEII,LLLDDFVEI
10 eEE226 CASRWNRLYEQYF+TCRBV02-01+TCRBJ02-07 RQLLFVVEV
Q4: Is this tidy data? Why/why not?
T6: See if you can find a way to create the below data, from the above
peptide_data |>
  sample_n(size = 10)
# A tibble: 10 × 5
Experiment CDR3b V_gene J_gene `Amino Acids`
<chr> <chr> <chr> <chr> <chr>
1 eHO131 CASSLGPSGGVSSYNEQFF TCRBV13-01 TCRBJ02-01 FVCNLLLLFV,LLFVTVYSHL,T…
2 eHH173 CASRETYEQYF TCRBV02-01 TCRBJ02-07 YLDAYNMMI
3 eXL30 CASSEGKLYEQYF TCRBV06-01 TCRBJ02-07 AEAELAKNVSL,AELAKNVSLDN…
4 eHO130 CASSQVWEGYNEQFF TCRBV04-02 TCRBJ02-01 FLQSINFVR,FLQSINFVRI,FL…
5 eXL31 CASSTETSAAGGWRDTQYF TCRBV25-X TCRBJ02-03 MPASWVMRI
6 ePD85 CASSIGQGNTYEQYF TCRBV19-01 TCRBJ02-07 SEHDYQIGGYTEKW,YQIGGYTE…
7 eEE228 CASSPGLAGGGTYNEQFF TCRBV05-01 TCRBJ02-01 LEPLVDLPI
8 eEE240 CASSILPPETQYF TCRBV19-01 TCRBJ02-05 FLNGSCGSV
9 eXL27 CASSYFPGETALQLYEQYF TCRBV28-01 TCRBJ02-07 FVDGVPFVV
10 eAV88 CSARDGDSGTGELFF TCRBV20-X TCRBJ02-02 FLQSINFVR,FLQSINFVRI,FL…
Click here for hint
First: Compare the two datasets and identify what happened? Did any variables “disappear” and did any “appear”? Ok, so this is a bit tricky, but perhaps there is a function to separate a composite (untidy) column into a set of new variables based on a separator? But what is a separator? Just like when you read a file with Comma Separated Values, a separator denotes how a composite string is divided into fields. So look for such repeated values, which seem to indeed separate such fields. Also, be aware, that a character, which can mean more than one thing, may need to be “escaped” using two initial backslashes, i.e. “\\x”, where x denotes the character needing to be “escaped”
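The following sketch shows the separator idea on made-up data, where the composite column uses | between its fields. The data and column names are assumptions for the example. Because | has a special meaning in regular expressions, it is escaped with two backslashes.

```r
library(tidyr)

# Made-up composite column: a value and a status glued together with "|"
toy <- tibble::tibble(sample  = c("s1", "s2"),
                      reading = c("1.5|ok", "2.7|fail"))

# "|" means "or" in a regular expression, so it is escaped as "\\|"
toy_sep <- toy |>
  separate(col = reading,
           into = c("value", "status"),
           sep = "\\|")
toy_sep
```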
- T7: Add a variable, which counts how many peptides are in each observation of Amino Acids
Click here for hint
We have been working with the stringr-package, perhaps it contains a function to somehow count the number of occurrences of a given character in a string? Again, remember you can type e.g. stringr::str_ and then hit the tab-button to see relevant functions
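As a small sketch of counting characters, the example below counts semicolons in made-up strings. Note that the number of fields in a delimited string is one more than the number of delimiters, which is worth keeping in mind when counting items rather than separators.

```r
library(stringr)

# Made-up strings with ";" as delimiter
x <- c("a;b;c", "a", "a;b")

# Number of ";" per string
n_sep <- str_count(string = x, pattern = ";")
n_sep

# Number of fields is the number of separators plus one
n_fields <- n_sep + 1
n_fields
```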
peptide_data |>
  sample_n(size = 10)
# A tibble: 10 × 6
Experiment CDR3b V_gene J_gene `Amino Acids` n_peptides
<chr> <chr> <chr> <chr> <chr> <dbl>
1 eOX52 CASSYSAGDPYNEQFF TCRBV06-06 TCRBJ0… FVDGVPFVV 1
2 eEE224 CASLGGYSYEQYF TCRBV02-01 TCRBJ0… FVCNLLLLFV,L… 3
3 eXL27 CSVATGVSGNTIYF TCRBV29-01 TCRBJ0… AFLLFLVLI,FL… 11
4 eXL31 CASSYRTGGNQPQHF TCRBV06-02/06-03 TCRBJ0… LSPRWYFYY,SP… 2
5 eXL30 CASSLANPHGYTF TCRBV12-03/12-04 TCRBJ0… AFLLFLVLI,FL… 11
6 eOX54 CASGAGVRETQYF TCRBV12-03/12-04 TCRBJ0… IQYIDIGNY 1
7 eXL30 CSARLTANTGELFF TCRBV20-X TCRBJ0… FVDGVPFVV 1
8 eQD112 CASSLRGYNEQFF TCRBV13-01 TCRBJ0… AFLLFLVLI,FL… 11
9 eHO134 CASSVGVGYEQYF TCRBV09-01 TCRBJ0… GTITSGWTF 1
10 ePD83 CASSMGQGARTEAFF TCRBV19-01 TCRBJ0… SEHDYQIGGYTE… 3
- T8: Re-create the following plot
Q4: What is the maximum number of peptides assigned to one observation?
T9: Using the str_c- and the seq-functions, re-create the below
[1] "peptide_1" "peptide_2" "peptide_3" "peptide_4" "peptide_5"
Click here for hint
If you’re uncertain on how a function works, try going into the console and in this case e.g. type str_c("a", "b") and seq(from = 1, to = 3) and see if you can combine these?
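The key point is that str_c() is vectorised: a single string is recycled against a vector. A minimal sketch with a made-up prefix:

```r
library(stringr)

# The single prefix "id_" is recycled along the sequence 1, 2, 3
ids <- str_c("id_", seq(from = 1, to = 3))
ids
```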
- T10: Use, what you learned about separating in T6 and the vector-of-strings you created in T9 adjusted to the number from Q4 to create the below data
Click here for hint
In the console, write ?separate and think about how you used it earlier. Perhaps you can not only specify a vector to separate into, but also specify a function, which returns a vector?
peptide_data |>
  sample_n(size = 10)
# A tibble: 10 × 18
Experiment CDR3b V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 eOX52 CASSVVGNTEA… TCRBV… TCRBJ… FLPRVFSAV <NA> <NA> <NA>
2 eHO138 CASSHQSSYEQ… TCRBV… TCRBJ… AYKTFPPT… KTFPPTEPK <NA> <NA>
3 eQD125 CASSLGADSQE… TCRBV… TCRBJ… SYFTSDYY… VLHSYFTS… YFTSDYYQ… <NA>
4 eAV88 CASSYQGINEQ… TCRBV… TCRBJ… FVFKNIDGY <NA> <NA> <NA>
5 eXL31 CASSPGAGTRD… TCRBV… TCRBJ… YIFFASFYY <NA> <NA> <NA>
6 eEE240 CASSAAQPGQN… TCRBV… TCRBJ… AFLLFLVLI FLAFLLFLV FYLCFLAFL FYLCFLAF…
7 eEE228 CASSHTNEQFF TCRBV… TCRBJ… FLWLLWPVT FLWLLWPV… LWLLWPVTL LWPVTLACF
8 eOX46 CASSALAGRNT… TCRBV… TCRBJ… VLWAHGFEL <NA> <NA> <NA>
9 eOX52 CASSLSYNEQFF TCRBV… TCRBJ… FGEVFNAT… FNATRFAS… GEVFNATRF NATRFASVY
10 eEE224 CASSLYPLDQP… TCRBV… TCRBJ… FLQSINFVR FLQSINFV… FLYLYALV… GLEAPFLY…
# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,
# peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,
# peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>
Q5: Now, presumably you got a warning, discuss in your group why that is?
Q6: With respect to peptide_n, discuss in your group, if this is wide- or long-data?
Now, finally we will use what we prepared for today, data pivoting. There are two functions, namely pivot_wider() and pivot_longer(). Also, now, we will use a trick for developing a data pipeline while working with new functions, that one might not be completely comfortable with. You have seen the sample_n()-function several times above and we can use that to randomly sample n observations from data. This we can utilise to work with a smaller data set in the development phase and once we are ready, we can increase this n gradually to see if everything continues to work as anticipated.
T11: Using the peptide_data, run a few sample_n()-calls with varying degree of n to make sure, that you get a feeling for what is going on
T12: From the peptide_data data above, with peptide_1, peptide_2, etc. create this data set using one of the data-pivoting functions. Remember to start initially with sampling a smaller data set and then work on that first! Also, once you’re sure you’re good to go, reuse the peptide_data-variable as we don’t want huge redundant data sets floating around in our environment
Click here for hint
If the pivoting is not clear at all, then do what I do, create some example data:
my_data <- tibble(
  id = str_c("id_", 1:10),
  var_1 = round(rnorm(10), 1),
  var_2 = round(rnorm(10), 1),
  var_3 = round(rnorm(10), 1))
…and then play around with that. A small set like the one above is easy to handle, so perhaps start with that and then pivot back and forth a few times using the pivot_wider()-/pivot_longer()-functions. Use the View()-function to inspect and get a better overview of the results of pivoting.
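One such round trip could look like the sketch below, here on an even smaller example with fixed values so the result is reproducible. The names variable and value for the new columns are arbitrary choices for the illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Small example data with fixed values
my_data <- tibble(id    = str_c("id_", 1:3),
                  var_1 = c(0.1, 0.2, 0.3),
                  var_2 = c(1.1, 1.2, 1.3))

# Wide to long: one row per id/variable combination
my_data_long <- my_data |>
  pivot_longer(cols = starts_with("var_"),
               names_to = "variable",
               values_to = "value")

# Long back to wide: one column per variable again
my_data_wide <- my_data_long |>
  pivot_wider(names_from = variable,
              values_from = value)

my_data_long
my_data_wide
```

Comparing the dimensions of my_data, my_data_long and my_data_wide is a quick way to check that the pivoting did what you expected.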
peptide_data |>
  sample_n(10)
# A tibble: 10 × 7
Experiment CDR3b V_gene J_gene n_peptides peptide_n peptide
<chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 eOX52 CASSLAGPGELFF TCRBV07-06 TCRBJ… 1 peptide_… <NA>
2 eEE240 CASSLALGGDNYGYTF TCRBV12-X TCRBJ… 1 peptide_… <NA>
3 eEE226 CASSQDPLGGGASYEQYF TCRBV04-01 TCRBJ… 6 peptide_6 VLPFND…
4 eEE240 CASSIDASMNTEAFF TCRBV19-01 TCRBJ… 3 peptide_9 <NA>
5 eMR14 CASSLNQAEAFF TCRBV28-01 TCRBJ… 2 peptide_… <NA>
6 eQD114 CASSDRAGTDTQYF TCRBV27-01 TCRBJ… 1 peptide_7 <NA>
7 eOX43 CASSLGGTGNTIYF TCRBV28-01 TCRBJ… 13 peptide_8 QSINFV…
8 eOX52 CASSLPHATNEKLFF TCRBV11-03 TCRBJ… 1 peptide_4 <NA>
9 eEE226 CASSFGGNEQFF TCRBV12-03… TCRBJ… 1 peptide_… <NA>
10 eOX54 CATSKQRAGGNGYTF TCRBV15-01 TCRBJ… 7 peptide_… <NA>
Q7: You will see some NAs in the peptide-variable, discuss in your group from where these arise?
Q8: How many rows and columns now and how does this compare with Q3? Discuss why/why not it is different?
T13: Now, lose the redundant variables n_peptides and peptide_n and also get rid of the NAs in the peptide-column and make sure, that we only have unique observations, i.e. there are no repeated rows/observations
peptide_data |>
  sample_n(10)
# A tibble: 10 × 5
Experiment CDR3b V_gene J_gene peptide
<chr> <chr> <chr> <chr> <chr>
1 eEE240 CASSLGPTYEQYF TCRBV11-03 TCRBJ02-07 IDFYLCFLAF
2 eEE226 CASSTDPNRDLNTEAFF TCRBV19-01 TCRBJ01-01 VLSFCAFAV
3 eAV88 CASSRLAGVREQYF TCRBV05-06 TCRBJ02-07 FVDGVPFVV
4 eEE228 CASSEFYPGQGYTGELFF TCRBV25-01 TCRBJ02-02 LIVNSVLLFL
5 eXL27 CASSLGTPTYNEQFF TCRBV07-09 TCRBJ02-01 LLFLVLIML
6 eEE228 CASSLLSGNTEAFF TCRBV27-01 TCRBJ01-01 WLLWPVTLA
7 eLH42 CASSLGASMNTEAFF TCRBV13-01 TCRBJ01-01 AFLLFLVLI
8 eOX49 CSAPTGTTYEQYF TCRBV20-X TCRBJ02-07 SLIDFYLCFL
9 eLH41 CASSLAGLAADTQYF TCRBV27-01 TCRBJ02-03 LQSINFVRI
10 eXL37 CASSLETVDPYEQYF TCRBV07-09 TCRBJ02-07 YLCFLAFLL
- Q8: Now how many rows and columns and is this data tidy? Discuss in your group why/why not?
Again, we turn to the stringr-package, as we need to make sure that the sequence data does indeed only contain valid characters. There are a total of 20 standard proteinogenic amino acids, which we symbolise using ARNDCQEGHILKMFPSTWYV.
- T14: Use the str_detect()-function to filter the CDR3b and peptide variables using a pattern of [^ARNDCQEGHILKMFPSTWYV] and then play with the negate-parameter to see what happens
Click here for hint
Again, try to play a bit around with the function in the console, type e.g. str_detect(string = "ARND", pattern = "A") and str_detect(string = "ARND", pattern = "C") and then recall, that the filter-function requires a logical vector, i.e. a vector of TRUE and FALSE to filter the rows
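To see how a negated character class behaves together with filter(), here is a minimal sketch on made-up DNA strings, where the valid alphabet is assumed to be ACGT. The pattern [^ACGT] matches any character that is not one of those four letters.

```r
library(dplyr)
library(stringr)

# Made-up sequences; "ACGN" contains a character outside the alphabet
toy <- tibble(dna = c("ACGT", "ACGN", "TTGA"))

# Rows containing at least one character outside the alphabet
toy_invalid <- toy |>
  filter(str_detect(string = dna, pattern = "[^ACGT]"))

# negate = TRUE flips the logical vector: rows with valid characters only
toy_valid <- toy |>
  filter(str_detect(string = dna, pattern = "[^ACGT]", negate = TRUE))

toy_invalid
toy_valid
```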
- T15: Add two new variables to the data, k_CDR3b and k_peptide each signifying the length of the respective sequences
Click here for hint
Again, we’re working with strings, so perhaps there is a package of interest and perhaps in that package, there is a function, which can get the length of a string?
peptide_data |>
  sample_n(10)
# A tibble: 10 × 7
Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide
<chr> <chr> <chr> <chr> <chr> <int> <int>
1 eLH47 CASSLSPGESLSSYNSPLHF TCRBV07-08 TCRBJ… FLYLYA… 20 10
2 eXL27 CASSPGQGAYSEQFF TCRBV02-01 TCRBJ… VQELYS… 15 10
3 eOX49 CSATLAGGQETQYF TCRBV20-X TCRBJ… FYLCFL… 14 9
4 eOX43 CASSFHGEQYF TCRBV12-03/… TCRBJ… NVFAFP… 11 10
5 eXL27 CASSQQGGTGELFF TCRBV14-01 TCRBJ… FVCNLL… 14 10
6 eOX43 CATSRVPSGRASYNEQFF TCRBV15-01 TCRBJ… KEIIFL… 18 11
7 eOX54 CASSPVRQNSYEQYF TCRBV07-09 TCRBJ… LLLDDF… 15 9
8 eOX52 CASSQYPGAGLDEQYF TCRBV14-01 TCRBJ… FVDGVP… 16 9
9 eEE226 CASSSQPIQLYEQYF TCRBV27-01 TCRBJ… IDFYLC… 15 10
10 eOX46 CASSYSLAGGTYEQYF TCRBV06-05 TCRBJ… FLAFLL… 16 9
- T16: Re-create this plot
Q9: What is the most predominant length of the CDR3b-sequences?
T17: Re-create this plot
Q10: What is the most predominant length of the peptide-sequences?
Q11: Discuss in your group, if this data set is tidy or not?
peptide_data |>
  sample_n(10)
# A tibble: 10 × 7
Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide
<chr> <chr> <chr> <chr> <chr> <int> <int>
1 eMR14 CASSQDAEGRGLAKNIQYF TCRBV04-03 TCRBJ02-… QSINFV… 19 9
2 eLH48 CASSIIQGSYNSPLHF TCRBV07-02 TCRBJ01-… TTLPKG… 16 9
3 eOX49 CASSVTGGTNEKLFF TCRBV07-03 TCRBJ01-… MIELSL… 15 10
4 eOX49 CSVEDLLQGNYGYTF TCRBV29-01 TCRBJ01-… FLAFLL… 15 9
5 eOX46 CASGDLAGGPNNEQFF TCRBV12-05 TCRBJ02-… TLACFV… 16 10
6 eEE240 CASTHDWEDTQYF TCRBV06-X TCRBJ02-… NVFAFP… 13 9
7 eOX52 CASRTTSGGYEQYF TCRBV27-01 TCRBJ02-… LYSPIF… 14 9
8 eOX46 CASSYRSSRGRPEAFF TCRBV28-01 TCRBJ01-… SLIDFY… 16 10
9 eAV88 CASSQTGDRLYEQYF TCRBV04-01 TCRBJ02-… VLNDIL… 15 9
10 eOX46 CASSFPTAGTNNGEQFF TCRBV07-08 TCRBJ02-… FLAFLL… 17 9
Creating one data set from two data sets
Before we move on to using the family of *_join-functions you prepared for today, we will just take a quick peek at the meta data again:
meta_data |>
  sample_n(10)
# A tibble: 10 × 11
Experiment Cohort Age Gender Race A1 A2 B1 B2 C1 C2
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 eQD116 COVID-19-C… 66 F <NA> "A*0… "A*1… "B*3… "B*3… "C*0… "C*0…
2 eQD138 COVID-19-C… NA <NA> <NA> "A*0… "A*0… "B*3… "B*4… "C*0… "C*1…
3 eNL192 COVID-19-C… NA <NA> <NA> "" "" "" "" "" ""
4 eQD113 COVID-19-C… 36 M <NA> "A*0… "A*1… "B*5… "B*5… "C*0… "C*1…
5 eQD117 COVID-19-C… 70 F <NA> "A*0… "A*2… "B*3… "B*4… "C*0… "C*0…
6 eHO140 COVID-19-C… NA <NA> <NA> "" "" "" "" "" ""
7 ePD83 Healthy (N… 29 F Asian "A*0… "A*0… "B*1… "B*4… "C*0… "C*0…
8 eQD124 COVID-19-B… 40 F White "A*0… "A*0… "B*1… "B*5… "C*0… "C*0…
9 eDH105 COVID-19-C… 32 F <NA> "A*2… "A*2… "B*4… "B*4… "C*0… "C*0…
10 eLH47 COVID-19-C… 35 F White "A*0… "A*0… "B*0… "B*0… "C*0… "C*0…
Remember you can scroll in the data.
- Q12: Discuss in your group, if this data with respect to the A1-, A2-, B1-, B2-, C1- and C2-variables is a wide- or a long-data format?
As with the peptide_data, we will now have to use data pivoting again. I.e.:
- T18: use either the pivot_wider- or pivot_longer-function to create the following data:
meta_data |>
  sample_n(10)
# A tibble: 10 × 7
Experiment Cohort Age Gender Race Gene Allele
<chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 ePD100 COVID-19-Convalescent 66 M <NA> B1 ""
2 eEE224 Healthy (No known exposure) 24 M White B2 "B*40:01"
3 eQD131 COVID-19-Exposed NA <NA> <NA> A2 "A*32:01:01"
4 eLH42 COVID-19-Convalescent 63 M <NA> B1 "B*07:02:01"
5 eOX49 Healthy (No known exposure) 21 M White A2 "A*26:01"
6 eNL192 COVID-19-Convalescent NA <NA> <NA> C2 ""
7 eQD112 COVID-19-Convalescent 65 M <NA> C2 "C*07:02:01"
8 eMR20 COVID-19-B-Non-Acute 37 M White A2 "A*26:01:01"
9 eQD108 COVID-19-Convalescent NA <NA> <NA> A2 "A*68:01:02"
10 eJL146 Healthy (No known exposure) 30 M White C2 "C*08:02"
Remember, what we are aiming for here, is to create one data set from two. So:
- Q13: Discuss in your group, which variable(s?) define the same observations between the peptide_data and the meta_data?
Once you have agreed upon Experiment, then use that knowledge to subset the meta_data to the variables-of-interest:
meta_data |>
  sample_n(10)
# A tibble: 10 × 2
Experiment Allele
<chr> <chr>
1 eMR14 "C*07:02:01"
2 eAV105 "C*07:02:01"
3 eDH113 "C*07:01"
4 eLH48 "A*03:01:01"
5 eEE226 "C*04:01"
6 ePD91 ""
7 eAV91 "B*51:01"
8 ePD90 ""
9 eQD139 "B*56:01:01"
10 eHO126 "C*07:02:01"
Use the View()-function again, to look at the meta_data - Notice something? Some alleles are e.g. A*11:01, whereas others are B*51:01:02. You can find information on why, by visiting Nomenclature for Factors of the HLA System.
Long story short, we only want to include Field 1 (allele group) and Field 2 (Specific HLA protein). You have prepared the stringr-package for today. See if you can find a way to reduce e.g. B*51:01:02 to B*51:01 and then create a new variable Allele_F_1_2 accordingly, while also removing the ...x (where x is a number) subscripts from the Gene-variable (It is an artifact from having the data in a wide format, where you cannot have two variables with the same name) and also, remove any NAs and ""s, denoting empty entries.
Click here for hint
There are several ways this can be achieved, the easiest being to consider if perhaps a part of the string based on indices could be of interest. This term “a part of a string” is called a substring, perhaps the stringr-package contains a function to work with substrings? In the console, type stringr:: and hit tab. This will display the functions available in the stringr-package. Scroll down and find the functions starting with str_ and look for one, which might be relevant and remember you can use ?function_name to get more information on how a given function works.
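Index-based substrings can be illustrated on made-up date strings, where the first seven characters hold the year and month. The dates are an assumption for the example. The same idea applies to any string whose fields have a fixed width.

```r
library(stringr)

# Made-up dates in a fixed-width format
dates <- c("2024-05-17", "1999-12-31")

# Keep characters 1 through 7, i.e. year and month
year_month <- str_sub(string = dates, start = 1, end = 7)
year_month
```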
- T19: Create the following data, according to specifications above:
meta_data |>
  sample_n(10)
# A tibble: 10 × 3
Experiment Allele Allele_F_1_2
<chr> <chr> <chr>
1 eJL158 B*40:02:01 B*40:02
2 eAM23 A*11:01:01 A*11:01
3 eJL160 C*05:01:01 C*05:01
4 eJL154 A*02:01:01 A*02:01
5 eMR15 A*03:01:01 A*03:01
6 eHO133 C*12:03:01 C*12:03
7 eQD139 C*01:02:01 C*01:02
8 eLH59 C*03:04:01 C*03:04
9 eQD138 A*02:01:01 A*02:01
10 eMR14 C*07:02:01 C*07:02
The asterisk, i.e. *, is a rather annoying character because of ambiguity, so:
- T20: Clean the data a bit more by removing the asterisk and redundant variables:
meta_data |>
  sample_n(size = 10)
# A tibble: 10 × 2
Experiment Allele
<chr> <chr>
1 eDH113 C16:01
2 eLH51 A24:07
3 eJL160 A01:01
4 eJL153 B07:02
5 eQD121 C07:01
6 eJL158 A02:01
7 eEE243 C04:01
8 eQD118 B51:01
9 eJL154 A02:01
10 eOX43 C07:04
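The ambiguity arises because, in a regular expression, the asterisk is a quantifier rather than a literal character, so a bare "*" is not a valid pattern on its own. A minimal sketch of two ways around this, using a made-up vector (alleles is a hypothetical name, not part of the lab data):

```r
library(stringr)

# Hypothetical example vector of Field 1 + Field 2 alleles
alleles <- c("A*02:01", "B*07:02", "C*16:01")

# Option 1: escape the asterisk so the regex engine reads it literally
escaped <- str_remove(alleles, "\\*")

# Option 2: skip regex interpretation entirely with fixed()
literal <- str_remove(alleles, fixed("*"))

escaped
literal
```

Both calls give the same result here. fixed() is arguably easier to read when you only ever want to match literal text.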
Click here for hint 1
Again, the stringr-package may come in handy. Perhaps there is a function to remove one or more such pesky characters?
Click here for hint 2
Getting a weird error? Recall that character ambiguity needs to be “escaped”; you did this somehow earlier on…
Recall the peptide_data?
peptide_data |>
  sample_n(10)
# A tibble: 10 × 7
Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide
<chr> <chr> <chr> <chr> <chr> <int> <int>
1 eXL30 CASSLALLDNEQFF TCRBV07-06 TCRBJ… YLCFLA… 14 9
2 eEE226 CASSYTDRSGYTF TCRBV06-02/06-… TCRBJ… LEYHDV… 13 10
3 eXL31 CASSIAGNTYEQYF TCRBV19-01 TCRBJ… WICLLQ… 14 9
4 eAV88 CSVVYRDGYTF TCRBV29-01 TCRBJ… FLWLLW… 11 10
5 eEE228 CATSDLPGGFNTGELFF TCRBV24-01 TCRBJ… LNDLCF… 17 10
6 eMR20 CASSQGGSLQPQHF TCRBV16-01 TCRBJ… VLHSYF… 14 10
7 eEE228 CASSQAVNTEAFF TCRBV04-03 TCRBJ… IELSLI… 13 10
8 eOX43 CASSQEGGLAGVHEQYF TCRBV03-01/03-… TCRBJ… LIDFYL… 17 9
9 eXL27 CASSEGGALGHF TCRBV02-01 TCRBJ… INVFAF… 12 10
10 eOX46 CASSAGTSSYEQYF TCRBV07-03 TCRBJ… LLFLVL… 14 9
- T21: Create a dplyr-pipeline, starting with the peptide_data, which joins it with the meta_data, and remember to make sure that you get only unique observations of rows. Save this data into a new variable named peptide_meta_data (if you get a warning, discuss in your group what it means)
Click here for hint 1
Which family of functions do we use to join data? Also, perhaps it would be prudent here to start by working on a smaller data set. Recall that we could sample a number of rows, yielding a smaller development data set.
Click here for hint 2
You should get a data set of more than 3,000,000 rows. Take a moment to consider how that would have been to work with in Excel! Also, in case the servers are not liking this, you can consider subsetting the peptide_data to e.g. 100,000 or 10,000 rows prior to joining.
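As an illustration of what a join does when the key is not unique on either side, here is a minimal sketch on two made-up tibbles (toy_peptides and toy_alleles are hypothetical, and the choice of inner_join() is an assumption, not necessarily the intended solution). The relationship argument assumes dplyr 1.1.0 or later; without it, dplyr warns about a many-to-many relationship:

```r
library(dplyr)

# Hypothetical toy tables sharing the key variable Experiment
toy_peptides <- tibble(
  Experiment = c("e1", "e1", "e2", "e4"),
  peptide    = c("LLFLVLIML", "LLFLVLIML", "GILGFVFTL", "KLWAQCVQL")
)
toy_alleles <- tibble(
  Experiment = c("e1", "e1", "e2"),
  Allele     = c("A02:01", "B07:02", "A02:01")
)

toy_joined <- toy_peptides |>
  # Each e1 peptide row matches both e1 allele rows; e4 has no match and is dropped
  inner_join(toy_alleles, by = "Experiment", relationship = "many-to-many") |>
  # Remove the duplicated rows created by the repeated e1 peptide
  distinct()

toy_joined
```

With left_join() instead, the unmatched e4 row would be kept with an NA allele, so which join to use depends on whether such rows are of interest.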
peptide_meta_data |>
  sample_n(10)
# A tibble: 10 × 8
Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide Allele
<chr> <chr> <chr> <chr> <chr> <int> <int> <chr>
1 eMR14 CASSPGQGSSPLHF TCRBV0… TCRBJ… FLYLYA… 14 10 A03:01
2 eEE228 CSARPLGPTSGRIYEQYF TCRBV2… TCRBJ… GYINVF… 18 10 B35:03
3 eOX52 CASLAGNSYEQYF TCRBV0… TCRBJ… FVCNLL… 13 10 B15:17
4 eOX46 CASSQAGTGTGVYEQYF TCRBV0… TCRBJ… LLTDEM… 17 10 A02:01
5 eOX43 CATRLGPYYEQYF TCRBV2… TCRBJ… IDFYLC… 13 10 C03:04
6 eEE240 CASSPFLLDTQYF TCRBV0… TCRBJ… AFLLFL… 13 9 C03:04
7 eJL161 CASLAGTSAWETQYF TCRBV1… TCRBJ… YLYALV… 15 9 A02:01
8 eQD112 CASSLVGSNYGYTF TCRBV0… TCRBJ… NGVEGF… 14 9 C04:01
9 eEE226 CASSLGREQFF TCRBV0… TCRBJ… APKEII… 11 8 C07:02
10 eOX52 CASSLLSPGGYNEQFF TCRBV2… TCRBJ… TVLSFC… 16 9 B40:01
Analysis
Now that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:
- What characterises the peptides binding to the HLAs?
- What characterises T-cell Receptors binding to the pMHC-complexes?
Peptides binding to HLA
As we have touched upon multiple times, R is very flexible and naturally you can also create sequence logos. Finally, let us create a binding motif using the package ggseqlogo (More info here).
- T22: Subset the final peptide_meta_data-data to A02:01 and unique observations of peptides of length 9 and re-create the below sequence logo
Click here for hint
You can pipe a vector of peptides into ggseqlogo, but perhaps you first need to pull that vector from the relevant variable in your tibble? Also, before that, you will need to make sure that you are only looking at peptides of length 9
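The general pattern in the hint can be sketched on a made-up tibble as follows. toy_pmd and its peptides are hypothetical stand-ins for peptide_meta_data, and the plotting step assumes the ggseqlogo package is installed, so it is guarded accordingly:

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for peptide_meta_data
toy_pmd <- tibble(
  peptide = c("LLFLVLIML", "LLFLVLIML", "YLQPRTFLL",
              "KLWAQCVQL", "GILGFVFTLK", "TPRVTGGGAM"),
  Allele  = c("A02:01", "A02:01", "A02:01",
              "A02:01", "A02:01", "B07:02")
) |>
  mutate(k_peptide = str_length(peptide))

peptides_9mer <- toy_pmd |>
  # Keep one allele and one peptide length, so all sequences align
  filter(Allele == "A02:01", k_peptide == 9) |>
  # One row per unique peptide
  distinct(peptide) |>
  # Extract the variable as a plain character vector
  pull(peptide)

# Draw the logo only if ggseqlogo is available (an assumption about your setup)
if (requireNamespace("ggseqlogo", quietly = TRUE)) {
  print(ggseqlogo::ggseqlogo(peptides_9mer))
}
```

Filtering on length before plotting matters because a sequence logo summarises each position across sequences, which only makes sense when all sequences have the same length.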
- T23: Repeat for e.g. B07:02 or another of your favorite alleles
Now, let’s take a closer look at the sequence logo:
- Q14: Which positions in the peptide determine binding to HLA?
Click here for hint
Recall your Introduction to Bioinformatics course? And/or perhaps ask your fellow group members if they know?
CDR3b-sequences binding to pMHC
- T24: Subset the peptide_meta_data, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:
- Q15: In your group, discuss what you see?
- T25: Play around with other combinations of k_CDR3b, Allele, and peptide and inspect how the logo changes
Disclaimer: In this data set, we only get: A given CDR3b was found to recognise a given peptide in a given subject and that subject had a given haplotype - Something’s missing… Perhaps if you have had immunology, then you can spot it? There is a trick to get around this missing information, but that is beyond the scope of what we are working with here.
Epilogue
That’s it for today - I know this is overwhelming now, but commit to it and you WILL be plenty rewarded! I hope today was at least a glimpse into the flexibility and capabilities of using tidyverse for applied Bio Data Science
…also, notice something? We spent maybe 80% of the time here on data wrangling, and once we were good to go, the analysis wasn’t that time-consuming. That’s often the way it ends up going: you’ll spend a lot of time on data handling, and getting the tidyverse toolbox in your toolbelt will allow you to be so much more efficient in your data wrangling, so you can get to the fun part as quickly as possible!
Today’s Assignment
After today, we are halfway through the labs of the course, so now is a good time to spend some time recalling what we have been over and practising writing a reproducible Quarto-report.
Your group assignment today is to condense the exercises into a group micro-report! Talk together and figure out how to distill the exercises from today into one small end-to-end runnable reproducible micro-report. DO NOT include ALL of the exercises, but rather include as few steps as possible to arrive at your results. Be very concise!
But WHY? WHY are you not specifying exactly what we need to hand in? Because we are training to take independent decisions, which is crucial in applied bio data science, so take a look at the combined group code, select relevant sections and condense - If you don’t make it all the way through the exercises, then condense and present what you were able to arrive at! What do you think is central/important/indispensable? Also, these hand-ins are NOT for us to evaluate you, but for you to train creating products and to get feedback on your progress!
IMPORTANT: Remember to check the ASSIGNMENT GUIDELINES
…and as always - Have fun!