Lab 5: Data Wrangling II

Published

2026

Package(s)

Schedule

Learning Materials

Please prepare the following materials

Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book

Learning Objectives

A student who has met the objectives of the session will be able to:

  • Understand and apply the various str_*() functions for string manipulation
  • Understand and apply the family of *_join() functions for combining data sets
  • Understand and apply pivot_wider() and pivot_longer()
  • Use factors in the context of plotting categorical data with ggplot

Exercises

Prologue

Today will not be easy! But please try to remember Hadley’s words of advice:

  • “The bad news is, whenever you’re learning a new tool, for a long time, you’re going to suck! It’s gonna be very frustrating! But the good news is that that is typical and something that happens to everyone and it’s only temporary! Unfortunately, there is no way to going from knowing nothing about the subject to knowing something about a subject and being an expert in it without going through a period of great frustration and much suckiness! Keep pushing through!” - H. Wickham (dplyr tutorial at useR 2014, 4:10 - 4:48)

Intro

We are upping the game here, so expect to get stuck at some of the questions. Remember - Discuss with your group how to solve the task, revisit the materials you prepared for today and naturally, the TAs and I are happy to nudge you in the right direction. Finally, remember… Have fun!

Remember what you have worked on so far:

  • RStudio
  • Quarto
  • ggplot
  • filter
  • arrange
  • select
  • mutate
  • group_by
  • summarise
  • The pipe and creating pipelines
  • stringr
  • joining data
  • pivoting data

That’s quite a lot! Well done - You’ve come quite far already! Remember to think about the above tools in the following as we will synthesise your learnings so far into an analysis!

Background

In the early 2020s, the world was hit by the coronavirus disease 2019 (COVID-19) pandemic. The pandemic was caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In Denmark, the first case was confirmed on 27 February 2020.

While initially very little was known about the SARS-CoV-2 virus, we did know the general pathology of viruses. Briefly, the virus invades the cells and hijacks the intra-cellular machinery. Using the hijacked machinery, components for new virus particles are produced, eventually being packed into the viral envelope and released from the infected cell. Some of these components, viral proteins, are broken down into smaller fragments called peptides by the proteasome. These peptides are transported into the endoplasmic reticulum by the Transporter Associated with antigen Processing (TAP) protein complex. Here, aided by chaperones, they are bound to the Major Histocompatibility Complex class I (MHC-I) and then, via the Golgi apparatus, they finally get displayed on the surface of the cells. Note, in humans, MHC is also called Human Leukocyte Antigen (HLA) and represents the most diverse genes. Each of us has a total of 6 HLA class I alleles, 3 from the maternal and 3 from the paternal side. These are further divided into 3 genes, HLA-A, HLA-B and HLA-C, and the combination of these constitutes the HLA-haplotype for an individual. Once the peptide is bound to the MHC class I at the cell surface and exposed, the MHC-I peptide complex can be recognised by CD8+ Cytotoxic T-Lymphocytes (CTLs) via the T-cell Receptor (TCR). If a cell displays peptides of viral origin, the CTL gets activated and via a cascade induces apoptosis (programmed cell death) of the infected cell. The process is summarised in the figure below (McCarthy and Weinberg 2015).

The data we will be working with today contains data on sequenced T-cell receptors, viral antigens, HLA-haplotypes and clinical meta data for a cohort:

  • “A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2” (Nolan et al. 2020).

Your Task Today

Today, we will emulate the situation, where you are working as a Bioinformatician / Bio Data Scientist and you have been given the data and the task of answering these two burning questions:

  1. What characterises the peptides binding to the HLAs?
  2. What characterises T-cell Receptors binding to the pMHC-complexes?

GROUP ASSIGNMENT: Today, your assignment will be to create a micro-report on these 2 questions! (Important, see: how to)

MAKE SURE TO READ THE LAST SECTION ON THE ASSIGNMENT

Getting Started

First, make sure to read and discuss the feedback you got from last week’s assignment!

  1. Then, once again go to the R for Bio Data Science RStudio Cloud Server
  2. Make sure you are in your r_for_bio_data_science project, you can verify this in the upper right corner
  3. In the same place as your r_for_bio_data_science.Rproj file and existing data folder, create a new folder and name it doc
  4. Go to the aforementioned manuscript. Download the PDF and upload it to your new doc folder
  5. Open the PDF and find the link to the data
  6. Go to the data site (Note, you may have to create an account to download, shouldn’t take too long). Find and download the file ImmuneCODE-MIRA-Release002.1.zip (CAREFUL, do not download the superseded files)
  7. Unpack the downloaded file
  8. Find the files peptide-detail-ci.csv and subject-metadata.csv and compress each of them to a .zip file
  9. Upload the compressed peptide-detail-ci.csv.zip and subject-metadata.csv.zip files to your data folder in your RStudio Cloud session
  10. Finally, once again, create a new Quarto document for today’s exercises, containing the sections:
    1. Background
    2. Aim
    3. Load Libraries
    4. Load Data
    5. Data Description
    6. Analysis

Creating the Micro-Report

Background

Feel free to copy paste the one stated in the background-section above

Aim

State the aim of the micro-report, i.e. what are the questions you are addressing?

Load Libraries

Load the libraries needed

Load Data

Read the two data sets into variables peptide_data and meta_data.

Click here for hint

Think about which Tidyverse package deals with reading data and what are the file types we want to read here?

Data Description

It is customary to include a description of the data, helping the reader of the report, i.e. your stakeholder, to get an easy overview

The Subject Meta Data

Let’s take a look at the meta data:

meta_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 30
   Experiment Subject `Cell Type` `Target Type` Cohort          Age Gender Race 
   <chr>        <dbl> <chr>       <chr>         <chr>         <dbl> <chr>  <chr>
 1 eOX43        19830 naive_CD8   C19_cI        Healthy (No …    24 M      White
 2 eLH51         1500 PBMC        C19_cI        COVID-19-Con…    55 M      Asian
 3 eLH44         6501 PBMC        C19_cI        COVID-19-Con…    61 F      <NA> 
 4 eLH50         7954 PBMC        C19_cI        COVID-19-Con…    28 M      <NA> 
 5 eLH47          383 PBMC        C19_cI        COVID-19-Con…    35 F      White
 6 eJL162        6890 PBMC        C19_cI        COVID-19-Con…    61 M      <NA> 
 7 eQD119        5314 PBMC        C19_cI        COVID-19-Con…    51 M      <NA> 
 8 eMR13         2059 PBMC        C19_cI        COVID-19-Con…    NA <NA>   <NA> 
 9 eMR14         2845 PBMC        C19_cI        COVID-19-Con…    NA <NA>   <NA> 
10 eLH46          359 PBMC        C19_cI        COVID-19-Con…    57 F      White
# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,
#   `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,
#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
#   DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>,
#   DRB5...30 <chr>
  • Q1: How many observations of how many variables are in the data?

  • Q2: Are there groupings in the variables, i.e. do certain variables “go together” somehow?

  • T1: Re-create this plot

Read this first:

  • Think about: What is on the x-axis? What is on the y-axis? And also, it looks like we need to do some counting stratified by Cohort and Gender. Recall, that we can stick together a dplyr pipeline with a call to ggplot.
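As a reminder of how a dplyr pipeline can be stuck together with a call to ggplot, here is a minimal sketch on made-up data. Note that toy_data, group and sex are hypothetical names used for illustration only; this is not the plot you are asked to re-create.

```r
library("tidyverse")

# Hypothetical toy data, not the lab's meta_data
toy_data <- tibble(
  group = c("a", "a", "b", "b", "b", "c"),
  sex   = c("F", "M", "F", "F", "M", "M"))

# Count the observations for each combination of group and sex
toy_counts <- toy_data |>
  count(group, sex)

# The counts can be piped directly into ggplot
toy_plot <- toy_counts |>
  ggplot(aes(x = group, y = n, fill = sex)) +
  geom_col(position = "dodge")
```

The intermediate toy_counts variable is only there to make it easy to inspect the counts; in a real pipeline, you could pipe straight from count() into ggplot().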

Does your plot look different somehow? Consider peeking at the hint…

Click here for hint

Perhaps not everyone agrees on how to denote NAs in data. I have seen -99, -11, _ and so on… Perhaps this can be dealt with in the instance we read the data from the file? I.e. in the actual function call to your read_csv() function. Recall, how can we get information on the parameters of a ?function
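To illustrate the idea on something small, the sketch below reads a few lines of made-up CSV content, where missing values are written in two different ways. The content and the missing-value codes are assumptions chosen for the example, so check what your own file actually uses.

```r
library("tidyverse")

# Hypothetical CSV content, where missing values are written as "N/A" or -99
toy_csv <- "id,age,sex
1,34,F
2,-99,M
3,51,N/A"

# I() tells read_csv() that the string is literal data and not a file path
# The na parameter lists all the strings to be interpreted as missing
toy_data <- read_csv(file = I(toy_csv),
                     na = c("", "NA", "N/A", "-99"),
                     show_col_types = FALSE)
```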
  • T2: Re-create this plot

Click here for hint

Perhaps there is a function, which can cut continuous observations into a set of bins?
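If you want to see binning in action before applying it to the meta data, here is a minimal sketch using the base R cut() function on made-up numbers. The variable names and the breaks are assumptions for illustration only.

```r
library("tidyverse")

# Hypothetical toy data with one continuous variable
toy_data <- tibble(value = c(3, 12, 18, 25, 31))

# cut() converts the continuous values into a factor of intervals
toy_data <- toy_data |>
  mutate(value_bin = cut(value, breaks = seq(from = 0, to = 40, by = 10)))
```

By default the intervals are closed on the right, so a value of exactly 10 would end up in (0,10].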
STOP! Make sure you handled how NAs are denoted in the data before proceeding, see hint below T1
  • T3: Look at the data and create yet another plot as you see fit. Also skip the redundant variables Subject, Cell Type and Target Type
meta_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 27
   Experiment Cohort      Age Gender Race  `HLA-A...9` `HLA-A...10` `HLA-B...11`
   <chr>      <chr>     <dbl> <chr>  <chr> <chr>       <chr>        <chr>       
 1 eQD131     COVID-19…    NA <NA>   <NA>  "A*02:01:0… "A*32:01:01" "B*15:01:01"
 2 eHO124     Healthy …    62 M      <NA>  "A*02:01"   "A*03:01"    "B*07:02"   
 3 eOX54      Healthy …    39 F      Afri… "A*02:01"   "A*23:17"    "B*15:03"   
 4 eAV100     COVID-19…    29 F      <NA>  "A*02:01:0… "A*68:01:02" "B*07:02:01"
 5 eQD135     COVID-19…    74 M      <NA>  "A*02:01:0… "A*24:02:01" "B*07:02:01"
 6 ePD79      COVID-19…    79 F      <NA>  "A*02:01:0… "A*02:02:01" "B*07:02:01"
 7 eHO134     COVID-19…    36 M      White "A*01:01:0… "A*24:02:01" "B*49:01:01"
 8 ePD100     COVID-19…    66 M      <NA>  ""          ""           ""          
 9 eEE224     Healthy …    24 M      White "A*02:01"   "A*03:01"    "B*27:05"   
10 eJL154     COVID-19…    35 F      Nati… "A*02:01:0… "A*29:02:01" "B*15:02:01"
# ℹ 19 more variables: `HLA-B...12` <chr>, `HLA-C...13` <chr>,
#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
#   DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>,
#   DRB5...30 <chr>

Now, a classic way of describing a cohort, i.e. the group of subjects used for the study, is the so-called table1 and while we could build this ourselves, this one time, in the interest of exercise focus and time, we are going to “cheat” and use an R-package, like so:

NB!: This may look a bit odd initially, but if you render your document, you should be all good!

library("table1") # <= Yes, this should normally go at the beginning!
meta_data |>
  mutate(Gender = factor(Gender),
         Cohort = factor(Cohort)) |>
  table1(x = formula(~ Gender + Age + Race | Cohort),
         data = _)
                                             COVID-19-Acute     COVID-19-B-Non-Acute  COVID-19-Convalescent  COVID-19-Exposed   Healthy (No known exposure)  Overall
                                             (N=4)              (N=8)                 (N=90)                 (N=3)              (N=39)                       (N=144)
Gender
  F                                          1 (25.0%)          4 (50.0%)             33 (36.7%)             1 (33.3%)          17 (43.6%)                   56 (38.9%)
  M                                          2 (50.0%)          3 (37.5%)             36 (40.0%)             0 (0%)             21 (53.8%)                   62 (43.1%)
  Missing                                    1 (25.0%)          1 (12.5%)             21 (23.3%)             2 (66.7%)          1 (2.6%)                     26 (18.1%)
Age
  Mean (SD)                                  50.7 (17.0)        43.7 (7.74)           51.5 (15.3)            35.0 (NA)          33.3 (9.93)                  44.9 (15.7)
  Median [Min, Max]                          52.0 [33.0, 67.0]  42.0 [33.0, 53.0]     53.0 [21.0, 79.0]      35.0 [35.0, 35.0]  31.0 [21.0, 62.0]            42.0 [21.0, 79.0]
  Missing                                    1 (25.0%)          1 (12.5%)             21 (23.3%)             2 (66.7%)          0 (0%)                       25 (17.4%)
Race
  African American                           1 (25.0%)          0 (0%)                0 (0%)                 0 (0%)             1 (2.6%)                     2 (1.4%)
  White                                      2 (50.0%)          7 (87.5%)             13 (14.4%)             0 (0%)             28 (71.8%)                   50 (34.7%)
  Asian                                      0 (0%)             0 (0%)                3 (3.3%)               0 (0%)             2 (5.1%)                     5 (3.5%)
  Hispanic or Latino/a                       0 (0%)             0 (0%)                1 (1.1%)               0 (0%)             0 (0%)                       1 (0.7%)
  Native Hawaiian or Other Pacific Islander  0 (0%)             0 (0%)                0 (0%)                 1 (33.3%)          0 (0%)                       1 (0.7%)
  Black or African American                  0 (0%)             0 (0%)                0 (0%)                 0 (0%)             3 (7.7%)                     3 (2.1%)
  Mixed Race                                 0 (0%)             0 (0%)                0 (0%)                 0 (0%)             1 (2.6%)                     1 (0.7%)
  Missing                                    1 (25.0%)          1 (12.5%)             73 (81.1%)             2 (66.7%)          4 (10.3%)                    81 (56.3%)

Note how good this looks! If you have ever done a “Table 1” before, you know how painful they can be, especially if something changes in your cohort - Dynamic reporting to the rescue!

Lastly, before we proceed, the meta_data contains HLA data for both class I and class II (see background), but here we are only interested in class I. Recall these are denoted HLA-A, HLA-B and HLA-C, so make sure to remove any non-class I variables, i.e. the ones after these, denoted D-something.

  • T4: Create a new version of the meta_data, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9 becomes A1 and HLA-A...10 becomes A2 and so on for B1, B2, C1 and C2 (Think: How can we rename variables? And here, just do it “manually” per variable). Remember to assign this new data to the same meta_data variable

Click here for hint

Which tidyverse function subsets variables? Perhaps there is a function, which somehow matches a set of variables? And perhaps for the initiated this is compatible with regular expressions (If you don’t know what this means - No worries! If you do, see if you can utilise this to simplify your variable selection)
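To make the hint a bit more concrete, here is a minimal sketch on made-up data with similarly awkward variable names. The names X-1...2, X-1...3 and Y...4 are hypothetical, and the regular expression is just one of several ways to match them.

```r
library("tidyverse")

# Hypothetical toy data with awkward variable names
toy_data <- tibble(
  id = 1:2,
  `X-1...2` = c("a", "b"),
  `X-1...3` = c("c", "d"),
  `Y...4` = c("e", "f"))

# Keep id and all variables whose name starts with "X-",
# then rename "manually", using backticks around the odd names
toy_data <- toy_data |>
  select(id, matches("^X-")) |>
  rename(X1 = `X-1...2`,
         X2 = `X-1...3`)
```

Note that rename() takes the form new_name = old_name, which is easy to get the wrong way around.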

Before we proceed, this is the data we will carry on with:

meta_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 11
   Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   
   <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 eQD137     COVID-19-C…    NA <NA>   <NA>  "A*0… "A*0… "B*3… "B*4… "C*0… "C*1…
 2 eMR22      COVID-19-C…    65 M      <NA>  "A*0… "A*3… "B*4… "B*5… "C*0… "C*1…
 3 eQD139     COVID-19-C…    NA <NA>   <NA>  "A*0… "A*2… "B*5… "B*5… "C*0… "C*0…
 4 eAV93      Healthy (N…    41 M      White "A*1… "A*6… "B*3… "B*3… "C*0… "C*0…
 5 eDH105     COVID-19-C…    32 F      <NA>  "A*2… "A*2… "B*4… "B*4… "C*0… "C*0…
 6 eMR21      COVID-19-B…    53 F      White "A*2… "A*2… "B*0… "B*4… "C*0… "C*1…
 7 eDH107     COVID-19-C…    72 F      <NA>  "A*0… "A*0… "B*1… "B*3… "C*0… "C*0…
 8 eQD116     COVID-19-C…    66 F      <NA>  "A*0… "A*1… "B*3… "B*3… "C*0… "C*0…
 9 eHO138     COVID-19-B…    NA <NA>   <NA>  ""    ""    ""    ""    ""    ""   
10 eQD121     COVID-19-C…    38 M      <NA>  "A*0… "A*2… "B*1… "B*5… "C*0… "C*0…

Now, we have a beautiful tidy dataset, recall that this entails, that each row is an observation, each column is a variable and each cell holds one value.

The Peptide Details Data

Let’s start with simply having a look see:

peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 7
   `TCR BioIdentity`            TCR Nucleotide Seque…¹ Experiment `ORF Coverage`
   <chr>                        <chr>                  <chr>      <chr>         
 1 CSVIRQGAEQYF+TCRBV29-01+TCR… CTGACTGTGAGCAACATGAGC… eOX54      surface glyco…
 2 CASSDGWAAKLFF+TCRBV02-01+TC… AAGATCCGGTCCACAAAGCTG… eQD121     ORF1ab        
 3 CASSLVEDAISSYEQYF+TCRBV11-0… GCAAAGCTTGAGGACTCGGCC… eXL27      ORF1ab        
 4 CASSFRDIANYGYTF+TCRBV03-01/… AATTCCCTGGAGCTTGGTGAC… eOX52      ORF10         
 5 CASSPAGTGAYEQYF+TCRBV09-01+… AGCTCTCTGGAGCTGGGGGAC… eOX52      ORF1ab        
 6 CASSQVARNEKLFF+TCRBV03-01/0… ATCAATTCCCTGGAGCTTGGT… eAV93      ORF10         
 7 CASPPRQNDQETQYF+TCRBV06-06+… GAGTTGGCTGCTCCCTCCCAG… eXL31      ORF1ab        
 8 CASSLGFGELFF+TCRBV11-02+TCR… CTCAAGATCCAACCTGCAAAG… eXL30      surface glyco…
 9 CSARDPLAQNTGELFF+TCRBV20-X+… AGTGCCCATCCTGAAGACAGC… eMR16      surface glyco…
10 CASSSPHLAGGLNEQFF+TCRBV07-0… ACACAGCAGGAGGACTCCGCC… eEE226     ORF1ab        
# ℹ abbreviated name: ¹​`TCR Nucleotide Sequence`
# ℹ 3 more variables: `Amino Acids` <chr>, `Start Index in Genome` <dbl>,
#   `End Index in Genome` <dbl>
  • Q3: How many observations of how many variables are in the data?

This is a rather big data set, so let us start with two “tricks” to handle this, first:

  1. Write the data back into your data folder, using the filename peptide-detail-ci.csv.gz, note the appending of .gz, which is automatically recognised and results in gz-compression
  2. Now, check in your data folder, that you have two files peptide-detail-ci.csv and peptide-detail-ci.csv.gz, delete the former
  3. Adjust your reading-the-data-code in the “Load Data”-section, to now read in the peptide-detail-ci.csv.gz file

Click here for hint

Just as you can read a file, you can of course also write a file. Note the filetype we want to write here is csv. If you in the console type e.g. readr::wr and then hit the Tab key, you will see the different functions for writing different filetypes
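If you want to convince yourself that the compression really is handled automatically, the following sketch does a small round trip with made-up data in a temporary folder. The file name toy_data.csv.gz is hypothetical and the example deliberately does not touch your data folder.

```r
library("tidyverse")

# Hypothetical toy data and a file path in a temporary folder
toy_data <- tibble(id = 1:3, value = c("a", "b", "c"))
toy_file <- file.path(tempdir(), "toy_data.csv.gz")

# The .gz extension makes write_csv() compress the file
write_csv(x = toy_data, file = toy_file)

# read_csv() decompresses the file again when reading
toy_data_reread <- read_csv(file = toy_file, show_col_types = FALSE)
```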

Then:

  • T5: As before, let’s immediately subset the peptide_data to the variables of interest: TCR BioIdentity, Experiment and Amino Acids. Remember to assign this new data to the same peptide_data variable to avoid cluttering your environment with redundant variables. Bonus: Did you know you can click the Environment pane and see which variables you have?

Once again, before we proceed, this is the data we will carry on with:

peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 3
   Experiment `TCR BioIdentity`                       `Amino Acids`             
   <chr>      <chr>                                   <chr>                     
 1 eOX46      CSVGAGNTEAFF+TCRBV29-01+TCRBJ01-01      TLVPQEHYV                 
 2 eHO141     CASSNAANTEAFF+TCRBV05-01+TCRBJ01-01     AYKTFPPTEPK,KTFPPTEPK     
 3 eAV91      CASSTGGYTGELFF+TCRBV12-X+TCRBJ02-02     GVEHVTFFIY,HVTFFIYNK,STDT…
 4 eOX49      CASSLSRTQQPQHF+TCRBV12-X+TCRBJ01-05     ITEEVGHTDLMAAY            
 5 eXL30      CSARDPPRGGTEAFF+TCRBV20-X+TCRBJ01-01    AFLLFLVLI,FLAFLLFLV,FYLCF…
 6 ePD83      CASSIGQGISYEQYF+TCRBV19-01+TCRBJ02-07   SEHDYQIGGYTEKW,YQIGGYTEK,…
 7 eQD124     CASMVRMNTGELFF+TCRBV02-01+TCRBJ02-02    YLQPRTFL,YLQPRTFLL,YYVGYL…
 8 eEE228     CASSLEVVDSGTDTQYF+TCRBV05-05+TCRBJ02-03 LLYDANYFL,LLYDANYFLC,LYDA…
 9 eXL27      CASSPRDNEQFF+TCRBV04-01+TCRBJ02-01      ELYSPIFLI,LYSPIFLIV,QELYS…
10 eOX52      CASRPPGRSYEQYF+TCRBV28-01+TCRBJ02-07    FVDGVPFVV                 
  • Q4: Is this tidy data? Why/why not?

  • T6: See if you can find a way to create the below data, from the above

peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 5
   Experiment CDR3b             V_gene     J_gene     `Amino Acids`             
   <chr>      <chr>             <chr>      <chr>      <chr>                     
 1 eEE240     CASSQDRIAMNTEAFF  TCRBV04-01 TCRBJ01-01 APKEIIFL,KEIIFLEGETL      
 2 eXL31      CSADRNTEAFF       TCRBV29-01 TCRBJ01-01 YIFFASFYY                 
 3 eXL32      CASSLGAATEQYF     TCRBV05-01 TCRBJ02-07 YIFFASFYY                 
 4 eXL30      CASSDQETQYF       TCRBV10-01 TCRBJ02-05 APKEIIFL,KEIIFLEGETL      
 5 eEE226     CASSYTGGYQETQYF   TCRBV09-01 TCRBJ02-05 AFLLFLVLI,FLAFLLFLV,FYLCF…
 6 eXL27      CAGWGTSEGYTF      TCRBV07-09 TCRBJ01-02 AFLLFLVLI,FLAFLLFLV,FYLCF…
 7 eEE228     CASSIRSAYEQYF     TCRBV19-01 TCRBJ02-07 FFSNVTWFH,FLPFFSNVT,LPFFS…
 8 eLH48      CSARKLAGSSYEQYF   TCRBV20-X  TCRBJ02-07 QYIKWPWYI,YEQYIKWPW,YEQYI…
 9 eXL27      CSASPRDSNNEQFF    TCRBV20-X  TCRBJ02-01 AFLLFLVLI,FLAFLLFLV,FYLCF…
10 eOX52      CASSPVSGGAGTDTQYF TCRBV04-02 TCRBJ02-03 ELYSPIFLI,LYSPIFLIV,QELYS…

Click here for hint

First: Compare the two datasets and identify what happened. Did any variables “disappear” and did any “appear”? Ok, so this is a bit tricky, but perhaps there is a function to separate a composite (untidy) column into a set of new variables based on a separator? But what is a separator? Just like when you read a file with Comma Separated Values, a separator denotes how a composite string is divided into fields. So, look for such a repeated value, which indeed seems to separate such fields. Also, be aware that a character, which can mean more than one thing, may need to be “escaped” using two initial backslashes, i.e. “\\x”, where x denotes the character needing to be “escaped”
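Here is a minimal sketch of separating and escaping on made-up data, where a dot is used as the separator. A dot means “any character” in a regular expression, so it has to be escaped. The names composite, colour, size and count are hypothetical.

```r
library("tidyverse")

# Hypothetical toy data with one composite (untidy) column
toy_data <- tibble(composite = c("red.small.1", "blue.large.2"))

# The sep parameter is interpreted as a regular expression,
# so the dot is escaped using two backslashes
toy_data <- toy_data |>
  separate(col = composite,
           into = c("colour", "size", "count"),
           sep = "\\.")
```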
  • T7: Add a variable, which counts how many peptides are in each observation of Amino Acids

Click here for hint

We have been working with the stringr package, perhaps it contains a function to somehow count the number of occurrences of a given character in a string? Again, remember you can type e.g. stringr::str_ and then hit the Tab key to see relevant functions
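As a small illustration on made-up strings, the sketch below counts how many times a semicolon occurs in each string. The strings and the separator are assumptions for the example.

```r
library("tidyverse")

# Hypothetical toy strings using ";" as separator
toy_strings <- c("a;b;c", "a", "a;b")

# Count the number of occurrences of ";" in each string
n_semicolons <- str_count(string = toy_strings, pattern = ";")
```

Keep in mind that the number of separators is not the same as the number of fields.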
peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 6
   Experiment CDR3b             V_gene     J_gene     `Amino Acids`   n_peptides
   <chr>      <chr>             <chr>      <chr>      <chr>                <dbl>
 1 eMR18      unproductive      TCRBV05-01 TCRBJ02-02 YLQPRTFL,YLQPR…          3
 2 eEE226     CASAWGVSYEQYF     TCRBV27-01 TCRBJ02-07 FLNGSCGSV                1
 3 eHO135     CASSDTGELFF       TCRBV06-X  TCRBJ02-02 FLQSINFVR,FLQS…         13
 4 eOX43      CASRRDHPAGNTEAFF  TCRBV10-01 TCRBJ01-01 TVLSFCAFA,VLSF…          2
 5 eMR12      CASSYPVSGRSYNEQFF TCRBV06-X  TCRBJ02-01 HTTDPSFLGRY              1
 6 eOX46      CASRLYTEAFF       TCRBV10-02 TCRBJ01-01 IMLIIFWFSL,MLI…          2
 7 eOX49      CASSSLSGATGQFF    TCRBV19-01 TCRBJ02-01 LLYDANYFL,LLYD…          6
 8 ePD83      CASSMGQGLKHEQYF   TCRBV19-01 TCRBJ02-07 SEHDYQIGGYTEKW…          3
 9 eOX49      CATSDSQGLNTEAFF   TCRBV24-01 TCRBJ01-01 FLNGSCGSV                1
10 eXL31      CASSQDRGGSNGYTF   TCRBV04-03 TCRBJ01-02 YFPLQSYGF                1
  • T8: Re-create the following plot

  • Q4: What is the maximum number of peptides assigned to one observation?

  • T9: Using the str_c() and the seq() functions, re-create the below

[1] "peptide_1" "peptide_2" "peptide_3" "peptide_4" "peptide_5"

Click here for hint

If you’re uncertain on how a function works, try going into the console and in this case e.g. type str_c("a", "b") and seq(from = 1, to = 3) and see if you can combine these?
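The key thing to notice is that str_c() is vectorised, i.e. a single string is recycled across a vector. A minimal sketch, using the hypothetical prefix sample_:

```r
library("tidyverse")

# The single string "sample_" is combined with each number in turn
toy_labels <- str_c("sample_", seq(from = 1, to = 3))
```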
  • T10: Use, what you learned about separating in T6 and the vector-of-strings you created in T9 adjusted to the number from Q4 to create the below data

Click here for hint

In the console, write ?separate and think about how you used it earlier. Perhaps you can not only specify a vector to separate into, but also specify a function, which returns a vector?
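A minimal sketch of that idea on made-up data, where the vector of new variable names is created by a function call rather than typed out. The names composite and field_ are hypothetical.

```r
library("tidyverse")

# Hypothetical toy data with three comma-separated fields
toy_data <- tibble(composite = c("a,b,c", "d,e,f"))

# The into parameter just needs a character vector,
# so it can be the result of a function call
toy_data <- toy_data |>
  separate(col = composite,
           into = str_c("field_", seq(from = 1, to = 3)),
           sep = ",")
```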
peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 18
   Experiment CDR3b        V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
   <chr>      <chr>        <chr>  <chr>  <chr>     <chr>     <chr>     <chr>    
 1 eEE226     CASSLGGGANG… TCRBV… TCRBJ… AFPFTIYSL GYINVFAF… INVFAFPF… MGYINVFAF
 2 eXL31      CSASGLSDTIYF TCRBV… TCRBJ… KLSYGIATV <NA>      <NA>      <NA>     
 3 eEE226     CASSLLGGEST… TCRBV… TCRBJ… AFLLFLVLI FLAFLLFLV FYLCFLAFL FYLCFLAF…
 4 ePD83      CASTKGLGLNT… TCRBV… TCRBJ… SEHDYQIG… YQIGGYTEK YQIGGYTE… <NA>     
 5 eMR12      unproductive TCRBV… TCRBJ… HTTDPSFL… <NA>      <NA>      <NA>     
 6 eEE226     CASSPTHSNQP… TCRBV… TCRBJ… HTTDPSFL… <NA>      <NA>      <NA>     
 7 eHH175     CASSSGGYEQYF TCRBV… TCRBJ… LSPRWYFYY SPRWYFYYL <NA>      <NA>     
 8 eEE226     CASSLDRLAGF… TCRBV… TCRBJ… ITDVFYKE… SEYKGPIT… <NA>      <NA>     
 9 eAV93      CASSYPTSGRE… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
10 eAM23      CASSQDPAGLN… TCRBV… TCRBJ… QYIKWPWYI YEQYIKWPW YEQYIKWP… <NA>     
# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,
#   peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,
#   peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>
  • Q5: Now, presumably you got a warning, discuss in your group why that is?

  • Q6: With respect to peptide_n, discuss in your group, if this is wide- or long-data?

Now, finally we will use what we prepared for today, data pivoting. There are two functions, namely pivot_wider() and pivot_longer(). Also, now, we will use a trick for developing one’s data pipeline while working with new functions that one might not be completely comfortable with. You have seen the slice_sample() function several times above and we can use that to randomly sample n observations from data. This we can utilise to work with a smaller data set in the development phase and once we are ready, we can increase this n gradually to see if everything continues to work as anticipated.

  • T11: Using the peptide_data, run a few slice_sample() calls with varying degree of n to make sure, that you get a feeling for what is going on

  • T12: From the peptide_data data above, with peptide_1, peptide_2, etc. create this data set using one of the data pivoting functions. Remember to start initially with sampling a smaller data set and then work on that first! Also, once you’re sure you’re good to go, reuse the peptide_data variable as we don’t want huge redundant data sets floating around in our environment

Click here for hint

If the pivoting is not clear at all, then do what I do, create some example data:

my_data <- tibble(
  id = str_c("id_", 1:10),
  var_1 = round(rnorm(10),1),
  var_2 = round(rnorm(10),1),
  var_3 = round(rnorm(10),1))

…and then play around with that. A small set like the one above is easy to handle, so perhaps start with that and then pivot back and forth a few times using pivot_wider()/pivot_longer(). Use View() to inspect and get a better overview of the results of pivoting.
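As one possible way of pivoting back and forth, here is a minimal sketch on a small example data set with fixed values, so the result is the same every time you run it. The names toy_wide, toy_long, variable and value are hypothetical.

```r
library("tidyverse")

# Hypothetical toy data in wide format
toy_wide <- tibble(
  id = str_c("id_", 1:3),
  var_1 = c(0.1, 0.2, 0.3),
  var_2 = c(1.1, 1.2, 1.3))

# Wide to long: the variable names go into one column, the values into another
toy_long <- toy_wide |>
  pivot_longer(cols = c(var_1, var_2),
               names_to = "variable",
               values_to = "value")

# Long to wide: the reverse operation
toy_wide_again <- toy_long |>
  pivot_wider(names_from = variable,
              values_from = value)
```

Note that the new variable names are given as strings in pivot_longer(), as they do not exist yet, whereas pivot_wider() refers to existing variables.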

peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 7
   Experiment CDR3b           V_gene     J_gene     n_peptides peptide_n peptide
   <chr>      <chr>           <chr>      <chr>           <dbl> <chr>     <chr>  
 1 eOX52      CASSQDSGPGQPQHF TCRBV04-01 TCRBJ01-05          5 peptide_… <NA>   
 2 eXL30      CASSQGSHTEAFF   TCRBV19-01 TCRBJ01-01          4 peptide_2 AYSNNS…
 3 eXL31      CASSPYSGNWDEQYF TCRBV07-02 TCRBJ02-07         11 peptide_… SLIDFY…
 4 eQD126     CATIRDSTYEQYF   TCRBV28-01 TCRBJ02-07          1 peptide_8 <NA>   
 5 eQD121     CAWSDGTYEQYF    TCRBV30-01 TCRBJ02-07          1 peptide_… <NA>   
 6 eEE224     CASSPRHGVNSPLHF TCRBV28-01 TCRBJ01-06          1 peptide_… <NA>   
 7 ePD84      CSASQTSGSQETQYF TCRBV20-01 TCRBJ02-05         11 peptide_5 IDFYLC…
 8 eXL27      CAWRGTASYEQYF   TCRBV30-01 TCRBJ02-07         11 peptide_9 MIELSL…
 9 eAV91      CATSDFSGVNEQFF  TCRBV24-01 TCRBJ02-01          2 peptide_… <NA>   
10 eOX43      CATSFGTGAYEQYF  TCRBV15-01 TCRBJ02-07          1 peptide_5 <NA>   
  • Q7: You will see some NAs in the peptide variable, discuss in your group from where these arise?

  • Q8: How many rows and columns now and how does this compare with Q3? Discuss why/why not it is different?

  • T13: Now, lose the redundant variables n_peptides and peptide_n, get rid of the NAs in the peptide column, and make sure that we only have unique observations (i.e. there are no repeated rows/observations).
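The three steps can be sketched on made-up data as below. This is one possible combination of functions, assuming the hypothetical variables id, helper and value, and other approaches will work just as well.

```r
library("tidyverse")

# Hypothetical toy data with a redundant variable, an NA and a repeated row
toy_data <- tibble(
  id = c("a", "a", "b", "c"),
  helper = c(1, 2, 3, 4),
  value = c("x", "x", NA, "y"))

toy_data <- toy_data |>
  select(-helper) |>   # Drop the redundant variable
  drop_na(value) |>    # Remove rows, where value is NA
  distinct()           # Keep only unique rows
```

Note that the two "a" rows only become identical once the helper variable is gone, so the order of the steps matters here.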

peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 5
   Experiment CDR3b               V_gene     J_gene     peptide   
   <chr>      <chr>               <chr>      <chr>      <chr>     
 1 eOX43      CSARSSGSVEQYF       TCRBV20-X  TCRBJ02-07 FLWLLWPVT 
 2 eEE224     CASSSPGQGFYEQYF     TCRBV28-01 TCRBJ02-07 QELYSPIFL 
 3 eMR14      CASSLAGGLRQPQHF     TCRBV05-01 TCRBJ01-05 LSPRWYFYY 
 4 eOX49      CASTWKVVLERRDLWGYTF TCRBV06-08 TCRBJ01-02 LWLLWPVTL 
 5 eHO130     CASSGIGDPSGANVLTF   TCRBV28-01 TCRBJ02-06 LLYDANYFL 
 6 eEE240     CASSYSGGANYGYTF     TCRBV06-05 TCRBJ01-02 QSINFVRII 
 7 eEE240     CASSSGGRVNTEAFF     TCRBV09-01 TCRBJ01-01 LLFLVLIML 
 8 eEE240     CSASYGYTF           TCRBV29-01 TCRBJ01-02 SLIDFYLCFL
 9 eHO124     CASSFNEAGARYGYTF    TCRBV27-01 TCRBJ01-02 LIDFYLCFL 
10 eOX52      CATSDLREDSETQYF     TCRBV24-01 TCRBJ02-05 TPSGTWLTY 
  • Q8: Now how many rows and columns and is this data tidy? Discuss in your group why/why not?

Again, we turn to the stringr package, as we need to make sure that the sequence data does indeed only contain valid characters. There are a total of 20 proteinogenic amino acids, which we symbolise using ARNDCQEGHILKMFPSTWYV.

  • T14: Use the str_detect() function to filter the CDR3b and peptide variables using a pattern of [^ARNDCQEGHILKMFPSTWYV] and then play with the negate parameter to see what happens

Click here for hint

Again, try to play a bit around with the function in the console, type e.g. str_detect(string = "ARND", pattern = "A") and str_detect(string = "ARND", pattern = "C") and then recall, that the filter() function requires a logical vector, i.e. a vector of TRUE and FALSE to filter the rows
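To see how a negated character class and the negate parameter interact, here is a minimal sketch on made-up words with the hypothetical “alphabet” abc:

```r
library("tidyverse")

# Hypothetical toy data, where only the letters a, b and c are valid
toy_data <- tibble(word = c("abc", "abd", "a1c", "cab"))

# The pattern [^abc] matches any character, which is NOT a, b or c,
# so this keeps the words containing at least one invalid character
toy_flagged <- toy_data |>
  filter(str_detect(word, pattern = "[^abc]"))

# With negate = TRUE the logical vector is flipped,
# so this keeps the words containing only valid characters
toy_clean <- toy_data |>
  filter(str_detect(word, pattern = "[^abc]", negate = TRUE))
```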
  • T15: Add two new variables to the data, k_CDR3b and k_peptide each signifying the length of the respective sequences

Click here for hint

Again, we’re working with strings, so perhaps there is a package of interest and perhaps in that package, there is a function, which can get the length of a string?
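A minimal sketch on made-up words, assuming the hypothetical variable names word and word_length:

```r
library("tidyverse")

# str_length() returns the number of characters in each string
toy_data <- tibble(word = c("a", "abc", "abcde")) |>
  mutate(word_length = str_length(word))
```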
peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 7
   Experiment CDR3b              V_gene         J_gene peptide k_CDR3b k_peptide
   <chr>      <chr>              <chr>          <chr>  <chr>     <int>     <int>
 1 eOX46      CASSSGFYEQYF       TCRBV06-02/06… TCRBJ… NVFAFP…      12         9
 2 eOX54      CASSSTWIVEAFF      TCRBV07-09     TCRBJ… YTVSCL…      13        10
 3 eEE226     CASSIMVGTGALNYGYTF TCRBV19-01     TCRBJ… FLNGSC…      18         9
 4 eOX54      CASSQDLSPMNTEAFF   TCRBV14-01     TCRBJ… KVSIWN…      16        10
 5 eOX54      CSVGSPTSSYNEQFF    TCRBV29-01     TCRBJ… APAHIS…      15         8
 6 eHO135     CASSRARWGTDTQYF    TCRBV28-01     TCRBJ… CNDPFL…      15        10
 7 eOX54      CASSPHSGGETEAFF    TCRBV07-09     TCRBJ… FKVSIW…      15        10
 8 eXL31      CASSLSVGTGIPYEQYF  TCRBV27-01     TCRBJ… LIDFYL…      17         9
 9 ePD82      CSVDGGWGEQYF       TCRBV29-01     TCRBJ… FIASFR…      12         9
10 eEE224     CSARDRSQPQHF       TCRBV20-X      TCRBJ… LEYHDV…      12        10
  • T16: Re-create this plot

  • Q9: What is the most predominant length of the CDR3b-sequences?

  • T17: Re-create this plot

  • Q10: What is the most predominant length of the peptide-sequences?

  • Q11: Discuss in your group, if this data set is tidy or not?

peptide_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 7
   Experiment CDR3b                 V_gene     J_gene  peptide k_CDR3b k_peptide
   <chr>      <chr>                 <chr>      <chr>   <chr>     <int>     <int>
 1 eXL31      CSAPLGTVTWDTQYF       TCRBV20-X  TCRBJ0… FYLCFL…      15        10
 2 eAV93      CASSLSIWSGELFF        TCRBV27-01 TCRBJ0… TLACFV…      14        10
 3 eEE226     CASKIDPGGPTDTQYF      TCRBV12-X  TCRBJ0… NPANNA…      16        10
 4 eAV91      CASSSPDTDTQYF         TCRBV06-05 TCRBJ0… MIELSL…      13        10
 5 eXL30      CASSYSPPTSGGAGGTGELFF TCRBV06-X  TCRBJ0… AEAELA…      21        11
 6 eAV93      CASSQDWPDYNSPLHF      TCRBV04-02 TCRBJ0… GMEVTP…      16        11
 7 eEE226     CASSYGRGVSANTGELFF    TCRBV07-09 TCRBJ0… VSIWNL…      18        10
 8 eJL162     CASSADGDQPQHF         TCRBV06-X  TCRBJ0… HTTDPS…      13        11
 9 eEE240     CASSFVDEQYF           TCRBV11-03 TCRBJ0… AFPFTI…      11         9
10 eXL27      CASVQGNTGELFF         TCRBV02-01 TCRBJ0… FYLCFL…      13        10

Creating one data set from two data sets

Before we move on to using the family of *_join() functions you prepared for today, we will just take a quick peek at the meta data again:

meta_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 11
   Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   
   <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 eAV88      Healthy (N…    24 M      White "A*0… "A*0… "B*2… "B*4… "C*0… "C*0…
 2 eMR17      COVID-19-C…    NA <NA>   <NA>  "A*0… "A*2… "B*5… "B*5… "C*0… "C*0…
 3 eJL162     COVID-19-C…    61 M      <NA>  "A*0… "A*0… "B*4… "B*5… "C*0… "C*0…
 4 eNL192     COVID-19-C…    NA <NA>   <NA>  ""    ""    ""    ""    ""    ""   
 5 eDH113     Healthy (N…    56 <NA>   <NA>  "A*0… "A*2… "B*1… "B*4… "C*0… "C*1…
 6 eMR21      COVID-19-B…    53 F      White "A*2… "A*2… "B*0… "B*4… "C*0… "C*1…
 7 eLH57      COVID-19-C…    NA <NA>   <NA>  "A*0… "A*0… "B*0… "B*3… "C*0… "C*0…
 8 eLH54      COVID-19-C…    NA <NA>   <NA>  "A*0… "A*0… "B*0… "B*4… "C*0… "C*0…
 9 eQD116     COVID-19-C…    66 F      <NA>  "A*0… "A*1… "B*3… "B*3… "C*0… "C*0…
10 eAM23      COVID-19-C…    48 M      <NA>  "A*1… "A*2… "B*1… "B*5… "C*0… "C*1…

Remember you can scroll in the data.

  • Q12: Discuss in your group whether this data, with respect to the A1, A2, B1, B2, C1 and C2 variables, is in a wide or a long format

As with the peptide_data, we will now have to use data pivoting again. I.e.:

  • T18: Use either pivot_wider() or pivot_longer() to create the following data:

meta_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 7
   Experiment Cohort                        Age Gender Race         Gene  Allele
   <chr>      <chr>                       <dbl> <chr>  <chr>        <chr> <chr> 
 1 eNL189     COVID-19-Exposed               NA <NA>   <NA>         A2    ""    
 2 eLH49      COVID-19-Convalescent          76 M      <NA>         B1    "B*07…
 3 eMR25      COVID-19-Convalescent          21 F      <NA>         A2    ""    
 4 eOX54      Healthy (No known exposure)    39 F      African Ame… A1    "A*02…
 5 eTH332     COVID-19-Convalescent          NA <NA>   <NA>         B1    ""    
 6 eQD129     COVID-19-Convalescent          60 F      White        A1    "A*02…
 7 eHO125     COVID-19-Convalescent          52 M      <NA>         A1    "A*02…
 8 eDH113     Healthy (No known exposure)    56 <NA>   <NA>         B1    "B*18…
 9 eHO132     COVID-19-Convalescent          65 F      White        C2    "C*08…
10 eHH174     Healthy (No known exposure)    31 F      White        A2    "A*02…
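If you are unsure which direction a pivot goes, it can help to try it on a tiny example first. The following is only a minimal sketch on made-up data (toy_wide and its columns id, X1, X2, key and value are hypothetical and are not the meta_data); it shows how pivot_longer() stacks several columns into a name/value pair of columns:

```r
library(tibble)
library(tidyr)

# Hypothetical toy data in a wide format (NOT the real meta_data)
toy_wide <- tribble(
  ~id,  ~X1, ~X2,
  "s1", "a", "b",
  "s2", "c", "d"
)

# Stack X1 and X2: the old column names go to "key", the cell values to "value"
toy_long <- toy_wide |>
  pivot_longer(cols = c(X1, X2),
               names_to = "key",
               values_to = "value")

toy_long
```

Each row of the wide data becomes one row per pivoted column, so two rows and two pivoted columns yield four rows.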

Remember, what we are aiming for here, is to create one data set from two. So:

  • Q13: Discuss in your group which variable(s) define the same observations between the peptide_data and the meta_data

Once you have agreed upon Experiment, use that knowledge to subset the meta_data to the variables of interest:

meta_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 2
   Experiment Allele      
   <chr>      <chr>       
 1 eHO141     ""          
 2 eXL32      "A*01:01"   
 3 eLH57      "B*07:02:01"
 4 ePD73      "C*03:04"   
 5 ePD84      "A*02:01"   
 6 eLH53      "C*03:03:01"
 7 eLH51      "A*34:01:01"
 8 eQD108     "A*68:01:02"
 9 eOX52      "B*40:01"   
10 eHO138     ""          

Use the View() function again to look at the meta_data. Notice something? Some alleles are e.g. A*11:01, whereas others are B*51:01:02. You can find out why by visiting Nomenclature for Factors of the HLA System.

Long story short, we only want to include Field 1 (allele group) and Field 2 (specific HLA protein). You have prepared the stringr package for today. See if you can find a way to reduce e.g. B*51:01:02 to B*51:01 and create a new variable Allele_F_1_2 accordingly. While you are at it, also remove the ...x subscripts (where x is a number) from the Gene variable (they are an artifact of having the data in a wide format, where you cannot have two variables with the same name), and remove any NAs and ""s, which denote empty entries.

Click here for hint

There are several ways this can be achieved, the easiest being to consider whether a part of the string, selected by indices, could be of interest. Such “a part of a string” is called a substring. Perhaps the stringr package contains a function to work with substrings? In the console, type stringr:: and hit tab. This will display the functions available in the stringr package. Scroll down, find the functions starting with str_ and look for one that might be relevant. Remember, you can use ?function_name to get more information on how a given function works.

  • T19: Create the following data, according to specifications above:

meta_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 3
   Experiment Allele     Allele_F_1_2
   <chr>      <chr>      <chr>       
 1 eQD113     A*03:01:01 A*03:01     
 2 eMR20      A*02:01:01 A*02:01     
 3 eHO129     C*15:02:01 C*15:02     
 4 eAV105     B*07:02:01 B*07:02     
 5 eJL143     A*32:01    A*32:01     
 6 eHH174     B*51:01    B*51:01     
 7 eQD134     B*15:02:01 B*15:02     
 8 eAV88      B*40:01    B*40:01     
 9 eJL160     B*44:02:01 B*44:02     
10 eEE240     C*06:02    C*06:02     
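As a small illustration of the substring idea, here is a minimal sketch on made-up values (the tibble toy and its columns allele and allele_short are hypothetical). It assumes that every entry starts with one letter, an asterisk and two two-digit fields, so that the first 7 characters are exactly Field 1 and Field 2. That assumption should be checked against the real data before relying on it:

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data with a long allele, a short allele, an empty string and an NA
toy <- tibble(allele = c("B*51:01:02", "A*11:01", "", NA))

toy_clean <- toy |>
  # Drop missing and empty entries
  filter(!is.na(allele), allele != "") |>
  # Keep characters 1 to 7, i.e. "B*51:01" out of "B*51:01:02"
  mutate(allele_short = str_sub(allele, start = 1, end = 7))

toy_clean
```

Index-based substrings are simple, but they silently give wrong results if an entry does not follow the assumed layout, so a regular expression may be a more robust choice.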

The asterisk, i.e. *, is a rather annoying character because it is ambiguous (it has a special meaning in regular expressions), so:

  • T20: Clean the data a bit more, by removing the asterisk and redundant variables:

meta_data |> 
  slice_sample(n = 10)
# A tibble: 10 × 2
   Experiment Allele
   <chr>      <chr> 
 1 eLH41      B13:02
 2 eXL36      A01:01
 3 eHO133     B52:01
 4 eLH54      A03:01
 5 eJL161     C07:01
 6 eLH41      C06:02
 7 eQD111     C07:02
 8 eEE240     C06:02
 9 eXL32      A01:01
10 eLH58      C03:04

Click here for hint 1

Again, the stringr package may come in handy. Perhaps there is a function to remove one or more such pesky characters?

Click here for hint 2

Getting a weird error? Recall that ambiguous characters need to be “escaped”; you did this somehow earlier on…
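To make the escaping concrete, here is a minimal sketch on two made-up strings (the vector x is hypothetical). In a regular expression the asterisk is a quantifier, so it has to be escaped with a double backslash in an R string. Alternatively, the pattern can be wrapped in stringr's fixed() so it is matched literally:

```r
library(stringr)

# Hypothetical example strings
x <- c("B*51:01", "A*11:01")

# Option 1: escape the asterisk so the regex engine treats it literally
no_star_regex <- str_remove(x, "\\*")

# Option 2: use fixed() to match the pattern as plain text, no escaping needed
no_star_fixed <- str_remove(x, fixed("*"))

no_star_regex
no_star_fixed
```

Both options should give the same result here. Note that str_remove() only removes the first match in each string, whereas str_remove_all() removes every match.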

Recall the peptide_data?

peptide_data |>
  slice_sample(n = 10)
# A tibble: 10 × 7
   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide
   <chr>      <chr>            <chr>      <chr>      <chr>       <int>     <int>
 1 eEE240     CASSQEGRPPTDTQYF TCRBV04-01 TCRBJ02-03 YLCFLAFLL      16         9
 2 eEE240     CASSLDPRSTSYEQYF TCRBV11-03 TCRBJ02-07 LLYDANYF…      16        10
 3 eEE226     CAIRTGTTNEKLFF   TCRBV12-X  TCRBJ01-04 QPTESIVRF      14         9
 4 ePD85      CASSTGLGVVQPQHF  TCRBV19-01 TCRBJ01-05 YQIGGYTEK      15         9
 5 eAV93      CATSQGNEQYF      TCRBV14-01 TCRBJ02-07 WLLWPVTLA      11         9
 6 eOX52      CASRVRDKISPLHF   TCRBV27-01 TCRBJ01-06 APKEIIFL       14         8
 7 eEE228     CASSPRSGYEQYF    TCRBV19-01 TCRBJ02-07 LPFFSNVT…      13        10
 8 eXL30      CASSPGELEQYF     TCRBV07-06 TCRBJ02-07 IELSLIDF…      12        10
 9 eXL27      CASSLIDRTATDTQYF TCRBV28-01 TCRBJ02-03 AFPFTIYSL      16         9
10 eOX49      CASSLGGGTANYGYTF TCRBV13-01 TCRBJ01-02 MLIIFWFSL      16         9

  • T21: Create a dplyr pipeline, starting with the peptide_data, which joins it with the meta_data, and remember to make sure that you get only unique rows. Save this data into a new variable named peptide_meta_data (If you get a warning, discuss in your group what it means)

Click here for hint 1

Which family of functions do we use to join data? Also, it would perhaps be prudent to start by working on a smaller data set. Recall that we can sample a number of rows, yielding a smaller development data set.

Click here for hint 2

You should get a data set of more than 3,000,000 rows. Take a moment to consider how that would have been to work with in Excel. Also, in case the servers are not liking this, you can consider subsetting the peptide_data to e.g. 100,000 or 10,000 rows prior to joining.

peptide_meta_data |>
  slice_sample(n = 10)
# A tibble: 10 × 8
   Experiment CDR3b             V_gene   J_gene peptide k_CDR3b k_peptide Allele
   <chr>      <chr>             <chr>    <chr>  <chr>     <int>     <int> <chr> 
 1 eXL37      CSADVSGAYNEQFF    TCRBV20… TCRBJ… FYLCFL…      14        10 B44:03
 2 eQD134     CASSLRTSGETDTQYF  TCRBV04… TCRBJ… LPFFSN…      16        10 B15:02
 3 eXL31      CASGWGFYEQYF      TCRBV06… TCRBJ… LIDFYL…      12         9 B44:03
 4 eXL27      CASSKGESSYNEQFF   TCRBV21… TCRBJ… INVFAF…      15        10 A02:01
 5 eHO135     CASSTTLLDGSYEQYF  TCRBV07… TCRBJ… VTPSGT…      16        10 C04:01
 6 ePD76      CASSKGGWNIQYF     TCRBV07… TCRBJ… LLFVTV…      13        10 C03:04
 7 eAV93      CASSLGAGEGKLFF    TCRBV05… TCRBJ… CNDPFL…      14        10 B35:01
 8 eOX46      CSAEEEDWNGVNNEQFF TCRBV20… TCRBJ… FYLCFL…      17         9 B35:03
 9 eDH113     CASSLSVYEQYF      TCRBV12… TCRBJ… FLWLLW…      12         9 A29:02
10 eMR16      CASSYTGNQPQHF     TCRBV06… TCRBJ… VYFLQS…      13         9 B18:01
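To see why a join can multiply rows, and why removing duplicates matters, here is a minimal sketch on made-up tables (left_tbl, right_tbl and their columns key, x and y are hypothetical). The key e1 occurs more than once in both tables, which is the situation where recent versions of dplyr are expected to warn about a many-to-many relationship:

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables (NOT the real peptide_data / meta_data)
left_tbl <- tribble(
  ~key, ~x,
  "e1", "p1",
  "e1", "p1",  # deliberate duplicate row
  "e2", "p2"
)
right_tbl <- tribble(
  ~key, ~y,
  "e1", "A",
  "e1", "B",
  "e2", "C"
)

joined <- left_tbl |>
  # Every left row is paired with every matching right row: 2 x 2 + 1 = 5 rows
  left_join(right_tbl, by = "key") |>
  # Collapse identical rows, leaving 3 unique rows
  distinct()

joined
```

Whether to call distinct() before the join, after it, or both is a design choice. Doing it before reduces the size of the intermediate result, which can matter with millions of rows.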

Analysis

Now that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:

  1. What characterises the peptides binding to the HLAs?
  2. What characterises T-cell Receptors binding to the pMHC-complexes?

Peptides binding to HLA

As we have touched upon multiple times, R is very flexible and naturally you can also create sequence logos. Finally, let us create a binding motif using the package ggseqlogo (More info here).

  • T22: Subset the final peptide_meta_data data to A02:01 and unique observations of peptides of length 9 and re-create the below sequence logo

Click here for hint

You can pipe a vector of peptides into ggseqlogo, but perhaps you first need to pull that vector from the relevant variable in your tibble? Also, before that, you will need to make sure that you are only looking at peptides of length 9.
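The general pattern from the hint can be sketched as follows, using made-up sequences (the tibble toy, its columns pep and k_pep, and the sequences themselves are hypothetical and not taken from the data). The ggseqlogo call is guarded, so the sketch also runs where that package is not installed:

```r
library(dplyr)
library(tibble)

# Hypothetical toy sequences: three of length 9 (one duplicated), one of length 8
toy <- tibble(
  pep = c("ALDKVYSTL", "ALDKVYSTL", "GLDRVWSTV", "ALDKVYST"),
  k_pep = nchar(pep)
)

pep_vec <- toy |>
  filter(k_pep == 9) |>  # keep only sequences of length 9
  distinct(pep) |>       # keep each sequence once
  pull(pep)              # extract the column as a character vector

pep_vec

# Only draw the logo if ggseqlogo is available
if (requireNamespace("ggseqlogo", quietly = TRUE)) {
  pep_vec |> ggseqlogo::ggseqlogo()
}
```

All sequences passed to a logo need to have the same length, which is why the filtering step comes before pulling the vector.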

  • T23: Repeat for e.g. B07:02 or another of your favourite alleles

Now, let’s take a closer look at the sequence logo:

  • Q14: Which positions in the peptide determine binding to HLA?

Click here for hint

Recall your Introduction to Bioinformatics course? And/or perhaps ask your fellow group members if they know?

CDR3b-sequences binding to pMHC

  • T24: Subset the peptide_meta_data, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:

  • Q15: In your group, discuss what you see?

  • T25: Play around with other combinations of k_CDR3b, Allele, and peptide and inspect how the logo changes

Disclaimer: In this data set, we only get that a given CDR3b was found to recognise a given peptide in a given subject, and that this subject had a given haplotype - Something’s missing… Perhaps if you have had immunology, then you can spot it? There is a trick to get around this missing information, but that’s beyond the scope of what we’re working with here.

Epilogue

That’s it for today - I know this is overwhelming now, but commit to it and you WILL be plenty rewarded! I hope today was at least a glimpse into the flexibility and capabilities of using tidyverse for applied Bio Data Science

…also, noticed something? We spent maybe 80% of the time here on data wrangling, and then once we were good to go, the analysis wasn’t that time-consuming - That’s often the way it ends up going. You’ll spend a lot of time on data handling, and getting the tidyverse toolbox in your tool belt will allow you to be so much more efficient in your data wrangling, so you can get to the fun part as quickly as possible!

Today’s Assignment

After today, we are halfway through the labs of the course, so now is a good time to spend some time recalling what we have been over and practising writing a reproducible Quarto-report.

Your group assignment today is to condense the exercises into a group micro-report! Talk together and figure out how to distil the exercises from today into one small end-to-end runnable reproducible micro-report. DO NOT include ALL of the exercises, but rather include as few steps as possible to arrive at your results. Be very concise!

But WHY? WHY are you not specifying exactly what we need to hand in? Because we are training you to make independent decisions, which is crucial in applied bio data science, so take a look at the combined group code, select relevant sections and condense - If you don’t make it all the way through the exercises, then condense and present what you were able to arrive at! What do you think is central/important/indispensable? Also, these hand-ins are NOT for us to evaluate you, but for you to train creating products and then get feedback on your progress!

IMPORTANT: Remember to check the ASSIGNMENT GUIDELINES

…and as always - Have fun!