Lab 1: Course Intro & the Very Basics

Package(s)

Schedule

  • 08.00 - 08.15: Arrival and 5 min. pre-course anonymous survey: Click here to answer
  • 08.15 - 08.45: Lecture: Course Introduction
  • 08.45 - 09.00: Break
  • 09.00 - 11.15: Exercises
  • 11.15 - 11.30: Break
  • 11.30 - 12.00: Lecture: Reproducibility in Modern Bio Data Science

Learning Materials

Please prepare the following materials:

Learning Objectives

A student who has met the objectives of the session will be able to:

  • Master the very basics of R
  • Navigate the RStudio IDE
  • Create, edit and run a basic Quarto document
  • Explain why reproducible data analysis is important, as well as identify relevant challenges and explain replicability versus reproducibility
  • Describe the components of a reproducible data analysis

Exercises

Today, we will focused on getting you started and up and running with the first elements of the course, namely the RStudio IDE (Integrated Developer Environment) and Quarto. If the relationship between R and RStudio is unclear, think of it this way: Consider a car, in that case, R would be the engine and RStudio would be the rest of the car. Neither is particularly useful, but together they form a functioning unit. Before you continue, make sure you in fact did watch the “RStudio for the Total Beginner” video (See the Learning Materials for todays session).

Cloud server and the RStudio IDE

Go to the R for Bio Data Science Cloud Server and follow the login procedure. Upon login, you will see this:

This is the RStudio IDE. It allows you to consolidate all features needed to develop R code for analysis. Now, click Tools \(\rightarrow\) Global Options... \(\rightarrow\) Pane Layout and you will see this:

This outlines the four panes you have in your RStudio IDE and allow you rearrange them as you please. Now, re-arrange them, so that they look like this:

Click Apply \(\rightarrow\) OK and you should see this:

First steps

The Console

Now, in the console, as you saw in the video, you can type commands like:

2+2
1:100
3*3
sample(1:9)
X <- matrix(sample(1:9), nrow = 3, ncol = 3)
X
sum(X)
mean(X)
?sum
sum
nucleotides <- c("a", "c", "g", "t")
nucleotides
sample(nucleotides, size = 100, replace = TRUE)
table(sample(nucleotides, size = 100, replace = TRUE))
paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = "")
replicate(n = 10, expr = paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = ""))
df <- data.frame(id = paste0("seq", 1:10), seq = replicate(n = 10, expr = paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = "")))
df
str(df)
ls()

Take some time and play around with these commands and other things you can come up with. Use the ?function to get help on what that function does. Be sure to discuss what you observe in the console. Do not worry too much on the details for now, we are just getting started. But as you hopefully can see, R is very flexible and basically the message is: “If you can think it, you can build it in R”.

  • Go to Chapter 3 in R4DS2e and do the exercises

The Terminal

Notice how in the console pane, you also get a Terminal, click and enter:

ls
mkdir tmp
touch tmp/test.txt
ls tmp
rm tmp/test.txt
rmdir tmp
ls
echo $SHELL

Basically, here you have access to a full terminal, which can prove immensely useful! Note, you may or may not be familiar with the concept of a terminal. Simply think of it as a way to interact with the computer using text command, rather than clicking on icons etc. Click back to the console.

The Source

The source is where you will write scripts. A script is a series of commands to be executed sequentially, i.e. first line 1, then line 2 and so on. Right now, you should have a open script called Untitled1. If not, you can create a new script by clicking white paper with a round green plus sign in the upper left corner.

Taking inspiration from the examples above, try to write a series of commands and include a print()-statement at the very end. Click File \(\rightarrow\) Save and save the file as e.g. my_first_script.R. Now, go to the console and type in the command source("my_first_script.R"). Congratulations! You have now written your very first reproducible R-program!

The Whole Shebang

Enough playing around, let us embark on our modern Bio Data Science in R journey.

  • In the Files pane, click New Folder and create a folder called projects
  • In the upper right corner, click where it says Project: (None) and then click New Project...
  • Click New Directory and then New Project
  • In the Directory name:, enter e.g. r_for_bio_data_science
  • Click the Browse... button and select your newly created projects directory and then click Choose
  • Click Create Project and wait a bit for it to get created

On Working in Projects

Projects allow you to create fully transferable bio data science projects, meaning that the root of the project will be where the .Rproj file is located. You can confirm this by entering getwd() in the console. This means that under no circumstance should ever not work in a project nor should ever use absolute paths. Every single path you state in your project must be relative to the project root.

But why? Imagine you have create a project, where you have indeed used absolute paths. Now you want to share that project with a colleague. Said colleague gets your project and tests the reproducibility by running the project end-to-end. But it completely fails because you have hardcoded your paths to be absolute, meaning that all files and project ressources locations points to locations on your laptop.

Projects are a must and allows you to create reproducible encapsulated bio data science projects. Note, the concept of reproducibility is absolute central to this course and must be considered in all aspect of the life cycle of a project!

Quarto

While .R-scripts are a perfectly, there is another Skywalker:

  • In the upper left corner, again, click the white paper with the round green plus, but this time select Quarto Document
  • Enter a Title:, e.g. “Lab 1 Exercises” and enter your name below in the box Author:
  • Click Create
  • Important: Save your Quarto document! Click File \(\rightarrow\) Save and name it e.g. lab_01_exercises.qmd
  • Minimise the Environment-pane

You should now see something like this:

Try clicking the Render button just above the quarto-document. This will create the HTML5 output file. Note! You may get a Connection Refused message. If so, don’t worry, just close the page to return to the cloud server and find the generated .html file, left-click and select View in Web Browser.

If you have previously worked with Rmarkdown, then many features of Quarto will be familiar. Think of Quarto as the complete rethink of Rmarkdown, i.e. based on all the experience gained, what would the optimal way of constructing an open-source scientific and technical publishing system?

Perhaps you have previously encountered Jupyter notebooks, Quarto is similar. The basic idea is to have one document covering the entire project cycle.

Proceed R for Data Science (2e), chapter 29 and do the exercises.

Bio Data Science with a Virtual AI Assistant

I am pretty sure you all know of chatGPT by now, so let us address the elephant in the room!

Getting started

  • Go to the ChatGPT site
  • Create a user and login
Let us get acquainted

Now, at the bottom it says “Send a message”, let us ask 3 simple question and see if we can find out what is what. Type the following questions in the prompt and read the answers:

  1. Explain in simple terms what you are
  2. Explain in simple terms how you work
  3. Explain in simple terms how you can be used to generate value as a virtual AI assistant, when doing Bio Data Science for R
Moving onto R

In the upper left corner, it says + New chat, click it to start a new session.

Again, type the following questions in the prompt and read the answers:

  1. R
  2. What is R?
  3. Give a few simple examples

If you do get some code examples, try to copy/paste into the console in RStudio and see if they run

Prompt Engineering

Start a new chat and enter:

  1. Give a few simple examples

Compare the response with the one from before, is it the same or different and why so?

Discuss in your group what is “Prompt Engineering” and how does it relate to the above few tests you did? (Bonus info: Prompt engineer is already a job and people are making money off of selling prompts)

…a bit more

Start a new chat and enter:

  • Tell me about DNA

Check if it gives you correct information?

Now, write:

  • Give a few fun examples on how to get started with the R programming language using DNA

Copy/paste the code into the Console, do the examples all run?

  • Earlier you were told to play around with some code snippets. Perhaps you didn’t fully catch what was going on? If so, try to type in e.g.:
I'm new to R, please explain in simple terms, what the following code does:

"
nucleotides <- c("a", "c", "g", "t")
replicate(n = 10, expr = paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = ""))
"
Summary

While chatGPT can be a powerful tool for code productivity, it comes with a major caveat: When it fails, it fails with confidence. This means that it will be equally confident whether it is right or wrong! The optimal yield is when you are working on a problem with a tool, where you already have a good knowledge of the problem and the tool. Here, you can see if you are heading in the right direction or if you’re being send on a wild goose chase.

Therefore, in the beginning of the course, for your own sake, please refrain from using chatGPT!

Instead solve the exercises based on the materials you have prepared, talk to you group members and discuss the challenges and check your understanding. Then later on, when you have gained initial experience, we can explore using chatGPT to augment your bio data science workflow to enhance productivity