Lab 7: Collaborative Bio Data Science using GitHub via RStudio

Package(s)

Schedule

Learning Materials

Please prepare the following materials

Learning Objectives

A student who has met the objectives of the session will be able to:

  • Explain why reproducible data analysis is important
  • Identify associated relevant challenges
  • Describe the components of a reproducible data analysis
  • Use RStudio and GitHub (git) for collaborative bio data science projects

Exercises

Prologue

Using git is completely central for collaborative bio data science. You can use git not only for R, but for any code-based collaboration. All bioinformatics departments in major players in Danish Pharma, are using git - Learning git is a key bio data science skill! Today, you will take your first steps - Happy Learning!

  • T1: Find the GitHub repository for the ggplot2 R-package
  • Q1: How many Contributors are there?
  • T1: Find the GitHub repository for the Linux kernel
  • Q2: How many Contributors are there?
  • Q3: Discuss in your group why having an organised approach to version control is central? And consider the simple contrast of the challenges you were facing when doing the course assignments.

Getting Started

The following exercises have to be done in your groups! You must move at the same pace and progress together as a team through the exercises!

GitHub is the place for collaborative coding and different group members will have to do different tasks in a specific order, to make it through the exercises together, so… Team Up and don’t rush it!

First, select a team Captain, that person will have to carry out specific tasks. If Team is stated, then that refers to all group members and lastly if Crew is stated, then that is everyone but the Captain. Please note, that tasks are sequential, so if a task is assigned to the Captain, then the Crew has to await completion before proceeding!

Team

  1. Go to GitHub and login
  2. In the upper left corner, click Repositories

Captain

  1. The default owner of the repository should be your GitHub username. Keep that, please.
  2. Name the repository groupXX, where XX is your group number, e.g. 02
  3. Select Public
  4. Tick Add a README file
  5. Click Create repository
  6. Click Settings in the menu line starting with <> Code
  7. Under Access, click Collaborators and teams
  8. Click Add people
  9. Write a group members username
  10. Click the correct suggestion
  11. Select Add username to this repository
  12. Repeat for all group members

Crew

  1. Check your mail and accept your invitation to join the repository

Team

  1. Click here to go to the course RStudio cloud server and login
  2. In the upper right corner, where it should say r_for_bio_data_science, click
  3. Choose New Project...
  4. Select Version Control
  5. Select Git
  6. Under Repository URL:, enter https://github.com/CAPTAIN_USERNAME/groupXX, where again you replace XX with your group number, e.g. 02. NOTE: Replace CAPTAIN_USERNAME with the captain’s GitHub username.
  7. Under Project directory name:, enter lab07_git_exercises
  8. Under Create project as a subdirectory of:, make sure that is says ~/projects
  9. Click Create Project

Congratulations! You have just cloned your first GitHub repository!

In your Files-pane, you should now see:

…and now in your Environment-pane, you should see a new git-tab:

If you click the git-tab, you should see:

Your first collaboration

Team

  1. Create a new Quarto document, title it student_id and save id as student_id.qmd, where student_id is your… You guessed it!
  2. In the environment-pane, click the git-tab
  3. Tick the 3 boxes under staged
  4. Click commit
  5. In the upper right corner, add a Commit message, e.g “First commit by student_id”
  6. Click the Commit-button
  7. A pop-up, will give you details on your commit, look through them and then click Close
  8. Now, very important ALWAYS click the Pull-button BEFORE clicking the Push-button
  9. Clicking Pull, you should see “Already up to date.”
  10. Then click Push
  • Q4: Discuss in your group, what happened and why? Try copying the message and paste it into google, did you get any hits?

Do not start solving this issue just yet, simply realise, that git is being used by literally millions of people on a daily basis, the chance that you will encounter a unique error, that no one else have ever seen is slim-to-nothing! So if you end up in trouble, googl’ing the message, is not a bad initial stab at the problem!

Setting up your Credentials and Personal Access Token (Team)

GitHub doesn’t know who you are, you’re just someone who cloned your first GitHub repository and now you want to do all sorts of stuff! We can’t have that, so let’s fix that GitHub doesn’t know who you are!

  1. In the Console, run the command:
usethis::use_git_config(user.name = "YOUR_GITHUB_USERNAME", user.email = "THE_EMAIL_YOU_USED_TO_CREATE_YOUR_GITHUB_ACCOUNT")
  1. In the Console, run the command:
usethis::create_github_token()
  1. Login
  2. Under Note, where it says DESCRIBE THE TOKEN'S USE CASE, delete the text and write e.g.R for Bio Data Science lab 7 git exercises
  3. Under Expiration, where it says 30 days, change that to 90 days
  4. Do not change any other setting, but simply scroll down and click Generate token
  5. A new Personal access tokens (classic)-page will appear, stating your personal access token, which starts with ghp_, go ahead and copy it
  6. Store the PAT somewhere save, e.g. in a password management tool
  7. Again in the console, run the below command and enter your PAT when prompted:
gitcreds::gitcreds_set()
  1. In the Environment-pane, make sure you are in the git-tab and again click Pull (We always pull before push)
  2. Click Push
  3. Go to your groups GitHub-project page https://github.com/CAPTAIN_USERNAME/groupXX
  4. Confirm, that you see your files

Congratulations! You just took your first step towards true collaborative bio data science!

Stop, wait a minute and make sure, that ALL group members are at this stage of the exercises.

Making your Credentials and Personal Access Token stick around (Team)

The thing is, we’re actually working on a Linux server her, this entails a that you’re credentials and PAT exists in a cache, which is cleared every 15 minutes. This. Is. Annoying! So let’s fix it!

First of all RStudio is a (G)raphical (U)ser (I)nterface, meaning, that it allows you to do pointy-clicky stuff, but at the end of the day, everything happens at the commandline. That also goes for git, so your button-pushing simply gets converted into commands, which are executed at the commandline. If you are not comfortable with the commandline, don’t worry, the pointy-clicky stuff will be sufficient for now! If you are comfortable with the commandline, I advice to get comfortable with git on the commandline as well, you do get some extra bang-for-the-buck!

Now, a brief visit of the commandline:

  1. In the Console-pane, click the Terminal-tab, which gains you access to the system underlying RStudio
  2. You should see something like user@pupilX:/net/pupilX/home/people/user/projects/lab07_git_exercises$
  3. Enter ls -a, this will list all the files, compare with your Files-pane, you will see that RStudio is “hiding” something from you, namely the .git folder
  4. Enter ls -a .git and you will see the content of the folder, this is where all the git-magic happens
  5. See the config file? That holds the configuration, enter git config --global --list
  6. Recall you entered your GitHub username and mail? This is where it ended up! Note the credential.helper=cache, which tells us, that the credentials are being cached. Now, enter git config --global credential.helper 'cache --timeout=86400'
  7. Re-run the command git config --global --list and confirm the change
  8. Go back to your Console and run the command gitcreds::gitcreds_set() and re-enter your PAT

Congratulations! You have now used the commandline and you will forever be part of an elite few, who know that everything you see in hacker-movies is BS, except perhaps for Wargames and Mr. Robot! Also, your PAT will now be good for 24h

Your next collaboration

That first collaboration was easy right!? Well, you were all working on different files…

Captain

  1. Create a new Quarto Document, title it “Group Document” and save it as group_document.qmd
  2. Using markdown headers, create one section for each group member, incl. yourself. Here, you can use your names, student ids or whatever you deem appropriate
  3. Again in the Environment-pane, make sure you are in the git-tab and then tick the box next to group_document.qmd
  4. Click Commit
  5. Add a commit message, e.g. “Add group document”
  6. Click the Commit-button
  7. Click Pull (You should be “Already up to date.”)
  8. Click Push
  9. Click Close
  10. Again, go to the group GitHub and confirm that you see the new document you just created

Crew

  1. Click Pull and confirm that you now also have the file group_document.qmd
  2. Open group_document.qmd and find your assigned section
  3. In your section and your section only, enter some text, add a few code chunks with some R-code
  4. Make sure to save the document
  5. Now again, find the group_document.qmd in the git-tab of the Environment-pane and tick the box under Staged
  6. Click Commit
  7. Note how your changes to the document are highlighted in green
  8. Add a commit message, e.g. “Update the STUDENT_ID section” and click Commit
  9. Click Close
  10. Click Pull
  11. Click Push
  12. Go to the group GitHub, find the group_document.qmd and click it, do you see your changes?

Once everyone has added to their assigned section, everyone should do a pull/push, so everyone has the complete version of the group_document.qmd.

Your first branching

Ok that’s pretty great so far - Right? The thing is… Consider, the ggplot2 repository, that you found in T1. Thousands of companies and even more thousands of people rely on ggplot for advanced data visualisation. What would happen, if you wanted to add a new feature or wanted to optimise an existing one, while people were actively installing your package? They would get what stage your code was in, which may or may not be functional - Enter branching!

Below here, is an illustration, consider the Master the stable version people can download and use and then Your Work will be the feature or update that you personally are working on. Someone Else’s Work will be another team and that persons work on a new feature or update. Before the last green circle, note how both the Your Work- and Someone Else’s Work-branches are merged onto the Master-branch.

There can be even more branches

  • Q4: Again, find the ggplot GitHub site and see if you can find how many branches there are?

Team

  1. Go to your RStudio session and in the git-tab of the Environment-pane, click New Branch
  2. In branch name, enter your student id
  3. You should see a pop-up Branch 'STUDENT_ID' set up to track remote branch 'STUDENT_ID' from 'origin'., go ahead and click Close
  4. Next to where you clicked New Branch just before, it should now say STUDENT_ID, click it and confirm that you see STUDENT_ID and Main
  5. Look at the illustration above and make sure you are on par with that from the original branch main, which is equivalent with Master, you have created a new branch STUDENT_ID, which is equivalent to Your Work
  6. Click STUDENT_ID and you will get a confirmation that you are already on that branch and that you are up-to-date, click Close
  7. In the group_document.qmd under your section, using markdown, enter a new sub-header and name it e.g. New feature or New analysis
  8. Enter some text, a few code chunks with a bit of R-code of your choosing and save the document
  9. Again, in the git-tab, tick the box under staged and click Commit
  10. Add a commit message, e.g. New feature from STUDENT_ID and click Commit
  11. Click Close, Pull, Close, Push and Close and then close the commit window
  12. Go to your group GitHub and confirm that you now have at least 2 branches
  13. Make sure you are in the main branch and then click the group_document.qmd, you should now not see your new feature/analysis that you added
  14. On the left, where it says main, click and select your STUDENT_ID, you should now see your new feature/analysis that you added

Congratulations! You have now succesfully done your first branching!

Your first branch merging

  1. Go to your groups GitHub page, at the top it should say STUDENT_ID had recent pushes..., click the Compare & pull request
  2. Your commit message will appear and where it says Leave a comment, add a comment like e.g. I'm done, all seems to be working now! or similar
  3. Click Create pull request
  4. It should now say This branch have no conflicts with the base branch, confirm and click Merge pull request
  5. Click Confirm merge after which it should now say Pull request successfully merged and closed
  6. You have now fully merged, so go ahead and click Delete branch
  7. Revisit the previous illustration and compare with your branch workflow, make sure that everyone in the groups are on par here
  8. Finally return to your RStudio session and make sure you switch to the main-branch

Congratulations! You have now succesfully done your first branch merge!

But wait, what was this Pull request??? What you actually did, was: 1. Created a new branch 1. Completed a new feature/analysis 1. Push’ed the new feature/analysis to GitHub 1. Created a Pull request for merging your branch STUDENT_ID into the main branch 1. Approved and completed the Pull request

Think about e.g. again the ggplot2-repo, if anyone could create a new branch and then do as described above, then there would be no way of making quality control. Therefore, typically such pull requests will have to be approved by someone. This can either be someone who is close-to-the-code e.g. in the case of ggplot2, that’d be someone like Hadley. In a company, then that might be some senior developer approving junior developers pull requests. At one point you might have seen something like “The min branch is unprotected”, this is exactly that!

Your first merge conflict

Okay, that’s all good an well. Seems easy and straight forward, right? Well, that is as long as we don’t have a conflict. A conflict is when two or more changes to one file are not compatible, consider:

Top 10 Data Science Languages of ALL Time:
10. It is 
9. Impossible
8. to rank them
7. Because
6. programming is subjective
5. and everyone
4. has
3. different
2. tastes
1. R

In fact, let’s screw things up!

Captain

  1. Go to the RStudio session and where you usually click Quarto Document..., this time let’s just create a simple Text File
  2. Copy/paste the Top 10 Data Science Languages of ALL Time into the file
  3. Save the file as best_ds_langs.txt
  4. Make sure you are in the main-branch
  5. Do a commit/pull/push and check the file ended at your group github

Crew

Let’s commit mutiny! (pun intended)

  1. Make sure you are in the main-branch and then hit Pull
  2. Open the best_ds_langs.txt
  3. Each of you separatly (wrongfully) replace R with an (inferior) data science language of your choice (you can really write anything)
  4. Now, this time we do not pull first, simply commit and then push
  5. Now you can pull
  6. You will now get notified about a merge conflict
  7. Close and re-open the best_ds_langs.txt-file, it should look something like:
Top 10 Data Science Languages of ALL Time:
10. It is 
9. Impossible
8. to rank them
7. Because
6. programming is subjective
5. and everyone
4. has
3. different
2. tastes
<<<<<<< HEAD
1. C++
=======
1. Python
>>>>>>> 85283ca3246b7f7462ab633085f92ac5f173d3e7

Captain

Get your Crew in order!

  1. Simply edit the file so that you delete everything below 2. and add the final line, which emphasises that 1. R is superiour!
  2. The box under Staged looks a bit odd right, just click it
  3. Click Commit and add a commit message, e.g. Fixed merge conflict, got crew back in order!
  4. Click Commit, Close, Pull, Close, Push, Close and close the commit window
  5. Return to your GitHub group and confirm all is in order and you can continue with confidence knowing that R is superiour! (Check the file to make sure)

Congratulations! You have now fixed your first merge conflict!

Lastly, click the History-button next to the Push-button and explore the history of how your branches and commits have changed through these exercises.

The .gitignore file

You may have noticed the .gitignore-file?

Team

  1. Go to your RStudio session, find the .gitignore-file and click it

It contains a list of files and folders, which should not end up at your git-repo, i.e. files which should be ignored. An important aspect of working with GitHub is that GitHub is meant for code, not data! Let’s say we had a data and a data/_raw/, we could add those to the .gitignore-file to avoid the data being included in our commits.

  1. Update the .gitignore-file and get the updated version to GitHub

Summary

You have not gotten to play around with collaborative coding using git - Well done!

If you are more curious for more, please feel free to play around with a new file, edit commit/pull/push etc. Perhaps also take an extra look at the GitHub, explore and learn more 👍

GROUP ASSIGNMENT

First of all, here we have just scratched the surface. You can read much more here.

Your assignment this time, will be to:

  1. Go to this post on “PCA tidyverse style” by Claus O. Wilke, Professor of Integrative Biology
  2. Again, create a reproducible micro-report, this time, where you do a code-along with either the data in the post or e.g. the gravier-data or ?
  3. Use your GitHub group repository to collaborate
  4. Your hand in will be a link to your micro-report in your GitHub group repository
  5. Make sure to check the Assignment Guidelines
  6. And also follow the Course Code Styling
  7. HOW TO HAND IN: Go to https://github.com/CAPTAIN_USERNAME/groupXX, copy the link and paste it into an empty text (.txt) file. Hand in this text file.

Note, focus here is on the git learning objectives and doing a code-along, which is not a copy/paste of the code, but using it as inspiration to create your own nice concise tidy micro-report!