Lab 7: Collaborative Bio Data Science using GitHub via RStudio
Package(s)
Schedule
- 08.00 - 08.10: Midway evaluation
- 08.10 - 08.30: Recap of Lab 6
- 08.30 - 09.00: Lecture
- 09.00 - 09.15: Break
- 09.15 - 12.00: Exercises
Learning Materials
Please prepare the following materials
- Book: Happy Git and GitHub for the useR – Read chapters 1 (intro), 20 (basic terms), 22 (branching), 23 (remotes). Focus on the ideas rather than the syntax of specific commands, since we are not going to use those commands during the exercises
- Book: Introduction to Data Science - Data Analysis and Prediction Algorithms with R by Rafael A. Irizarry: Chapter 40 Git and GitHub – Some of the information overlaps with the previous book, but the important parts are the visualisation of basic git actions and the screenshots of how to perform them using RStudio
- Video: RStudio and Git - an Overview (Part 1) – Basic git concepts, for those who prefer listening to reading. The books, however, contain more information
- Video: How to use Git and GitHub with R – Basic git operations in RStudio, complementary to the second book. You can skip to 2:50, since we are not going to link to git manually anyway
Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book
Learning Objectives
A student who has met the objectives of the session will be able to:
- Explain why reproducible data analysis is important
- Identify associated relevant challenges
- Describe the components of a reproducible data analysis
- Use RStudio and GitHub (git) for collaborative bio data science projects
Exercises
Prologue
Using git is completely central for collaborative bio data science. You can use git not only for R, but for any code-based collaboration. All bioinformatics departments at the major players in Danish pharma are using git - Learning git is a key bio data science skill! Today, you will take your first steps - Happy Learning!
- T1: Find the GitHub repository for the ggplot2 R-package
- Q1: How many Contributors are there?
- T2: Find the GitHub repository for the Linux kernel
- Q2: How many Contributors are there?
- Q3: Discuss in your group why an organised approach to version control is central. Contrast this with the challenges you faced when collaborating on the previous course assignments.
Getting Started
First, make sure to read and discuss the feedback you got from last week’s assignment!
The following exercises have to be done in your groups! You must move at the same pace and progress together as a team through the exercises!
Be aware of which editor you are using in RStudio! You must work in the “Visual” editor instead of the “Source” editor. ALL OF YOU! If one of you works in a different editor, it could cause a conflict issue!
GitHub is the place for collaborative coding and different group members will have to do different tasks in a specific order, to make it through the exercises together, so… Team Up and don’t rush it!
First, select a team Captain, that person will have to carry out specific tasks. If Team is stated, then that refers to all group members and lastly if Crew is stated, then that is everyone but the Captain. Please note, that tasks are sequential, so if a task is assigned to the Captain, then the Crew has to await completion before proceeding!
Team
- Go to GitHub and login
- In the upper right corner, click on your profile picture. Then in the menu, click on “Your Organizations” and go into rforbiodatascienceYY (replace YY with year). If you are not a member, contact a TA NOW before proceeding. Next, go into repositories (top left).
Captain
- Click “New repository”
- Since you created the project from the organisation, the default owner of the repository is the organisation (rforbiodatascienceYY). Verify and keep that, please.
- Name the repository groupXX, where XX is your group number, e.g. 02
- Select Public
- Tick Add a README file
- Click Create repository
- Click Settings in the menu line starting with <> Code
- Under Access, click Collaborators and teams
- Click Add people
- Write a group member’s username and select the role “Maintain”.
- Click the correct suggestion
- Select Add username to this repository
- Repeat for all group members
Crew
- Refresh the page and go into the new repository. If you don’t see it, check your mail and accept your invitation to join the repository.
Team
- Click here to go to the course RStudio cloud server and login
- In the upper right corner, where it should say r_for_bio_data_science, click it
- Choose New Project...
- Select Version Control
- Select Git
- Under Repository URL:, enter https://github.com/rforbiodatascienceYY/groupXX, where again you replace YY with year and replace XX with your group number, e.g. 02.
- Under Project directory name:, enter lab07_git_exercises
- Under Create project as a subdirectory of:, make sure that it says ~/projects
- Click Create Project
Congratulations! You have just cloned your first GitHub repository!
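For those curious about what the New Project dialog did behind the scenes: it corresponds roughly to a single git clone command. The sketch below is a hedged illustration, not what RStudio literally runs. It assumes a local bare repository (remote.git, a made-up stand-in for the GitHub repository) so that it can be run anywhere without a network connection or credentials.

```shell
set -eu
work="$(mktemp -d)"   # scratch directory, nothing here touches your real projects
cd "$work"

# Placeholder identity for the single setup commit below
export GIT_AUTHOR_NAME="captain" GIT_AUTHOR_EMAIL="captain@example.com"
export GIT_COMMITTER_NAME="captain" GIT_COMMITTER_EMAIL="captain@example.com"

# Stand-in for the GitHub repository (assumption): a bare repository with a README
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/main   # call the first branch main
echo "# groupXX" > seed/README.md
git -C seed add README.md
git -C seed commit --quiet -m "Initial commit"
git clone --quiet --bare seed remote.git

# The dialog boils down to: git clone <Repository URL> <Project directory name>
git clone --quiet "$work/remote.git" lab07_git_exercises

cd lab07_git_exercises
ls -a                       # README.md plus the hidden .git folder
git remote get-url origin   # where the clone came from
```

With the real repository, the clone line would instead point at https://github.com/rforbiodatascienceYY/groupXX.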
In your Files pane, you should now see the contents of the cloned repository, and in your Environment pane, you should see a new git tab. Click the git tab to see what it contains.
Setting up your Credentials and Personal Access Token (Team)
GitHub doesn’t know who you are. So far, you’re just someone who cloned a GitHub repository and now you want to do all sorts of stuff! We can’t have that, so let’s introduce you properly!
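- In the Console, run the command:
usethis::use_git_config(user.name = "YOUR_GITHUB_USERNAME", user.email = "THE_EMAIL_YOU_USED_TO_CREATE_YOUR_GITHUB_ACCOUNT")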
- In the Console, run the command:
usethis::create_github_token()
- Login
- Under Note, where it says DESCRIBE THE TOKEN'S USE CASE, delete the text and write e.g. R for Bio Data Science lab 7 git exercises
- Under Expiration, where it says 30 days, change that to 90 days
- Do not change any other setting, but simply scroll down and click Generate token
- A new Personal access tokens (classic) page will appear, stating your personal access token, which starts with ghp_, go ahead and copy it
- Store the PAT somewhere safe, e.g. in a password management tool. It will serve as your password when connecting to the repository via R.
- Again in the console, run the below command and enter your PAT when prompted for the password:
gitcreds::gitcreds_set()
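The usethis::use_git_config() call stores your name and email in git's configuration, so that every commit you make is labelled with them. As a hedged sketch of the same idea on the command line, the block below writes the two settings with git config. It deliberately uses --local inside a scratch repository (an assumption made for safety), so running it does not overwrite your real user-level settings.

```shell
set -eu
work="$(mktemp -d)"
git init --quiet "$work/scratch"
cd "$work/scratch"

# --local writes to this repository's .git/config only
git config --local user.name "YOUR_GITHUB_USERNAME"
git config --local user.email "THE_EMAIL_YOU_USED_TO_CREATE_YOUR_GITHUB_ACCOUNT"

# Read the values back
git config --local --get user.name
git config --local --get user.email
```

Replacing --local with --global would write the same keys to your user-level configuration, which applies to all your repositories.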
Stop, wait a minute and make sure, that ALL group members are at this stage of the exercises.
Making your Credentials and Personal Access Token stick around (Team)
The thing is, we’re actually working on a Linux server here. This entails that your credentials and PAT exist in a cache, which is cleared every 15 minutes. This. Is. Annoying! So let’s fix it!
First of all, RStudio is a (G)raphical (U)ser (I)nterface, meaning that it allows you to do pointy-clicky stuff, but at the end of the day, everything happens at the command line. That also goes for git, so your button-pushing simply gets converted into commands, which are executed at the command line. If you are not comfortable with the command line, don’t worry, the pointy-clicky stuff will be sufficient for now! If you are comfortable with the command line, I advise you to get comfortable with git on the command line as well, you do get some extra bang-for-the-buck!
Now, a brief visit of the command line:
- In the Console pane, click the Terminal tab, which gains you access to the system underlying RStudio
- You should see something like user@pupilX:/net/pupilX/home/people/user/projects/lab07_git_exercises$
- Enter ls -a, this will (l)i(s)t (a)ll the files, compare with your Files pane, you will see that RStudio is “hiding” something from you, namely the .git folder
- Enter ls -a .git and you will see the content of the folder, this is where all the git-magic happens
- See the config file? That holds the configuration for this repository. Your user-level (global) configuration is kept separately, enter git config --global --list to see it
- Recall you entered your GitHub username and mail? This is where it ended up! Note the credential.helper=cache, which tells us that the credentials are being cached. Now, enter git config --global credential.helper 'cache --timeout=86400'. Hint: You can paste with Ctrl-Shift-V in the terminal on Windows.
- Re-run the command git config --global --list and confirm the change
- Go back to your Console and run the command gitcreds::gitcreds_set(). Select option 2, and re-enter your PAT.
- If you didn’t get a list of options in the prior step, restart your R session and retry.
Congratulations! You have now used the command line and you will forever be part of an elite few, who know that everything you see in hacker-movies is BS, except perhaps for Wargames and Mr. Robot! Also, your cached PAT will now be good for 24 hours.
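Where does 86400 come from? The timeout is given in seconds, and 24 hours is 24 * 60 * 60 seconds. The sketch below makes that arithmetic explicit and sets the same credential.helper value. It uses --local in a scratch repository (an assumption, so that the example leaves your real settings alone), whereas the exercise uses --global.

```shell
set -eu
work="$(mktemp -d)"
git init --quiet "$work/scratch"
cd "$work/scratch"

# 24 hours expressed in seconds
timeout=$((24 * 60 * 60))

# Same setting as in the exercise, but only for this scratch repository
git config --local credential.helper "cache --timeout=${timeout}"

# Read it back to confirm the change
git config --local --get credential.helper
```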
Your first collaboration
Team
- Create a new Quarto document, title it student_id and save it as student_id.qmd, where student_id is your… You guessed it! Make sure that “Use visual markdown editor” is checked.
- In the Environment pane, click the git tab
- Tick the 3 boxes under Staged
- Click Commit
- In the upper right corner, add a Commit message, e.g. “First commit by student_id”
- Click the Commit button
- A pop-up will give you details on your commit, look through them and then click Close
- Now, very important: ALWAYS click the Pull button BEFORE clicking the Push button
- Clicking Pull, you should see “Already up to date.”
- Then click Push
- Q4: Discuss in your group what each of the steps did and why they are performed?
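As input for the discussion in Q4, the buttons you just clicked map roughly onto four git commands: add (stage), commit, pull and push. The following is a minimal sketch under stated assumptions: a local bare repository (remote.git) stands in for GitHub, and the identity and file contents are placeholders. It is not necessarily the exact sequence RStudio runs.

```shell
set -eu
work="$(mktemp -d)"   # scratch directory, nothing here touches your real projects
cd "$work"

# Placeholder identity used for the commits in this sketch
export GIT_AUTHOR_NAME="student_id" GIT_AUTHOR_EMAIL="student_id@example.com"
export GIT_COMMITTER_NAME="student_id" GIT_COMMITTER_EMAIL="student_id@example.com"

# Setup: a local bare repository stands in for GitHub (assumption),
# seeded with a README like the one created with the group repository
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/main   # call the first branch main
echo "# groupXX" > seed/README.md
git -C seed add README.md
git -C seed commit --quiet -m "Initial commit"
git clone --quiet --bare seed remote.git
git clone --quiet "$work/remote.git" lab07_git_exercises
cd lab07_git_exercises

# The exercise steps
echo "title: student_id" > student_id.qmd
git add student_id.qmd                               # tick the box under Staged
git commit --quiet -m "First commit by student_id"   # commit message + Commit button
git pull --quiet --no-rebase                         # Pull first: nothing new yet
git push --quiet                                     # Push: publish your commit

git log --oneline   # your commit on top of the initial one
```

Pulling before pushing matters because git refuses a push when the remote holds commits that you do not have yet.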
Your next collaboration
That first collaboration was easy right!? Well, you were all working on different files…
Captain
- Create a new Quarto Document, title it “Group Document” and save it as group_document.qmd
- Using markdown headers, create one section for each group member, incl. yourself. Here, you can use your names, student ids or whatever you deem appropriate
- Again in the Environment pane, make sure you are in the git tab and then tick the box next to group_document.qmd. If you do not see the document in this window after saving, refresh the browser and it should appear.
- Click Commit
- Add a commit message, e.g. “Add group document”
- Click the Commit button
- Click Pull (You should be “Already up to date.”)
- Click Push
- Click Close
- Again, go to the group GitHub and confirm that you see the new document you just created
Team
- Click Pull and confirm that you now also have the file group_document.qmd
- Open group_document.qmd and find your assigned section
- In your section and your section only, enter some text, add a few code chunks with some R-code
- Make sure to save the document
- Now again, find the group_document.qmd in the git tab of the Environment pane and tick the box under Staged
- Click Commit
- Note how your changes to the document are highlighted in green
- Add a commit message, e.g. “Update the STUDENT_ID section” and click Commit
- Click Close
- Click Pull
- Click Push
- Go to the group GitHub, find the group_document.qmd and click it, do you see your changes?
Once everyone has added to their assigned section, everyone should do a pull/push, so everyone has the complete version of the group_document.qmd.
Your first branching
Ok, that’s pretty great so far - Right? The thing is… Consider the ggplot2 repository that you found in T1. Thousands of companies and even more thousands of people rely on ggplot2 for advanced data visualisation. What would happen if you wanted to add a new feature or optimise an existing one, while people were actively installing your package? They would get your code in whatever stage it happened to be in, which may or may not be functional - Enter branching!
Below is an illustration. Consider Master the stable version that people can download and use. Your Work is then the feature or update that you personally are working on, and Someone Else’s Work is another person’s or team’s work on a different feature or update. Before the last green circle, note how both the Your Work and Someone Else’s Work branches are merged into the Master branch.
There can be even more branches
- Q5: Again, find the ggplot2 GitHub site and see if you can find how many branches there are?
Team
- Go to your RStudio session and in the git tab of the Environment pane, click New Branch
- In branch name, enter your student id
- You should see a pop-up Branch 'STUDENT_ID' set up to track remote branch 'STUDENT_ID' from 'origin'., go ahead and click Close
- Next to where you clicked New Branch just before, it should now say STUDENT_ID, click it and confirm that you see STUDENT_ID and main
- Look at the illustration above and make sure you all follow: from the original branch main, which is equivalent to Master, you have created a new branch STUDENT_ID, which is equivalent to Your Work
- Click STUDENT_ID and you will get a confirmation that you are already on that branch and that you are up-to-date, click Close
- In the group_document.qmd under your section, using markdown, enter a new sub-header and name it e.g. New feature or New analysis. Make sure that you all use the “Visual” editor for the .qmd file!
- Enter some text, a few code chunks with a bit of R-code of your choosing and save the document
- Again, in the git tab, tick the box under Staged and click Commit
- Add a commit message, e.g. New feature from STUDENT_ID and click Commit
- Click Close, Pull, Close, Push and Close and then close the commit window
- Go to your group GitHub and confirm that you now have at least 2 branches
- Make sure you are in the main branch and then click the group_document.qmd, you should now not see your new feature/analysis that you added
- On the left, where it says main, click and select your STUDENT_ID, you should now see your new feature/analysis that you added
Congratulations! You have now successfully done your first branching!
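On the command line, the branching steps can be sketched as below. This is a hedged illustration: a local bare repository (remote.git) stands in for GitHub, the identity and file contents are placeholders, and pushing with -u right after creating the branch is an assumption about what produces the "set up to track" pop-up.

```shell
set -eu
work="$(mktemp -d)"   # scratch directory, nothing here touches your real projects
cd "$work"

# Placeholder identity used for the commits in this sketch
export GIT_AUTHOR_NAME="student_id" GIT_AUTHOR_EMAIL="student_id@example.com"
export GIT_COMMITTER_NAME="student_id" GIT_COMMITTER_EMAIL="student_id@example.com"

# Setup: a local bare repository stands in for GitHub (assumption)
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/main   # call the first branch main
echo "# Group Document" > seed/group_document.qmd
git -C seed add group_document.qmd
git -C seed commit --quiet -m "Add group document"
git clone --quiet --bare seed remote.git
git clone --quiet "$work/remote.git" lab07_git_exercises
cd lab07_git_exercises

# New Branch: create the branch locally and publish it with tracking
git checkout --quiet -b STUDENT_ID
git push --quiet -u origin STUDENT_ID

# Add the new feature on the branch, then commit/pull/push as before
echo "## New feature" >> group_document.qmd
git add group_document.qmd
git commit --quiet -m "New feature from STUDENT_ID"
git pull --quiet --no-rebase
git push --quiet

git branch --all   # local and remote branches: main and STUDENT_ID
```

The main branch on the remote is untouched by all of this, which is why the new feature only shows up once you select the STUDENT_ID branch.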
Your first branch merging
Team
- Go to your group’s GitHub page, at the top it should say STUDENT_ID had recent pushes..., click Compare & pull request
- Your commit message will appear and where it says Leave a comment, add a comment like e.g. I'm done, all seems to be working now! or similar
- Click Create pull request
- It should now say This branch has no conflicts with the base branch, confirm and click Merge pull request
- Click Confirm merge after which it should now say Pull request successfully merged and closed
- You have now fully merged, so go ahead and click Delete branch
- Revisit the previous illustration and compare with your branch workflow, make sure that everyone in the group is on par here
- Finally return to your RStudio session and make sure you switch to the main branch
Congratulations! You have now successfully done your first branch merge!
But wait, what was this Pull request??? What you actually did, was:
1. Created a new branch
2. Completed a new feature/analysis
3. Push’ed the new feature/analysis to GitHub
4. Created a Pull request for merging your branch STUDENT_ID into the main branch
5. Approved and completed the Pull request
Think again about e.g. the ggplot2 repo: if anyone could create a new branch and then do as described above, there would be no way of doing quality control. Therefore, such pull requests typically have to be approved by someone. This can be someone who is close to the code, e.g. in the case of ggplot2, that’d be someone like Hadley. In a company, that might be a senior developer approving junior developers’ pull requests. At one point you might have seen something like “The main branch is unprotected”, this is exactly that!
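The merge itself is plain git. The review and approval around it are what GitHub adds with the pull request. As a rough local sketch of the branch, merge and delete cycle, the block below works in a single scratch repository with placeholder content. Using --no-ff to force a merge commit is an assumption chosen to mimic the result of the Merge pull request button.

```shell
set -eu
work="$(mktemp -d)"   # scratch directory, nothing here touches your real projects
cd "$work"

# Placeholder identity used for the commits in this sketch
export GIT_AUTHOR_NAME="student_id" GIT_AUTHOR_EMAIL="student_id@example.com"
export GIT_COMMITTER_NAME="student_id" GIT_COMMITTER_EMAIL="student_id@example.com"

git init --quiet repo
cd repo
git symbolic-ref HEAD refs/heads/main   # call the first branch main
echo "# Group Document" > group_document.qmd
git add group_document.qmd
git commit --quiet -m "Add group document"

# Create a branch and complete a feature on it
git checkout --quiet -b STUDENT_ID
echo "## New feature" >> group_document.qmd
git commit --quiet -am "New feature from STUDENT_ID"

# Merge the branch into main; --no-ff records an explicit merge commit
git checkout --quiet main
git merge --quiet --no-ff -m "Merge branch 'STUDENT_ID'" STUDENT_ID

# Delete the merged branch, like the Delete branch button
git branch --quiet -d STUDENT_ID

# Text version of the branch illustration
git log --oneline --graph
```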
Your first merge conflict
Okay, that’s all good and well. Seems easy and straightforward, right? Well, that is as long as we don’t have a conflict. A conflict is when two or more changes to one file are not compatible, consider:
Top 10 Bio Data Science Languages of ALL Time:
10. It is
9. Impossible
8. to rank them
7. Because
6. programming is subjective
5. and everyone
4. has
3. different
2. tastes
1. R
In fact, let’s screw things up!
Captain
- Go to the RStudio session and where you usually click Quarto Document..., this time let’s just create a simple Text File
- Copy/paste the Top 10 Bio Data Science Languages of ALL Time into the file
- Save the file as best_ds_langs.txt
- Make sure you are in the main branch
- Do a commit/pull/push and check the file ended at your group GitHub
Crew
Rrrrrrr… Let’s commit mutiny! (Double pun intended)
- Make sure you are in the main branch and then hit Pull
- Open the best_ds_langs.txt
- Each of you separately (wrongfully) replace R with an (inferior) bio data science language of your choice (you can really write anything)
- Now, this time we do not pull first, simply commit and then push
- Now you can pull
- You will now get notified about a merge conflict
- Close and re-open the best_ds_langs.txt file, it should look something like:
Top 10 Bio Data Science Languages of ALL Time:
10. It is
9. Impossible
8. to rank them
7. Because
6. programming is subjective
5. and everyone
4. has
3. different
2. tastes
<<<<<<< HEAD
1. C++
=======
1. Python
>>>>>>> 85283ca3246b7f7462ab633085f92ac5f173d3e7
Captain
Get your Crew in order!
- Simply edit the file so that you delete everything below 2. and add the final line, which emphasises that 1. R is superior!
- The box under Staged looks a bit odd, right? Just click it
- Click Commit and add a commit message, e.g. Fixed merge conflict, got crew back in order!
- Click Commit, Close, Pull, Close, Push, Close and close the commit window
- Return to your GitHub group and confirm all is in order and you can continue with confidence knowing that R is superior! (Check the file to make sure)
Congratulations! You have now fixed your first merge conflict!
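If you want to replay the whole conflict from start to finish, the sketch below reproduces it on the command line. It makes several assumptions: a local bare repository (remote.git) stands in for GitHub, two clones (crew_a and crew_b, made-up names) play two crew members, and the file is shortened to its last two lines. The conflict is resolved in the clone where it occurred.

```shell
set -eu
work="$(mktemp -d)"   # scratch directory, nothing here touches your real projects
cd "$work"

# Placeholder identity used for the commits in this sketch
export GIT_AUTHOR_NAME="student_id" GIT_AUTHOR_EMAIL="student_id@example.com"
export GIT_COMMITTER_NAME="student_id" GIT_COMMITTER_EMAIL="student_id@example.com"

# Setup: bare repository standing in for GitHub (assumption) and two clones
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/main   # call the first branch main
printf '2. tastes\n1. R\n' > seed/best_ds_langs.txt
git -C seed add best_ds_langs.txt
git -C seed commit --quiet -m "Add best_ds_langs.txt"
git clone --quiet --bare seed remote.git
git clone --quiet "$work/remote.git" crew_a
git clone --quiet "$work/remote.git" crew_b

# Crew member A edits the last line and pushes first
printf '2. tastes\n1. Python\n' > crew_a/best_ds_langs.txt
git -C crew_a commit --quiet -am "Python it is"
git -C crew_a push --quiet

# Crew member B edits the same line without pulling first
cd crew_b
printf '2. tastes\n1. C++\n' > best_ds_langs.txt
git commit --quiet -am "C++ it is"
git push --quiet 2>/dev/null || echo "push rejected: the remote has a commit you lack"
git pull --quiet --no-rebase >/dev/null 2>&1 || echo "pull stopped: merge conflict"
conflicted="$(cat best_ds_langs.txt)"   # keep a copy of the file with conflict markers
printf '%s\n' "$conflicted"

# Resolve: rewrite the file without markers, stage it, commit, then push
printf '2. tastes\n1. R is superior!\n' > best_ds_langs.txt
git add best_ds_langs.txt
git commit --quiet -m "Fixed merge conflict, got crew back in order!"
git push --quiet
```

Staging the edited file is how you tell git that the conflict is resolved, which is why the box under Staged looks odd until you click it.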
Lastly, click the History button next to the Push button and explore the history of how your branches and commits have changed through these exercises.
The .gitignore file
You may have noticed the .gitignore file?
One crew member
- Go to your RStudio session, find the .gitignore file and click it
It contains a list of files and folders, which should not end up in your git repo, i.e. files which should be ignored. An important aspect of working with GitHub is that GitHub is meant for code, not data! Let’s say we had a data and a data/_raw/ folder, we could add those to the .gitignore file to avoid the data being included in our commits. Currently, files like .RData and folders like .Rproj.user should be listed in the .gitignore file, as these are unique to each of your own sessions and shouldn’t be committed to the repo.
- Update the .gitignore file and get the updated version to GitHub
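To see the effect of a .gitignore file, here is a minimal sketch in a scratch repository. The patterns .Rproj.user, .RData and data/ follow the text above, while the file name measurements.csv is made up for illustration.

```shell
set -eu
work="$(mktemp -d)"
git init --quiet "$work/repo"
cd "$work/repo"

# Files that should stay out of the repository (hypothetical file names)
mkdir -p data/_raw
touch data/_raw/measurements.csv .RData

# One pattern per line; a trailing slash matches a folder and everything in it
printf '%s\n' '.Rproj.user' '.RData' 'data/' > .gitignore

git status --short                               # only .gitignore shows up as untracked
git check-ignore -v data/_raw/measurements.csv   # reports which pattern matched
```

Because data/ covers the whole folder, data/_raw/ does not need its own line.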
Summary
You have now gotten to play around with collaborative coding using git - Well done!
If you are curious for more, please feel free to play around with a new file: edit, commit/pull/push etc. Perhaps also take an extra look at GitHub, explore and learn more 👍
GROUP ASSIGNMENT (Important, see: how to)
First of all, here we have just scratched the surface. You can read much more here.
Your assignment this time, will be to:
- Go to this post on “PCA tidyverse style” by Claus O. Wilke, Professor of Integrative Biology
- Again, create a reproducible micro-report together, this time, where you do a code-along with either the data in the post, the gravier data, and/or any other dataset.
- Your hand in will be a link to your micro-report in your GitHub group repository
- Note, focus here is on the git learning objectives and doing a code-along, which is not a copy/paste of the code, but using it as inspiration to create your own nice concise tidy micro-report!
- And also follow the Course Code Styling
- HOW TO HAND IN: Go to http://github.com/rforbiodatascienceYY/groupXX, replace YY with year and XX with your group number, then copy the link and paste it into an empty text (.txt) file. Hand in this text file. No need to zip your html file, since we are accessing it through the repository!