For this worksheet we will be practicing using git/GitHub for collaborative code development. We’ll be using your final project repo as an example. Specifically, we will be adding all of your names to the authors section, and dealing with the merge conflicts that this will cause.
For this lab, assign each member a number from 1 to 4 (or 1 to 3 for a team of three). Throughout this lab I will refer to steps that “Member 1,” “Member 2,” etc. should take. Be sure not to skip ahead or do steps out of order, or the problems you encounter won’t be the ones I wrote this guide to help with.
To start with, have each team member open the R Studio project for your final package and your repo’s page on GitHub.
Today we’ll be using git as a team for the first time. git requires you to be very exact, and that can sometimes be tedious. However, git is simply the best widely used method for coding collaboratively. It is used in business, government, research, and anywhere else significant coding is happening. You need to understand these tools to be an effective team member in the data science world.
Not only does using git to collaborate confer all the benefits of using git alone, such as version control and backups, it will also make working on the same code base much easier than alternative methods. Because git tracks code files on a line-by-line basis, it is possible for multiple people to work on the same files simultaneously with limited possibility for issues.
Issues do still happen though, and today we will force some of those issues so we can practice resolving them.
To start, we are all going to add a new file to our project repo. You can imagine this is any kind of file you would like (code, help files, etc.); the process is the same. On each of your computers, create a new folder called “scrap.” Make sure the name is all lower case and that the folder is in the base directory of your project. Inside that folder, create and save a new R Script which contains a print() statement. It doesn’t matter what the statement says. Then commit the new file.
Confirm all members have created, saved, and committed this file on their computers before moving on.
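If you prefer the terminal to the R Studio git pane, the same steps can be sketched as shell commands. This is only an illustration under a few assumptions: it runs in a throwaway repository made with mktemp rather than in your real project, and the file name member1.R, the commit message, and the name and email are placeholders I made up.

```shell
set -eu

# Work in a throwaway repository so nothing here touches your real project.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Member One"            # placeholder identity
git config user.email "member1@example.com"  # placeholder identity

# Create the lower-case "scrap" folder in the base directory and add a script.
mkdir scrap
echo 'print("Hello from Member 1")' > scrap/member1.R   # hypothetical file name

# Stage and commit the new file.
git add scrap/member1.R
git commit --quiet -m "Add scrap script"

# An empty status listing means there is nothing left uncommitted.
git status --short
git log --oneline
```

In your real project you would skip the setup lines and run only the mkdir, git add, and git commit steps from the project's base directory.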
At this point Member 1 should push their changes. This will add their file to the GitHub repo. Go look on your project repo GitHub page to see the new folder and file appear. You will need to refresh the page.
Now, Member 2 should try to push their changes. They will get an error along the lines of the following:
! [rejected] HEAD -> main
error: failed to push some refs to <YOUR REPO>
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g. 'git pull ...') before pushing again.
This happens, as the hint says, because there are new files on GitHub you do not have, so git won’t let you push. To resolve this, we can follow the hint’s advice and do a pull. This will update your local files with Member 1’s work, and now Member 2 should be able to push.
Repeat this process until all members have all other members’ files on their computer.
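The push, reject, pull, push cycle can be reproduced on a single machine, which may help if you want to see it again later. The following is a sketch under stated assumptions: a local bare repository (remote.git) stands in for GitHub, the folders member1 and member2 stand in for two team members' computers, and all file names and identities are made up. I also assume each member gave their script a different file name.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# Stand-in for GitHub: a bare repository whose main branch has one commit.
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/main
git -C seed -c user.name="Setup" -c user.email="setup@example.com" \
    commit --quiet --allow-empty -m "Initial commit"
git clone --quiet --bare seed remote.git

# Each member clones the shared repo and commits their own scrap file.
for m in member1 member2; do
  git clone --quiet remote.git "$m"
  git -C "$m" config user.name "$m"
  git -C "$m" config user.email "$m@example.com"
  mkdir "$m/scrap"
  echo "print(\"Hello from $m\")" > "$m/scrap/$m.R"
  git -C "$m" add scrap
  git -C "$m" commit --quiet -m "Add scrap script for $m"
done

# Member 1 pushes first, which succeeds.
git -C member1 push --quiet origin main

# Member 2's push is rejected because the remote has work they lack.
if git -C member2 push origin main 2>/dev/null; then
  echo "unexpected: push succeeded"
else
  echo "push rejected, pulling first"
fi

# Pull (merging Member 1's commit into the local branch), then push again.
git -C member2 pull --quiet --no-rebase --no-edit origin main
git -C member2 push --quiet origin main

# Member 2 now has both files.
ls member2/scrap
```

The --no-rebase flag asks git to merge the remote work rather than replay your commits on top of it; I chose it here only so the sketch behaves the same regardless of how git is configured on your machine.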
In the previous section, we were creating files which we then pushed to the main branch of our project. This is not good practice. You should think of main as the canonical, pristine version of your code base. Put simply, only things that have been examined and tested by multiple team members should be included in main.
The proper way to conduct work on a shared project is within a branch. Recall our branch diagram from class:
When you create a branch, you split your work off into a separate little universe. No work you do here will change what main is like. This means you can experiment on your branch, try things, even break things, and everyone else can keep using the clean code on main.
To create a new branch, look to the upper right corner of the git pane in R Studio. You’ll see a button that says “New Branch.” Click on it and you will see the following pop up.
Name your branch “<YOUR NAME>_worksheet”. Make sure that “Sync branch with remote” is checked and click the “create” button. You will now be on your own branch within the project!
Whatever changes and commits you make will now be saved to this branch, rather than main. Stay on this branch until I tell you to switch back.
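For reference, here is a rough command-line equivalent of creating a branch and publishing it. It is a sketch, not a description of what R Studio runs internally: I am assuming that “Sync branch with remote” amounts to pushing the new branch and linking it to its remote copy, and the bare repository remote.git, the folder project, and the branch name member1_worksheet are stand-ins I made up.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# Stand-in for GitHub: a bare repository whose main branch has one commit.
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/main
git -C seed -c user.name="Setup" -c user.email="setup@example.com" \
    commit --quiet --allow-empty -m "Initial commit"
git clone --quiet --bare seed remote.git
git clone --quiet remote.git project
cd project

# Create the branch and switch to it in one step.
git checkout --quiet -b member1_worksheet   # use your own name here

# Publish the branch and link it to its remote copy.
git push --quiet -u origin member1_worksheet

# Show the current branch and the remote branch it tracks.
git rev-parse --abbrev-ref HEAD
git rev-parse --abbrev-ref '@{upstream}'
```

Once the branch tracks a remote copy, a plain push or pull from that branch goes to the matching branch on the remote rather than to main.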
While working on separate files is fine, if you are working on a project together, you probably want to be able to edit the same files. git is great for this usually, but sometimes things can go wrong. We’re going to practice this now.
Each member should open the DESCRIPTION file of their final project repo in R Studio. You will notice there is a section called Authors@R. Whatever is included here will be understood as the creators of this package. We want to add each of your names. On each of your computers, follow the template and add your first and last names, along with your email, to the file. Commit your changes to your branch and push.
You won’t encounter any problems at this point because you are each on your own branch, and thus not modifying the same copy of the file. This would also be true if you were working on the same R scripts!
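To see why separate branches keep you out of each other's way, here is a small sketch in which two members change the very same line of the same file and both pushes still go through. The assumptions: a local bare repository (remote.git) stands in for GitHub, the one-line DESCRIPTION is a heavily simplified placeholder rather than a real package file, and the member and branch names are made up.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# Stand-in for GitHub, with a one-line placeholder DESCRIPTION on main.
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/main
git -C seed config user.name "Setup"
git -C seed config user.email "setup@example.com"
echo 'Authors@R: person("First", "Last")' > seed/DESCRIPTION
git -C seed add DESCRIPTION
git -C seed commit --quiet -m "Add template DESCRIPTION"
git clone --quiet --bare seed remote.git

for m in member1 member2; do
  git clone --quiet remote.git "$m"
  git -C "$m" config user.name "$m"
  git -C "$m" config user.email "$m@example.com"
  git -C "$m" checkout --quiet -b "${m}_worksheet"
  # Both members edit the very same line of the very same file.
  echo "Authors@R: person(\"$m\", \"Lastname\")" > "$m/DESCRIPTION"
  git -C "$m" commit --quiet -am "Add $m as author"
  # Neither push is rejected, because each goes to a different branch.
  git -C "$m" push --quiet -u origin "${m}_worksheet"
done

# main on the remote still holds the untouched template.
git -C remote.git show main:DESCRIPTION
git -C remote.git branch --list
```

The disagreement between the two versions only has to be settled later, when the branches are merged into main.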
Once you are happy with the work you’ve done on your branch, you can try to merge your work into main. This will take your changes and make them part of the canonical code that everyone will work from. The process of merging your code is done through a pull request on GitHub, so named because you request that your changes be pulled into main.
Whenever you push your branch, you can refresh the GitHub page for your repo, and you should see an alert like the following appear:
Every member should click on the “Compare & pull request” button. If you missed this notice, you can also click the button that says “Branches”, find your branch, and create a pull request there.
Creating a pull request will look like the following with these elements:
Once you create your pull request, the page will change slightly, and will look like the following:
All the things you entered previously will be there, and you can modify them if you want. This screen also has a few new features.
Each member should go to the pull request page for their branch and assign another member on the team as a Reviewer. You will get an email from GitHub once you have been assigned. Click on the link in that email to go to the pull request you are reviewing, then go to the “Files changed” tab.
On this tab, you will be able to see all of the files the person has changed in your team repo. It will look similar to the following:
Every file will be listed. If a new line was added to that file, it will be highlighted in green. If a line was removed, it will be highlighted in red. For every line, you can click on the plus sign next to the line number to leave comments to the author. Once you have made all your comments, you can scroll back to the top, and in the upper right, click the “Finish your review” button.
The author will be notified of your comments, and can work to address them. They can make the changes on their computer, then push them, and the pull request will be updated. In this way, teams can collaboratively work on code before it gets integrated into main. Try out this process of leaving comments for each other now, but do not confirm the pull request yet.
Sometimes there will be conflicts, meaning two people worked on the same file in a way that git can’t combine by itself. When this happens, it will need you to pick the “real” version of the code.
Member 1 should go to their pull request, and click on the “Merge pull request” button. This will add their name to the DESCRIPTION file on main.
Member 2 should then go to their pull request. They will see that there is a conflict, and the following message:
It is telling you that some file you modified now has a conflict with some file already on main. In this case, it is because now your version of DESCRIPTION cannot be merged with the version on main. This is because Member 1 added their name in the same spot you added your name. We thus need to resolve the conflict.
We can do this right in GitHub if we want. Click on the grey “Resolve conflicts” button, and it will take you to a page showing the source of the conflicts. In this case, it will be the DESCRIPTION file, and will look similar to the following:
This page is a text editor where you can edit the files in question to create the “real” version of the file. The red bars show you the two conflicting versions of the code occupying the same lines: one version before the =======, one version after. Modify this code into a unified version similar to the following (with the <<<<<<<, =======, and >>>>>>> marker lines deleted), then click the “Mark as resolved” and “Commit merge” buttons. Be sure to include the comma!
You have now resolved the conflict, and you can merge your branch into main!
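The same conflict can be produced and resolved entirely on your own machine, which is a safe way to practice. What follows is a sketch under stated assumptions, not the exact steps GitHub performs: the DESCRIPTION here is a tiny stand-in with made-up names and emails, write_description is a helper I invented for the demo, and the resolution is done by rewriting the file rather than editing it by hand.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo"
git config user.email "demo@example.com"

# Hypothetical helper: writes a tiny stand-in DESCRIPTION.
# $1 is the text placed inside c( ... ).
write_description() {
  printf 'Package: demo\nAuthors@R: c(\n%s\n  )\n' "$1" > DESCRIPTION
}

write_description '    person("First", "Last", email = "first.last@example.com")'
git add DESCRIPTION
git commit --quiet -m "Add template DESCRIPTION"

# Each member replaces the template line on their own branch.
git checkout --quiet -b member1_worksheet main
write_description '    person("Member", "One", email = "one@example.com")'
git commit --quiet -am "Add Member One as author"

git checkout --quiet -b member2_worksheet main
write_description '    person("Member", "Two", email = "two@example.com")'
git commit --quiet -am "Add Member Two as author"

# Member 1's work is merged into main first; this is clean.
git checkout --quiet main
git merge --quiet member1_worksheet

# Member 2 brings main into their branch and hits the conflict.
git checkout --quiet member2_worksheet
if git merge main >/dev/null 2>&1; then
  echo "unexpected: merge succeeded"
else
  echo "conflict, DESCRIPTION now contains:"
  cat DESCRIPTION
fi

# Resolve by writing the unified version: both people, separated by a comma.
write_description '    person("Member", "One", email = "one@example.com"),
    person("Member", "Two", email = "two@example.com")'
git add DESCRIPTION
git commit --quiet --no-edit

# With the conflict resolved, the branch merges into main cleanly.
git checkout --quiet main
git merge --quiet member2_worksheet
cat DESCRIPTION
```

In the conflicted file, Member 2's line sits between the <<<<<<< HEAD and ======= markers, and the line that came from main sits between ======= and >>>>>>> main. The resolved version keeps both person() entries and drops all three marker lines.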
Member 3 (and Member 4) should do the same. Everyone should then go back to R Studio, switch back to the main branch, and press “Pull” to get the changes onto their machines.
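On the command line, this last step might look like the sketch below. As before, a local bare repository (remote.git) stands in for GitHub, the folders mine and teammate are made-up stand-ins for two computers, and an empty commit pushed by the teammate stands in for the merged pull requests.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# Stand-in for GitHub: a bare repository whose main branch has one commit.
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/main
git -C seed -c user.name="Setup" -c user.email="setup@example.com" \
    commit --quiet --allow-empty -m "Initial commit"
git clone --quiet --bare seed remote.git

# Your machine: a clone that is still sitting on the worksheet branch.
git clone --quiet remote.git mine
git -C mine checkout --quiet -b member2_worksheet

# Meanwhile main moves forward on the remote (standing in for merged pull requests).
git clone --quiet remote.git teammate
git -C teammate -c user.name="Teammate" -c user.email="teammate@example.com" \
    commit --quiet --allow-empty -m "Merge pull requests"
git -C teammate push --quiet origin main

# Switch back to main first, then pull to bring it up to date.
git -C mine checkout --quiet main
git -C mine pull --quiet --ff-only origin main
git -C mine log --oneline
```

The --ff-only flag makes the pull refuse to do anything other than move your main forward to match the remote, which is what you want when you have made no commits directly on main.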
It may be a fair bit of work, but if you follow this method, teams of hundreds of people can work on the same code base at the same time. When the alternative is emailing around dozens of files, hoping you have the most recent one, you can start to see how this would be preferred. Additionally, the whole process we went through today can be linked with the issue and Kanban board system we learned about previously to really help keep track of your tasks.
While it is probably overkill for a project of this size, mastering these skills will make you much more valuable in a team setting in the real world. I encourage you to try your best and treat this as training for a real world project.