Chapter 1 Why GitHub

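Git is a version control system and a great tool for creating reproducible analyses. What is version control? It is a way of recording changes to your files over time so that you can see what changed and return to earlier versions. Ruby here is experiencing a lack of version control and could probably benefit from using git.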

Ruby is looking at her computer with a lot of folders with different variations on similar names. Ruby asks herself: Now was it “final_final_version_100%_up_to_date” or “final_version_edit5” that I was working from?

All of us at one point or another have created different versions of a file or document, but for analysis projects this can easily get out of hand if you don’t have a system in place. That’s where git comes in handy.

There are other version control systems as well, but git is the most popular, in part because it works with GitHub, an online hosting service for git-controlled files.

1.0.1 GitHub and git allow you to…

1.0.1.1 Maintain transparent analyses

Open and transparent analyses are a critical part of conducting open science. GitHub allows you to conduct your analyses in an open-source manner. Open science also allows others to better understand your methods and potentially borrow them for their own research, saving everyone time!

1.0.1.2 Have backups of your code and analyses at every point

Life happens: sometimes you misplace a file or your computer malfunctions. If you ever lose data on your computer or need to retrieve something from an earlier version of your code, GitHub allows you to recover what you lost.

Ruby’s computer shows a virus and has a temperature. Ruby says ‘Oh no! I lost data on my computer! Good thing all the work I have toiled on for years is on GitHub!’ The GitHub cat is in a cloud with a download sign with Ruby’s code.

1.0.1.3 Keep a documented history of your project

Over time, a lot happens in a project, especially when it comes to exploring and handling data. Sometimes the rationale behind decisions made about an analysis can get lost. GitHub keeps a record of communications and tracks the changes to your files so that you don’t have to revisit a question you already answered.

Ruby holds a magnifying glass and says 'Why did we write the code this way? I don’t remember… Good thing through git tracking I can look into this file’s history and remind myself how it became this.'

1.0.1.4 Collaborate with others

Analysis projects benefit greatly from good collaborations! But having multiple copies of code on multiple collaborators’ computers can be a nightmare to keep straight. GitHub allows people to work on the same set of code concurrently while still having a systematic way to integrate all the edits.

Ruby and Avi are both working on the code. Because they are both using git version control, they are able to merge their changes to the code base. And now the main code base contains both of their changes!

1.0.1.5 Experiment with your analysis

Data science projects often lead to side analyses that could be very worthwhile but might be scary to embark on if your code isn’t well version controlled. Git and GitHub allow you to pursue these side experiments without fear, since your main code is kept safe from your side venture.

Ruby says ‘I’m not sure if this side analysis I’m working on is a good idea or not, but I want to test it. Good thing I can make a separate branch and keep my original code safe from my experimenting.’ Her computer shows her main code and a branch off of it that says ‘test analysis’. After time and work go by, she may decide to incorporate her test analysis into her main code.