Readable, Reproducible Analysis | Perry Haaland Faculty Page

Motivation

Writing code is an essential part of data science. To be successful in a modern workplace, a statistician or data scientist needs to write code that

- is readable by others, especially other team members and leaders,
- lays out the logic of what is being done in plain language,
- can be seen, via inspection of results and checks in the code, to be correct,
- can be revisited in a meaningful way at some point in the future, and
- can easily be rerun to verify that the analysis can be reproduced.

Students may be confused by the difference between writing code for yourself and writing code for others. Writing code in a way that supports team success goes considerably beyond writing code that only takes a direct path to the right answer. Getting a “right answer” that can’t be reproduced or explained to others will not advance your career.

Notebook formats such as R Markdown and Quarto allow you to integrate text, code, comments, figures, and statistical results. These capabilities are the foundation of reproducible analysis. In particular, they make it much easier to know what was done, why it was done, how it was done, and what the results were. If you are working in a group, this level of information and documentation facilitates collaboration.

Most important projects in industry get revisited from time to time. An important result may need to be updated or revised six months or a year after its initial completion. By then, the original analyst may have moved on to another position or have been focused on other projects for some time. It is therefore important for you and others to be able to look at the code and results at a later date. A well-documented notebook makes it much easier for someone unfamiliar with your code to understand what you did. (You will eventually become unfamiliar with your own code, too!)

When you are working on a team, there is typically a manager who will want to periodically review your code and results. You will have much better relationships with that person if your work is easy to read and understand and your code is well documented.

Best Practices for Readable, Reproducible Code

Describe what you are doing in a way that can be read in plain English without having to puzzle over the code.
- Make code as easily readable as possible through the use of accepted style for variable names, line lengths, indentation, pipes, etc.
- Comments that help the reader understand the code itself, such as an explanation of a clever coding trick, should go inside the code blocks.
- Comments about motivation, goals, or results should go in the text outside the code block.
Each code block should be preceded by one or more paragraphs that explain what is being done, why it is being done, and how to interpret the results.
- Life is a lot simpler if each code block does one task or produces one output figure.
After you show a result, it is often helpful to describe in simple terms why it motivates what to do next.
Comment on every figure, table, or statistical output that you include. You can keep it brief, but be sure you have some discussion.
- Adjust figure sizes, text sizes, and labels on figures to be sure they will be legible.
- It is generally a good idea to add descriptive labels for the x-axis, y-axis, legend, and title, because raw R variable names can be hard to interpret for someone who is only looking at the figure.
- RStudio has a spell checker. Use it!
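The advice above on figure labels and text sizes can be illustrated with a minimal base R sketch. It assumes the built-in `mtcars` data set as a stand-in for your own data, and the label strings and `cex` values are illustrative choices, not recommended settings. In R Markdown, the knitr chunk options `fig.width` and `fig.height` control the size of the figure itself.

```r
# Built-in example data; the column names `wt` and `mpg` are terse on their own.
cars <- mtcars

# Illustrative label text, written for a reader who has never seen the data.
x_label <- "Weight (1000 lbs)"
y_label <- "Fuel economy (miles per gallon)"
plot_title <- "Heavier cars tend to have lower fuel economy"

# cex.lab, cex.axis, and cex.main enlarge the axis titles, tick labels,
# and main title relative to the defaults so they stay legible.
plot(
  cars$wt, cars$mpg,
  xlab = x_label,
  ylab = y_label,
  main = plot_title,
  cex.lab = 1.3,
  cex.axis = 1.1,
  cex.main = 1.2,
  pch = 19
)
```

Compare this with the default axis labels `cars$wt` and `cars$mpg`, which tell a reader nothing about units or meaning.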
Use common-sense approaches to add checks on the correctness of your code as you go along.
- Periodically check dimensions, ranges of values, and anything that would be noticeable if a mistake were made. Show the results in the notebook.
- If you are making a temporary change to a data set, it is good practice to code that change as an explicit step in a pipe that feeds directly into the model or graph. This helps minimize the proliferation of temporary data sets.
- If you are making a permanent change to a data set, save the result under a new name that will help you remember which data set to use in the future.
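The checks and naming habits above might look like the following sketch. It uses only base R, assumes R 4.1 or later for the native pipe `|>`, and again uses the built-in `mtcars` data as a stand-in. The object names `fit_no_4cyl` and `cars_kpl` are hypothetical, chosen only to show descriptive naming.

```r
cars <- mtcars

# Show dimensions and ranges in the notebook so a mistake would be noticeable.
dim(cars)
range(cars$mpg)
table(cars$cyl)

# Stop with an error if a basic assumption about the data is violated.
stopifnot(
  nrow(cars) > 0,
  !anyNA(cars$mpg),
  all(cars$mpg > 0)
)

# Temporary change: drop the 4-cylinder cars inside the pipe and feed the
# result straight into the model, so no temporary data set is left behind.
fit_no_4cyl <- cars |>
  subset(cyl != 4) |>
  with(lm(mpg ~ wt))

# Permanent change: add fuel economy in km per litre and save the result
# under a new, descriptive name; the original `cars` stays untouched.
cars_kpl <- transform(cars, kpl = mpg * 0.425144)

# Confirm the permanent change did what was intended.
stopifnot(
  nrow(cars_kpl) == nrow(cars),
  ncol(cars_kpl) == ncol(cars) + 1
)
```

Because the subset exists only inside the pipe, there is no leftover object that could be mistaken for the full data set later in the notebook.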