
To be a statistician or data scientist is to be a software developer. The current state of our educational system, however, typically provides little software development training to data analysts. The following notes provide some suggestions for students on how to organize their code. This advice applies both to a small stand-alone consulting project and to individual parts of a larger team project.

To make the file organization clearer to others and to indicate the order in which code is to be executed, I suggest the numbering and naming system used in the examples given below.

1. Provide a guide to the analysis.

There should be a file with an obvious name that makes it clear what analysis is being done and in which order files are to be run. This level of organization will help you understand what you are doing. It contributes greatly to reproducible analysis. Unlike class projects or homework assignments, most real-world problems are revisited from time to time. Your real-life clients will either curse or thank you for the clarity of your file organization.

  • 00_analysis_guide.Rmd
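
One way the guide could record the run order is a short chunk that lists the numbered files and renders them in sequence. This is a minimal sketch, assuming the two-digit prefix convention used in these notes. The helpers `ordered_analysis_files()` and `run_analysis()` are hypothetical names, and the render step assumes the rmarkdown package is installed.

```r
# Hypothetical helper: return the numbered .Rmd files in run order.
# Assumes file names start with a two-digit prefix, e.g. 01_make_data.Rmd.
ordered_analysis_files <- function(dir = ".") {
  files <- list.files(dir, pattern = "^[0-9]{2}_.*\\.Rmd$")
  sort(files)
}

# Hypothetical driver: render every step after the guide itself, in order.
# rmarkdown::render() is only called if the package is available.
run_analysis <- function(dir = ".") {
  steps <- setdiff(ordered_analysis_files(dir), "00_analysis_guide.Rmd")
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The rmarkdown package is needed to render the files.")
  }
  for (f in steps) {
    message("Rendering ", f)
    rmarkdown::render(file.path(dir, f))
  }
  invisible(steps)
}
```

Calling `ordered_analysis_files()` on its own is a cheap way to confirm that the file names sort into the order you intend before anything is rendered.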

 

2. Separate out the creation, organization, and validation of the data from the analysis.

Reading the data and organizing it so that it is in a form that supports the intended analysis is fundamental to success. Do this as a separate process from any analysis. Rerun this code periodically to be sure nothing has changed. Provide plenty of visual or programmatic checks on the correctness of the data, such as ranges and the number of observations in each class or group. Save the data after you have thoroughly verified its correctness.

* Be careful to use the correct data. As a rule, don’t modify your main data file on the fly. Make a copy with a name that guides the user.

* Use consistent and informative names for data frames and variables. Your client generally won’t have a good grasp of this, so be prepared to rename variables to make them more compatible with R.

* Your client is probably going to give you an Excel file. Most of the time the data will need to be reorganized so that the objects of the analysis are clear (OODA). Do as little of this work as possible in Excel. Instead, program it in R in a way that is clear and reproducible.

  • 01_make_data.Rmd
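
The programmatic checks described in this step might look something like the following. It is only a sketch: the column names (`group`, `weight`), the allowed range, the group labels, the helper name `check_data()`, and the output file name are all assumptions made for illustration, and the tiny data frame is made up.

```r
# Hypothetical checker: the column names (group, weight), the allowed
# weight range, and the expected group labels are all assumptions.
check_data <- function(dat, expected_groups, weight_range) {
  stopifnot(
    is.data.frame(dat),
    all(c("group", "weight") %in% names(dat)),
    !anyNA(dat$weight),
    all(dat$weight >= weight_range[1] & dat$weight <= weight_range[2]),
    setequal(unique(as.character(dat$group)), expected_groups)
  )
  # Return the number of observations per group for a visual check.
  table(dat$group)
}

# Tiny made-up data frame, for illustration only.
dat <- data.frame(
  group  = c("ctrl", "ctrl", "trt", "trt"),
  weight = c(4.2, 5.1, 4.8, 6.0)
)
counts <- check_data(dat,
                     expected_groups = c("ctrl", "trt"),
                     weight_range = c(0, 10))
print(counts)

# Save the verified data under a name that says what it is.
# tempdir() is used here only so the sketch runs anywhere.
verified_file <- file.path(tempdir(), "study_data_verified.rds")
saveRDS(dat, verified_file)
```

Because `stopifnot()` halts on the first failed check, rerunning this file later will flag a changed source file before any analysis touches it.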

3. Do a visual examination of your data before you start modeling.

This is a good time to assess the need for transformation. Now is when you should be making lots of box plots, histograms, density plots, scatter plots, etc. In many smaller problems, you will be able to foreshadow the results of the statistical analysis in this step.

* Spend some time thinking about the “killer” figure that will lead off your report. This is the figure that will explain the statistical results to the client (and yourself). Sketch it out by hand on paper with the client, if you have the chance. A good conversation might go something like this: “If I understand your research question, then if we graphed the data like this, and your hypothesis is true, then the graph should look something like the following.”

  • 02_EDA.Rmd
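
A first pass at this kind of visual examination could be as simple as the sketch below. It uses base R graphics and the `airquality` data set that ships with R. The choice of variables and the helper name `eda_plots()` are assumptions made for illustration, not part of any particular project.

```r
# Hypothetical helper: a 2 x 2 panel of basic exploratory plots.
# Uses the airquality data set that ships with R; the choice of
# variables is only for illustration.
eda_plots <- function(dat = datasets::airquality) {
  op <- par(mfrow = c(2, 2))
  on.exit(par(op))  # restore the graphics settings on exit

  # Raw versus log scale, to judge whether a transformation is needed.
  hist(dat$Ozone, main = "Ozone", xlab = "Ozone (ppb)")
  hist(log(dat$Ozone), main = "log(Ozone)", xlab = "log Ozone")

  # Distribution within each group, and a relationship of interest.
  boxplot(Ozone ~ Month, data = dat, main = "Ozone by month")
  plot(dat$Temp, dat$Ozone,
       xlab = "Temperature (F)", ylab = "Ozone (ppb)",
       main = "Ozone vs temperature")
  invisible(NULL)
}
```

Calling `eda_plots()` in a chunk of the EDA file draws all four panels together, which makes it easy to compare the raw and log scales side by side.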

4. Don’t let any one file get too big or too complicated.

It is easier to understand and produce correct results if files aren’t too long and if you don’t do too many different things in one file. Use your own judgement.

  • 03_model1.Rmd
  • 04_model2.Rmd
  • 05_summary.Rmd

5. Archive files as appropriate.

If you do something that was interesting but turns out not to have been as productive as you thought, move the file to an archive directory. Do the same if you have tried several different methods and you think you may want to go back to something later.
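
Archiving can itself be scripted so that it is done the same way each time. The following is a minimal sketch; `archive_file()` is a hypothetical helper and the directory name `archive` is an assumption.

```r
# Hypothetical helper: move a file into an archive directory,
# creating the directory if needed. The directory name is an assumption.
archive_file <- function(file, archive_dir = "archive") {
  stopifnot(file.exists(file))
  dir.create(archive_dir, showWarnings = FALSE, recursive = TRUE)
  target <- file.path(archive_dir, basename(file))
  # Refuse to overwrite something that was archived earlier.
  if (file.exists(target)) {
    stop("A file with this name is already archived: ", target)
  }
  ok <- file.rename(file, target)
  if (!ok) stop("Could not move ", file)
  invisible(target)
}
```

For example, `archive_file("04_model2.Rmd")` would move that file into `archive/` under the current working directory, leaving the numbered sequence in the main directory showing only the steps that are still in use.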