
I view Data Science as a general description of what a statistician does when solving problems for clients. There are many finer points to be made about the differences between Data Science and Statistics. On this website, I do not discuss specific statistical methods. Instead, I focus on the tools, methods, and processes that a team of data scientists would use to work together to solve a complex problem.

This approach is founded on the principle that each data scientist/statistician, whether working alone or as a member of a team, will benefit by applying best practices to their analysis strategy, their coding strategy, and their organization of files, folders and data. For the individual, the use of best practices increases their chance of success and reduces the likelihood of making mistakes. For the team, the use of best practices makes it easier for team members to work together while partitioning out parts of the problem and integrating the parts into a successful whole.

Best practices start with a systematic understanding of how to think about organizing and solving complex data science problems. That understanding provides the context for the use of specific tools such as GitHub, R, Python, and notebooks. A systematic approach to organizing code helps students avoid common “rookie” errors such as fitting complex models before having carefully explored their data. Guidelines for the use of notebooks (such as R Markdown/Rmd or Quarto in RStudio) to develop code and analysis help students make the transition from classroom learning to tackling unstructured, open-ended analyses. Best practices of version control (Git and GitHub) help students think about delivering a work product that is systematically developed in the context of a team effort. These are all essential skills for success as a statistician or data scientist.
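
As a small illustration of the "explore before you model" habit, here is a minimal Python sketch. It is only one possible way to organize such a check, not a prescribed method: the function names `explore` and `ready_to_model`, the 20% missing-data threshold, and the sample rows are all hypothetical and chosen purely for illustration.

```python
import statistics


def explore(rows):
    """Summarize each numeric column: count, missing values, mean, min, max."""
    summary = {}
    for col in rows[0].keys():
        # Keep only the values that are actually present for this column.
        values = [r[col] for r in rows if r[col] is not None]
        summary[col] = {
            "n": len(values),
            "missing": len(rows) - len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


def ready_to_model(summary, max_missing_fraction=0.2):
    """Return True only if every column is below the (assumed) missing-data threshold."""
    for stats in summary.values():
        total = stats["n"] + stats["missing"]
        if stats["missing"] / total > max_missing_fraction:
            return False
    return True


# Hypothetical data, made up for this example.
rows = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
    {"age": 38, "income": 58000},
]

summary = explore(rows)
for col, stats in summary.items():
    print(col, stats)

if ready_to_model(summary):
    print("Data looks reasonable; proceed to modeling.")
else:
    print("Investigate missing values before fitting any model.")
```

Keeping the exploration step in its own function, and making the modeling step depend on its result, is one simple way to build the habit into the structure of the code rather than relying on memory.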