# Some Advice for Beginning Statisticians

It is completely natural for the transition from doing homework to solving real-life problems to take considerable effort. In a typical methods class, students aren’t exposed to unstructured problems or to problematic data. In fact, the typical method of instruction pretty much guarantees that you will make some rather predictable mistakes when you transition to unstructured problems. Here are some suggestions to get a faster start at being a successful statistician.

**1. Don’t assume that the data is correct or complete or even appropriate for the question the researcher wants to answer.**

When you do homework, the data is “correct” for the type of analysis being studied. It has almost always been cleaned, edited, or modified so that the focus of the homework is on the statistical methods being taught. This is highly efficient for learning statistical methods. Don’t assume that the real world is like this.

Be aware that your client’s interest level and enthusiasm are going to drop shortly after their presentation. Take advantage of this window to be sure that you have the data, that it is complete, and that you have asked the important questions while they are still interested. I recommend doing a first pass through the data as soon as possible, even though the statistical analysis might happen at a later date.
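A first completeness check can be very small. The sketch below is only an illustration: the records, the column names, and the `audit` helper are all hypothetical, and `None` is assumed to mark a missing value. It counts missing values per column and lists the distinct text labels, which tends to expose inconsistent coding such as `control` versus `Control`.

```python
# Hypothetical records as they might arrive from a client; None marks a missing value.
rows = [
    {"subject": 1, "group": "control", "age": 34, "response": 5.1},
    {"subject": 2, "group": "treated", "age": None, "response": 6.3},
    {"subject": 3, "group": "treated", "age": 290, "response": None},
    {"subject": 4, "group": "Control", "age": 41, "response": 4.8},
]

def audit(rows):
    """Count missing values per column and collect the distinct labels of text columns."""
    columns = sorted({key for row in rows for key in row})
    missing = {col: sum(1 for row in rows if row.get(col) is None) for col in columns}
    labels = {
        col: sorted({row[col] for row in rows if isinstance(row.get(col), str)})
        for col in columns
    }
    # Keep only the columns that actually contain text labels.
    labels = {col: vals for col, vals in labels.items() if vals}
    return missing, labels

missing, labels = audit(rows)
print("missing values per column:", missing)
print("distinct labels:", labels)
```

Anything this turns up (the missing ages, the age of 290, the two spellings of the control group) is a question for the client while they are still engaged.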

**2. The client almost always has an idea of how they want to analyze the data. Don’t assume that they are right.**

The client can almost always be counted on to have the big question correct. Generally, they aren’t going to be trained in how to match statistical methods to the combination of data and question being asked. Be prepared to guide them gently in the correct direction. In order to do that, don’t commit to any specific analysis method until you understand clearly what is going on.

**3. Experimental design matters.**

“Every experiment has a design. Some designs are better than others.” (I attribute this to George Box.) Note that this applies to observational studies, not just planned experiments. How the data is collected determines its usability and correctness. Sadly, you probably have gotten no training in experimental design as part of your coursework. If you are doing general consulting, you will eventually become familiar with factorial and fractional factorial designs, survey designs, observational study designs, etc. In the meantime, take a moment to think about how the data was collected before proceeding.

**4. Beware of categories, classes, groups, discrete variables, etc.**

Many studies will have one or more variables with values that represent categories or groups. Identify these variables as quickly as possible as they are almost always related in a fundamental way to the question that the client is asking. Think carefully about how the presence of such variables is going to affect your analysis method.

It is almost always fundamentally important to make box plots of every response variable against every categorical, class, or grouping variable. Do this before fitting any models. If you see something in one of these plots, it will be encouraging and can guide further analysis. If you don’t see the differences the client expects, you may save yourself a lot of fruitless effort on analysis.
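The numbers behind a box plot are easy to compute even before any plotting tool is set up. The following is a minimal sketch using only the Python standard library; the `(group, response)` pairs and the `five_number_summary` helper are made up for illustration, and the quartile convention (`method="inclusive"`) is one of several in common use.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (group, response) pairs; replace with the client's data.
data = [
    ("A", 4.1), ("A", 5.0), ("A", 5.2), ("A", 6.3), ("A", 4.8),
    ("B", 7.9), ("B", 8.4), ("B", 7.2), ("B", 9.1), ("B", 8.0),
]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Collect the responses belonging to each level of the grouping variable.
by_group = defaultdict(list)
for group, response in data:
    by_group[group].append(response)

summaries = {g: five_number_summary(v) for g, v in by_group.items()}
for group, summary in sorted(summaries.items()):
    print(group, summary)
```

In this made-up data the first quartile of group B sits above the third quartile of group A, which is the kind of separation that would be obvious in the plot itself.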

**5. Sample size matters for more than just power. Beware of imbalances.**

In general, we think about sample size as affecting power. When there are groups or categories, the sample size in each group can be important. In particular, if there is a big imbalance, and one group of interest is very small compared to the rest, then you may need to adjust results, methods, or expectations. This can be especially problematic in classification problems.
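One quick way to see an imbalance is to tabulate the group sizes and compare them. The sketch below uses invented class labels; the variable names are illustrative only. It also computes the accuracy of always predicting the largest class, which is one reason imbalance is troublesome in classification.

```python
from collections import Counter

# Hypothetical class labels for a classification problem.
class_labels = ["healthy"] * 95 + ["diseased"] * 5

counts = Counter(class_labels)
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

# A classifier that always predicts the majority class scores this accuracy
# without learning anything about the group of interest.
majority_baseline = max(shares.values())

print(counts)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

With a split like this one, a reported accuracy in the mid-nineties says very little, so the expectations and the evaluation measure would both need adjusting.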

**6. Look at your data carefully before fitting any models.**

A general bit of wisdom is that about 80% of your effort is going to be cleaning, editing, manipulating, and understanding your data. The remaining 20% will be spent on analysis and report writing. It is impossible to teach this in a class, so expect to learn it from experience. Few approaches are more likely to result in failure than a rush to modeling before understanding the data. This includes examining the data for sample sizes (imbalance among groups), outliers, and distributional properties. Think about the assumptions of the analysis you have planned and verify that the data is appropriate.

Don’t be afraid to ask the client for clarification. Be prepared to revise the analysis plan once you have a deeper understanding of the data.
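As one example of looking before modeling, the sketch below flags values that fall outside Tukey's fences (1.5 times the interquartile range beyond the quartiles) and compares the mean with the median as a rough skewness signal. The measurements are invented, and the 1.5 multiplier is a convention, not a rule. A flagged value is a prompt for a question to the client, not a license to delete it.

```python
from statistics import mean, median, quantiles

# Hypothetical measurements; 48.0 might be a data-entry error or a real extreme.
values = [4.2, 4.8, 5.1, 5.3, 5.6, 5.9, 6.1, 6.4, 48.0]

q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for review, not automatic deletions.
flagged = [v for v in values if v < lower or v > upper]

print(f"mean = {mean(values):.2f}, median = {median(values):.2f}")
print("flagged for review:", flagged)
```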

**7. Use transformations wisely.**

Students often make the mistake of thinking that the response variable must be normally distributed. The correct understanding is that many analysis methods need the residuals to be approximately normally distributed. If the response variable is highly skewed, that does usually suggest a transformation of the response. But otherwise, it’s the residuals that you care about. Note that the response is affected by the predictor variables in many complex ways so the residuals may be fine even though the response is, say, bimodal.
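A small simulation makes the distinction concrete. In the sketch below (all values are invented), a binary predictor shifts the mean by 10 and the errors are standard normal. The response is then strongly bimodal, yet the residuals from a one-mean-per-group fit behave like ordinary normal errors. The `share_near_center` helper is a crude, hypothetical diagnostic: a unimodal, roughly normal sample has about two thirds of its values within one standard deviation of its mean, while a bimodal one has almost none near the middle.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical setup: a binary predictor shifts the mean by 10; errors are N(0, 1).
group = [i % 2 for i in range(1000)]
response = [10 * g + random.gauss(0, 1) for g in group]

# Fit the simplest model (one mean per group) and take residuals.
group_mean = {g: mean(y for y, gg in zip(response, group) if gg == g) for g in (0, 1)}
residuals = [y - group_mean[g] for y, g in zip(response, group)]

def share_near_center(values, width=1.0):
    """Fraction of values within `width` of their mean."""
    center = mean(values)
    return sum(abs(v - center) <= width for v in values) / len(values)

print("share of responses near their mean:", share_near_center(response))
print("share of residuals near their mean:", share_near_center(residuals))
```

Transforming this response to "fix" its shape would be a mistake; the model with the predictor in it already has well-behaved residuals.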

Consider whether or not using a different error model is more appropriate than transforming the response. For example, counts can be fit using models that assume a variety of distributions, some of which will account for unusual features of your data such as too many zeros or more variability than expected. Figuring out the right error model can be an important part of your analysis.
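Two quick descriptive checks can hint at which count model to consider. Under a Poisson model the variance equals the mean and the probability of a zero is $e^{-\text{mean}}$, so a variance-to-mean ratio well above 1, or far more zeros than $e^{-\text{mean}}$ predicts, suggests looking at alternatives such as negative binomial or zero-inflated models. The counts below are invented, and these raw checks ignore predictors, so treat them as a first look only.

```python
from math import exp
from statistics import mean, variance

# Hypothetical counts (for example, events per subject) with many zeros.
counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 12]

m = mean(counts)
dispersion = variance(counts) / m                 # about 1 under a Poisson model
observed_zeros = counts.count(0) / len(counts)    # fraction of zeros in the data
poisson_zeros = exp(-m)                           # P(Y = 0) for a Poisson with mean m

print(f"mean = {m:.2f}, variance/mean = {dispersion:.2f}")
print(f"zeros: observed = {observed_zeros:.2f}, Poisson expects = {poisson_zeros:.2f}")
```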

When you are transforming a predictor variable, your goal is typically to reduce the impact of influential observations. Normality of the predictor variables is seldom an assumption of a method.
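Leverage gives one way to see this effect. For a simple linear regression with an intercept, the leverage of observation $i$ is $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The sketch below, with an invented right-skewed predictor and a hypothetical `leverages` helper, compares the largest leverage before and after a log transformation.

```python
from math import log

def leverages(x):
    """Leverage of each point in a simple linear regression on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical right-skewed predictor with one extreme value.
x = [1, 2, 3, 5, 8, 12, 20, 30, 50, 400]

raw_max = max(leverages(x))
log_max = max(leverages([log(xi) for xi in x]))
print(f"max leverage: raw = {raw_max:.3f}, log-transformed = {log_max:.3f}")
```

On the raw scale the single large value almost entirely determines the fitted slope; on the log scale it is still the most influential point, but far less dominant.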

Remember that any time you transform any variable, response or predictor, you are changing the model. So think and use caution.

**8. Learn about ANOVA and contrasts.**

If you are working with laboratory data, the client wants to know the effect of a planned intervention, or the client wants to compare known groups, you are almost always going to be in some version of ANOVA. Don’t be afraid to use ANOVA, even if it is in the form of a regression model. One of the key aspects of ANOVA is the ability to set contrasts to most efficiently address comparisons.
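To make the idea of a contrast concrete, here is a minimal sketch of a one-way ANOVA followed by one planned contrast, written with the standard library only. The groups, the responses, and the contrast weights are all invented for illustration; in practice a statistics package would do this arithmetic and supply the p-values. The contrast compares the average of the two doses with the control, using weights that sum to zero.

```python
from math import sqrt
from statistics import mean

# Hypothetical responses for a control and two doses.
groups = {
    "control": [10.1, 9.8, 10.4, 10.0],
    "low":     [11.0, 11.4, 10.9, 11.3],
    "high":    [12.9, 13.2, 12.7, 13.1],
}

means = {g: mean(v) for g, v in groups.items()}
n_total = sum(len(v) for v in groups.values())
grand_mean = sum(sum(v) for v in groups.values()) / n_total

# One-way ANOVA: between-group and within-group sums of squares.
ss_between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in groups.items())
ss_within = sum((y - means[g]) ** 2 for g, v in groups.items() for y in v)
df_between, df_within = len(groups) - 1, n_total - len(groups)
mse = ss_within / df_within
f_stat = (ss_between / df_between) / mse

# Planned contrast: average of the two doses minus control (weights sum to zero).
weights = {"control": -1.0, "low": 0.5, "high": 0.5}
estimate = sum(weights[g] * means[g] for g in groups)
std_error = sqrt(mse * sum(weights[g] ** 2 / len(groups[g]) for g in groups))
t_stat = estimate / std_error  # compare with a t distribution on df_within degrees of freedom

print(f"F = {f_stat:.1f} on ({df_between}, {df_within}) df")
print(f"contrast estimate = {estimate:.2f}, SE = {std_error:.2f}, t = {t_stat:.1f}")
```

The overall F test only says that the groups differ somewhere; the contrast answers the specific comparison the client asked about, using the pooled error estimate from all groups.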