
STOR 765 is a statistical consulting class taught by Professor Steve Marron, and I am assisting with the class. Here is a rough outline of a “project”. Although this information is tailored to the student’s experience in STOR 765, it is also a generally useful outline of how a data scientist would engage with a client.

1. Initial client presentation:

The potential client visits the class and presents their problem. This typically includes background, goals, and a description of the data. Students ask questions to clarify the problem. An extended part of the discussion is often devoted to possible means of analysis. Students are expected to participate in each discussion. Important parts of the discussion focus on choice of software, availability of the data, and desired completion date. Professor Marron determines whether the problem is suitable for a class project. Based on expressions of interest from the students, an assignment is made.

2. Initial meeting with client and student:

Professor Marron meets with the student and the client. The problem is reviewed and an analysis plan is developed. The highlights of the conversation focus on the following:

a. Review of objectives

b. Data availability, access, and organization. Experimental units (subjects or patients) are typically in the rows and measured variables are in the columns. The independent variables (experimental conditions, demographics, etc.) are also referred to as predictor variables. The dependent variables are the responses, which typically represent measurements or outcomes. The meeting establishes how many variables of each type there are and how they are represented in the data.

c. Initial analysis plan. It is a truism in data science that about 80% of the work on any project is getting the data organized and being sure that it is correct. The first client check-in would typically come after transfer of the initial data and a preliminary analysis. This is typically a univariate examination (one variable at a time) to look for distributional properties (Gaussian is good, and transformations to approach Gaussian should be investigated), errors in the data and outliers (it is not always easy to determine which is which), and multimodal distributions that suggest subgroups in the data. In addition to cleaning the data, this is a chance for the student to be sure they are in alignment with the project goals and to verify that they understand how the analysis results align with those goals.

d. Statistical analysis plan. This part of the analysis may involve complex statistical methods, but the ability to visualize the results in a way that is easily understandable is a key to success. There are several general types of analyses that are appropriate depending on the problem. These may include regression analysis, analysis of variance, cluster analysis, and discriminant analysis. Each of these areas is very broad and includes many specific methods. It seems sensible for the student to check in with the client periodically during this phase of the work. This will ensure that there is alignment between the questions asked and the results delivered. It is also good to remind clients that the student’s job is not to find a significant result, but rather to determine whether there is a significant result.
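The univariate examination described in step c can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the class: the function names (`skewness`, `iqr_outliers`, `screen_variable`), the choice of Tukey's 1.5 × IQR fences for flagging points, the log transform as the candidate transformation, and the sample measurements are all hypothetical. A flagged point is not automatically an error; deciding whether it is a mistake or a genuine outlier is a question for the client.

```python
import math
import statistics


def skewness(values):
    """Population skewness (third standardized moment); 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Return the values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def screen_variable(values):
    """Summarize one variable: sample size, skewness, and flagged points."""
    report = {
        "n": len(values),
        "skew": skewness(values),
        "flagged": iqr_outliers(values),
    }
    # A log transform is only defined for strictly positive data.
    if all(v > 0 for v in values):
        report["skew_log"] = skewness([math.log(v) for v in values])
    return report


# Hypothetical measurements for a single variable (made up for illustration).
# The value 250.0 could be a data-entry error or a real extreme observation.
measurements = [1.2, 1.5, 1.9, 2.3, 2.8, 3.1, 3.6, 4.4, 5.9, 250.0]
report = screen_variable(measurements)
print(report)
```

In practice the same screen would be run on every column, and histograms or density plots would accompany the numbers, since multimodality in particular is much easier to see than to summarize.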

3. Student analysis

4. Report and presentation.