How many times have you been asked to be data driven in your decision making? Often this leads to the collection of a lot of data through market research or observational studies with the hope a pattern will emerge to explain what is happening.
In my experience, one of the favorite analytical approaches used by Agencies to make sense of large amounts of data is regression analysis. They drop the data set into a statistical meat grinder and hope sausage comes out the other end. In presentations, you will hear phrases like “high degree of correlation” or “high R squared” used to impart some authority to whatever action the Agency is recommending you take. However, while it may feel comforting to have a statistical basis for making a decision, it can also be highly misleading if you don’t take the time to lift the hood and really understand what is going on.
In any regression analysis, the objective is to infer a relationship between a dependent variable and a set of independent variables. In plain English, you often want to predict the outcome of one dependent variable given the values of another (or several) independent variables. Consider this oversimplified example. Your manufacturing line produces 1 defective product for every 10 made. This situation is costing you money and you want to know how to reduce waste and lower your throwaway cost. Your line manager believes the speed of the line is contributing to the problem. Over the next month, data is collected on acceptable units made (dependent variable) and line speed (independent variable). Your line manager runs the data through a regression analysis and finds a strong correlation between line speed and rejects. You now have the data to help you decide what the optimal line speed is. Seems simple enough.
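A sketch of the kind of analysis the line manager might run, using entirely made-up data (the numbers, the assumed relationship between speed and quality, and the noise level are all illustrative assumptions, not measurements):

```python
# Sketch of the line manager's regression with hypothetical data:
# regress acceptable units produced against line speed.
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical line speeds in units per hour.
line_speed = np.linspace(50, 150, 30)
# Assume quality degrades as speed rises, plus some measurement noise.
acceptable = 100 - 0.4 * line_speed + rng.normal(0, 2, size=30)

# Ordinary least squares fit of a line: slope and intercept.
slope, intercept = np.polyfit(line_speed, acceptable, deg=1)
# Pearson correlation between speed and acceptable output.
r = np.corrcoef(line_speed, acceptable)[0, 1]

print(f"slope = {slope:.2f} acceptable units per unit of speed")
print(f"correlation r = {r:.2f}, R^2 = {r**2:.2f}")
```

A strongly negative slope and a high R squared are what would back up the manager's recommendation to slow the line down.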
But, one of the often overlooked aspects of the example is the strongly suspected (or known) relationship between line speed and manufacturing defects. Your line manager’s experience played a key role in selecting an independent variable that made practical sense. Your line manager knew line speed could have an impact and the data confirmed it did.
This leads to one of the most confusing and overlooked concepts for non-statisticians – the difference between correlation and causation. In the above example, your line manager had a basis for believing there was a connection between line speed and quality. In many situations there is uncertainty as to how the variables you are looking at are related. And that uncertainty is often overlooked, either because of a lack of understanding of the potential shortcomings of regression analysis or simply because of the pressure to present data to justify a recommended action. However, if you are in a position to make decisions, it is imperative you appreciate the difference.
For example, a community was experiencing a measurable increase in the number of children drowning in a local river. This was obviously an unacceptable situation needing to be addressed. Nobody had a clue what was causing this problem. So, the community leaders commissioned a study to analyze the situation. In this analysis, the number of drownings was the dependent variable. The search was on to discover an independent variable that might be helpful in predicting the number of drownings that could be expected. Scads of data were subjected to a regression analysis. The result was ice cream sales and the number of drownings were highly positively correlated. But, of course ice cream sales do not cause drownings. They are correlated because both increase in the summer.
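The ice cream example can be illustrated with a toy simulation (the numbers are invented for illustration, not real data): when two variables are both driven by a third, they correlate with each other, and the correlation disappears once the common driver is controlled for.

```python
# Toy illustration (not real data): ice cream sales and drownings
# are both driven by temperature, so they correlate with each other
# even though neither causes the other.
import numpy as np

rng = np.random.default_rng(seed=7)

temperature = rng.uniform(5, 35, size=200)                 # degrees C
ice_cream   = 20 * temperature + rng.normal(0, 40, 200)    # daily sales
drownings   = 0.3 * temperature + rng.normal(0, 1.5, 200)  # incidents

# The raw correlation looks impressive...
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

# ...but it vanishes once temperature is controlled for:
# correlate the residuals after regressing each variable on temperature.
def residuals(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

r_partial = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))[0, 1]

print(f"raw correlation:     {r_raw:.2f}")
print(f"partial correlation: {r_partial:.2f}")
```

Controlling for a suspected confounder this way (a partial correlation) is one simple check on whether an observed correlation could plausibly be causal.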
I did a little online research to find other examples that drive home the point that correlation doesn’t necessarily mean causation, and that you need to use common sense when looking at the results of regression analyses. Here are a few more examples I found of variables that are strongly correlated but where causation does not follow:
- Foot size and IQ. The fact is, as children get older, their feet grow and they get smarter.
- The number of storks and birth rate in Denmark. Despite the fables, storks do not bring children.
- The number of priests and alcoholism. Clearly priests are not encouraging alcoholism.
- The number of homeless and crime rate. The real independent variable is likely the unemployment rate or drug abuse rate.
- The more firemen sent to a fire the greater degree of damage. Obviously bigger fires require more firemen and result in greater damage.
The key takeaway is that before concluding (or even implying) that a correlation between two variables amounts to causation, you must be aware of the underlying factors. Take the time to understand the practical relationship between the variables. Be convinced there is a practical reason why causation may exist.
I appreciate the examples provided could be looked at as trite or obvious. But often you are faced with a situation where the problem you are studying is not well understood, and it is seductive to rely on correlation to justify causation rather than doing the hard work of understanding how the variables are actually related. Do the hard work and you will make better decisions.