‘How to Lie with Statistics’ explains how people use numbers to mislead. Darrell Huff, a journalist rather than a statistician, published the book in 1954. It was quite path-breaking at that time, since statistics was not widely taught in schools. Today, when many schools teach statistics, a lot of readers will already be aware of these deceptions.
This book would make most sense for people who have not been exposed to statistics.
I have tried to enliven this summary with some examples (which are not the same as the ones given in the book).
Read it if you have an hour to kill.
Unfortunately, none of the things mentioned in the book were new to me. But then, I have learnt stats at a post-graduate level. If you have forgotten the basics, this book has some value for you.
Proper treatment will cure a cold in seven days, but left to itself a cold will hang on for a week.
The Sample with the Built-in Bias
- When someone says that this group has a value of X, check for bias in the samples.
- Generally, it is very difficult to go out and check every value, so we use samples. Samples have biases which skew the findings.
- A truly random sample is very hard to create.
- Notorious example: Colleges (mostly the MBA kind) tend to report that their alumni earn great salaries. The problem is threefold: alumni may report an inflated salary because of peer pressure, they may report a lower salary for tax reasons, or only the successful ones may be part of the alumni network in the first place.
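The last of those effects (only successful alumni responding) can be sketched in a few lines. This is a minimal illustration, not data from the book: the salary figures, the `RESPONSE_THRESHOLD` cut-off, and the variable names are all invented assumptions.

```python
from statistics import mean

# Hypothetical annual salaries (in lakhs) for ten alumni; invented for illustration.
all_alumni = [4, 5, 5, 6, 6, 7, 8, 12, 25, 40]

# Assumption: only alumni earning at least this much answer the survey.
RESPONSE_THRESHOLD = 8
respondents = [s for s in all_alumni if s >= RESPONSE_THRESHOLD]

true_mean = mean(all_alumni)       # what the college should report
reported_mean = mean(respondents)  # what the survey actually shows

print(f"Mean of all alumni:  {true_mean}")
print(f"Mean of respondents: {reported_mean}")
```

With these made-up numbers the survey figure is nearly double the real one, even though nobody lied about their salary.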
The Well-chosen Average
- An average can be any of the following:
- Mean – this is the arithmetic average
- Median – this is the middle value, which separates the top 50% from the bottom 50%
- Mode – this is the value which occurs most often
- In a bell curve (heights of human beings for example), the mean = median = mode
- In a skewed curve (salaries of employees for example), the three values can be very different
- Notorious example: Salaries. Typically, companies will advertise an average salary. Unfortunately this figure is not representative of what you will actually get. There could be 100 people getting a low salary while 5 get enormous ones. If the mean is used to report the salary, the few enormous salaries pull it up, and it gives a false indication of what you can actually expect.
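The salary example can be worked through with Python's standard `statistics` module. The head-count (100 and 5) comes from the example above; the two salary amounts are assumed values chosen only to make the skew visible.

```python
from statistics import mean, median, mode

# 100 people on a low salary and 5 on an enormous one.
# The amounts (20,000 and 500,000) are assumptions for illustration.
salaries = [20_000] * 100 + [500_000] * 5

mean_salary = mean(salaries)      # pulled up by the five large salaries
median_salary = median(salaries)  # the middle value of the sorted list
mode_salary = mode(salaries)      # the most common value

print(f"Mean:   {mean_salary:.0f}")
print(f"Median: {median_salary}")
print(f"Mode:   {mode_salary}")
```

Here the mean comes out at more than twice the median and mode, although 100 of the 105 employees earn the lower amount.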
The Little Figures that are Not There
- Samples have to be large enough to be representative of the whole. A result from a very small sample can easily be due to chance, i.e. it is not statistically significant.
- Notorious example: Toothpastes. "8 out of 10 dentists recommend our toothpaste." Of course they will, if you have hand-picked only 10 dentists.
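To see how little a panel of 10 proves, here is a rough sketch using the binomial distribution. It assumes, purely for illustration, that dentists are actually split 50/50 on the toothpaste and that the advertiser quietly surveys 20 independent panels; `prob_at_least` is a helper name made up for this example.

```python
from math import comb

def prob_at_least(k_min: int, n: int, p: float) -> float:
    """Probability of k_min or more successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Assumption: each dentist recommends the toothpaste with probability 0.5.
p_single_panel = prob_at_least(8, 10, 0.5)

# Assumption: 20 separate panels of 10 are surveyed and only the best is published.
PANELS = 20
p_any_panel = 1 - (1 - p_single_panel) ** PANELS

print(f"One panel shows 8+ of 10:        {p_single_panel:.3f}")
print(f"At least one of {PANELS} panels does: {p_any_panel:.3f}")
```

Under these assumptions a single panel rarely produces the headline number, but an advertiser who keeps trying is more likely than not to find one that does.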
Much Ado about Practically Nothing
- Every measurement from a sample has an error.
- Probable error: the margin around the measured value that the true value falls within 50% of the time, and outside 50% of the time.
- Standard error: the same idea as the probable error, but with a roughly 2/3 (inside) and 1/3 (outside) split.
- Notorious example: An IQ test (with a probable error of 3) is administered to Ram & Shyam. Ram scores 98 and Shyam scores 101. Does this mean that Shyam is more intelligent than Ram? (leaving aside the efficacy of IQ in measuring intelligence). No. Ram’s score of 98 means that there is a 50% probability that his true score is between 95 (98-3) and 101 (98+3). Shyam’s score of 101 means that there is a 50% probability that his true score is between 98 (101-3) and 104 (101+3). The two ranges overlap, so a 3-point gap does not show that one is more intelligent than the other.
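The Ram and Shyam comparison can be written out as a small check. The function names are invented for this sketch, and the conversion at the end assumes the errors are normally distributed, in which case the probable error is about 0.6745 times the standard error.

```python
PROBABLE_ERROR = 3  # from the IQ example above

def fifty_percent_band(score: float, probable_error: float = PROBABLE_ERROR):
    """Range that contains the true score with 50% probability."""
    return (score - probable_error, score + probable_error)

def bands_overlap(a, b) -> bool:
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

ram = fifty_percent_band(98)     # (95, 101)
shyam = fifty_percent_band(101)  # (98, 104)
overlap = bands_overlap(ram, shyam)

# Assumption: normally distributed errors, so PE is roughly 0.6745 * SE.
standard_error = PROBABLE_ERROR / 0.6745

print(f"Ram: {ram}, Shyam: {shyam}, overlap: {overlap}")
print(f"Equivalent standard error: {standard_error:.2f}")
```

Because the bands overlap, the test cannot tell the two scores apart with any confidence.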
The Gee-Whiz Graph
- The scales of a graph determine how the data will be interpreted
- Notorious example: assume that over the course of 20 years, the national income went from Rs 2000 to Rs 2500, a rise of 25%. Assume that the y-axis of the graph is the income and the x-axis is the year. If the y-axis starts at or just below 2000, the line climbs from the bottom of the chart to the top, and it will appear as though the income has grown many times over in the 20 years. If the y-axis instead starts at 0, the true picture will come out.
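The effect of the axis origin can be put into numbers by comparing the drawn heights of the first and last data points. `apparent_ratio` is a made-up helper for this sketch, and the baseline of 1900 is an assumed value picked to show a truncated axis.

```python
def apparent_ratio(start: float, end: float, baseline: float) -> float:
    """How many times taller the last point is drawn than the first
    when the y-axis begins at `baseline`."""
    if baseline >= start:
        raise ValueError("baseline must be below the starting value")
    return (end - baseline) / (start - baseline)

START, END = 2000, 2500  # national income from the example above

honest = apparent_ratio(START, END, baseline=0)        # y-axis starts at 0
truncated = apparent_ratio(START, END, baseline=1900)  # assumed truncated axis

print(f"Axis from 0:    last point drawn {honest}x as tall as the first")
print(f"Axis from 1900: last point drawn {truncated}x as tall as the first")
```

The data is identical in both cases; only the starting point of the axis changes how large the growth looks.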
The One-dimensional Picture
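- Be wary of graphs or pictures where a one-dimensional value is drawn as a two- or three-dimensional object to convey an increase or decrease. Doubling the height of a picture also doubles its width, so a value that has merely doubled looks four times as large.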
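The arithmetic behind this trick is simple scaling: if every side of a picture is stretched by the same factor, its visual size grows by that factor raised to the number of dimensions. A minimal sketch, with a helper name invented for illustration:

```python
def visual_growth(value_ratio: float, dimensions: int) -> float:
    """Apparent growth when every side of a picture is scaled by value_ratio."""
    return value_ratio ** dimensions

VALUE_RATIO = 2  # the underlying number has doubled

as_bar = visual_growth(VALUE_RATIO, 1)      # bar chart: height only
as_picture = visual_growth(VALUE_RATIO, 2)  # flat picture: area
as_solid = visual_growth(VALUE_RATIO, 3)    # 3-D object: volume

print(f"Bar: {as_bar}x, picture: {as_picture}x, solid: {as_solid}x")
```

A doubled value drawn as a solid object looks eight times as big, which is why a plain bar is the honest choice for a one-dimensional quantity.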
The Semi-attached Figure
- If you can’t prove what you want to prove, demonstrate something else and pretend that they are the same thing
- Notorious Example: Soap. Our soap can kill 99% of micro-organisms on your body. So what? Are these harmful? What about your competitors?
Post Hoc Rides Again
- Correlation does not imply causation
- Notorious Example: "Drinking cold water makes you get colds." Why do people think this? Because they have seen a few folks drink cold water and get a cold soon afterwards. Here the correlation is between drinking cold water and getting a cold, but the cause is actually the presence of cold viruses.
Key Questions to Ask When Presented with Data
- Look for conscious & unconscious bias
- Is the sample large enough for the result to be statistically significant?
- Is there really a correlation?
- What is missing? (error values etc.)
- Check if the data says one thing & the conclusion says something else
- Does it make sense?
Also published on Medium.