Reinhart, Statistics Done Wrong (Japanese edition: ダメな統計学, 2017)


This book is a translation of Alex Reinhart, Statistics Done Wrong (2015).















Introduction はじめに

Ch.01 An introduction to statistical significance 統計有意性入門

Ch.02 Statistical power and underpowered statistics

Ch.03 Pseudoreplication: choose your data wisely

Ch.04 The p value and the base rate fallacy p値と基礎比率誤謬

Ch.05 Bad judges of significance 有意性の間違った判断

Ch.06 Double-dipping in the data データの二度漬け

Ch.07 Continuity errors 連続性誤差

Ch.08 Model Abuse モデルの誤用

Ch.09 Researcher freedom: Good vibration? 研究者の自由:

Ch.10 Everybody makes mistakes 誰もがミスをします

Ch.11 Hiding the data データの隠蔽

Ch.12 What can be done? 何ができるか?


Introduction はじめに

Ch.01 An introduction to data analysis データ分析入門

Ch.02 Statistical power and underpowered statistics

Ch.03 Pseudoreplication: choose your data wisely

Ch.04 The p value and the base rate fallacy p値と基礎比率誤謬

Ch.05 When differences in significance aren't significant differences

Ch.06 Stopping rules and regression to the mean 停止規則と平均への回帰

Ch.07 Researcher freedom: Good vibration? 研究者の自由:

Ch.08 Everybody makes mistakes 誰もがミスをします

Ch.09 Hiding the data データの隠蔽

Ch.10 What have we wrought? 何をしてきたか?

Ch.11 What can be done? 何ができるか?


Introduction はじめに

●In the final chapter of his famous book How to Lie with Statistics,

Darrell Huff tells us that “anything smacking of the medical profession” or published by scientific laboratories and universities is worthy of our trust – not unconditional trust, but certainly more trust than we’d afford the media or shifty politicians.

After all, Huff filled an entire book with the misleading statistical trickery used in politics and the media,

but few people complain about statistics done by trained professional scientists.

Scientists seek understanding, not ammunition to use against political opponents.

●Statistical data analysis is fundamental to science.

Open a random page in your favorite medical journal and you’ll be deluged with statistics:

t tests, p values, proportional hazards models, risk ratios, logistic regressions, least-squares fits, and confidence intervals.

Statisticians have provided scientists with tools of enormous power to find order and meaning in the most complex of datasets,

and scientists have embraced them with glee.

●They have not, however, embraced statistics education,

and many undergraduate programs in the sciences require no statistical training whatsoever.

●Since the 1980s, researchers have described numerous statistical fallacies and misconceptions in the popular peer-reviewed scientific literature,

and have found that many scientific papers – perhaps more than half – fall prey to these errors.
そして、見出しました|多くの科学論文 - 恐らく半分以上 - は、これらの誤りの犠牲になっていることを|。

Inadequate statistical power renders many studies incapable of finding what they’re looking for;

multiple comparisons and misinterpreted p values cause numerous false positives;

flexible data analysis makes it easy to find a correlation where none exists.

The problem isn’t fraud but poor statistical education – poor enough that some scientists conclude that most published research findings are probably false.
問題は、詐欺ではなく、貧弱な統計教育なのです - 余りに貧弱なので、何人かの科学者は、結論します|公表された科学研究の発見の多くは、多分、誤りだろうと|。

●What follows is a list of the more egregious statistical fallacies regularly committed in the name of science.

It assumes no knowledge of statistical methods,  統計の手法への知識は必要としません

since many scientists receive no formal statistical training.

And be warned: once you learn the fallacies, you will see them everywhere.

Don’t be alarmed.  驚かないでください。

This isn’t an excuse to reject all modern science and return to bloodletting and leeches

– it’s a call to improve the science we rely on.


Ch.01 An introduction to data analysis データ分析入門

●Much of experimental science comes down to measuring changes.

Does one medicine work better than another?

Do cells with one version of a gene synthesize more of an enzyme than cells with another version?

Does one kind of signal processing algorithm detect pulsars better than another?

Is one catalyst more effective at speeding a chemical reaction than another?

●Much of statistics, then, comes down to making judgments about these kinds of differences.

We talk about “statistically significant differences”

because statisticians have devised ways of telling if the difference between two measurements is really big enough to ascribe to anything but chance.

●Suppose you’re testing cold medicines.  あなたが、風邪薬を試験しているとしよう。

Your new medicine promises to cut the duration of cold symptoms by a day.

To prove this, you find twenty patients with colds and give half of them your new medicine and half a placebo.

Then you track the length of their colds and find out what the average cold length was with and without the medicine.

●But not all colds are identical.  しかし、全ての風邪は、同一ではない。

Perhaps the average cold lasts a week,  多分、平均の風邪は、1週間続きます

but some last only a few days, and others drag on for two weeks or more, straining the household Kleenex supply.

It’s possible that the group of ten patients receiving genuine medicine will be the unlucky types to get two-week colds, and so you’ll falsely conclude that the medicine makes things worse.

How can you tell if you’ve proven your medicine works, rather than just proving that some patients are unlucky?

The power of p values p値の偉力

●Statistics provides the answer.  統計学は、答えを与えます。

If we know the distribution of typical cold cases – roughly how many patients tend to have short colds, or long colds, or average colds –
もし、解っていれば|典型的な風邪の症状の分布 - おおまかにどれだけの患者が短い風邪、長い風邪、平均の風邪をひく傾向にあるか - が|、

we can tell how likely it is for a random sample of cold patients to have cold lengths all shorter than average, or longer than average, or exactly average.

By performing a statistical test, we can answer the question “If my medication were completely ineffective, what are the chances I’d see data like what I saw?”

●That’s a bit tricky, so read it again.  ここはいささかトリッキーなので、もう一度読んでください。

●Intuitively, we can see how this might work.  直観的には、これがいかに働くか理解出来ます。

If I only test the medication on one person, it’s unsurprising if he has a shorter cold than average – about half of patients have colds shorter than average.

If I test the medication on ten million patients, it’s pretty damn unlikely that all of them will have shorter colds than average, unless my medication works.

●The common statistical tests used by scientists produce a number called the p value that quantifies this.

Here’s how it’s defined: いかにそれが定義されるかが、これです:

●The p value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.

So if I give my medication to 100 patients and find that their colds are a day shorter on average,

the p value of this result is the chance that, if my medication didn’t do anything at all, my 100 patients would randomly have, on average, day-or-more-shorter colds.

Obviously, the p value depends on the size of the effect – colds shorter by four days are less likely than colds shorter by one day – and the number of patients I test the medication on.
明らかに、p値は、依存します|効果の大きさ - 4日短い風邪は1日短い風邪よりもよりありそうでない - と、薬剤試験を行った患者の数に|。
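To make the definition concrete, here is a minimal sketch of how such a p value could be computed. It is not the book's own calculation: it assumes cold lengths vary with a standard deviation of 3 days (an invented figure) and that the average of many patients is roughly normally distributed. The function name and its parameters are hypothetical.

```python
from statistics import NormalDist

def p_value_shorter_colds(n_patients, observed_reduction_days, sd_days=3.0):
    """One-sided p value: the chance that n_patients on a useless drug would,
    by luck alone, average colds at least this many days shorter than usual.

    sd_days is an assumed spread of cold lengths, not a figure from the text.
    """
    # The average of n patients varies less than a single patient does.
    standard_error = sd_days / n_patients ** 0.5
    # Under the null hypothesis the average reduction is centered on zero.
    null_distribution = NormalDist(mu=0.0, sigma=standard_error)
    return 1 - null_distribution.cdf(observed_reduction_days)

# The same one-day improvement is far more surprising with more patients,
# and a bigger improvement is more surprising than a smaller one.
for n, reduction in [(10, 1), (100, 1), (100, 4)]:
    print(n, reduction, p_value_shorter_colds(n, reduction))
```

Under these assumptions, a one-day improvement in ten patients is unremarkable, while the same improvement in a hundred patients would be very surprising if the medication did nothing.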

●That’s a tricky concept to wrap your head around.

A p value is not a measure of how right you are, or how significant the difference is;

it’s a measure of how surprised you should be if there is no actual difference between the groups, but you got data suggesting there is.

A bigger difference, or one backed up by more data, suggests more surprise and a smaller p value.

●It’s not easy to translate that into an answer to the question “is there really a difference?”

Most scientists use a simple rule of thumb:  多くの科学者は、簡単なルールを使います:

if p is less than 0.05, there’s only a 5% chance of obtaining this data unless the medication really works,

so we will call the difference between medication and placebo “significant.”

If p is larger, we’ll call the difference insignificant. もしp値がより大きいと、私達は、違いは有意ではないと呼びます。

●But there are limitations.  しかし、限界があります。

The p value is a measure of surprise, not a measure of the size of the effect.

I can get a tiny p value by either measuring a huge effect – “this medicine makes people live four times longer” – or by measuring a tiny effect with great certainty.
私は、得ることができます|小さいp値を|巨大な効果 - 「この薬剤は4倍長生きさせます」 - を測定するか、または、確実性の高い小さな効果を測定することにより|。

Statistical significance does not mean your result has any practical significance.

●Similarly, statistical insignificance is hard to interpret.

I could have a perfectly good medicine,  私が完全に良い薬を持っているとして、

but if I test it on ten people, I’d be hard-pressed to tell the difference between a real improvement in the patients and plain good luck.

Alternately, I might test it on thousands of people,

but the medication only shortens colds by three minutes, and so I’m simply incapable of detecting the difference.

A statistically insignificant difference does not mean there is no difference at all.

●There’s no mathematical tool to tell you if your hypothesis is true;

you can only see whether it is consistent with the data,

and if the data is sparse or unclear, your conclusions are uncertain.

●But we can’t let that stop us. しかし、それで止めるわけにはいかない。


Ch.02 Statistical power and underpowered statistics

●We’ve seen that it’s possible to miss a real effect simply by not taking enough data.

In most cases, this is a problem:  殆どの場合、これが問題です。

we might miss a viable medicine or fail to notice an important side-effect.

How do we know how much data to collect?

●Statisticians provide the answer in the form of “statistical power.”

The power of a study is the likelihood that it will distinguish an effect of a certain size from pure luck.

A study might easily detect a huge benefit from a medication,

but detecting a subtle difference is much less likely.

Let’s try a simple example. 簡単な例を見てみよう。

●Suppose a gambler is convinced that an opponent has an unfair coin:

rather than getting heads half the time and tails half the time, the proportion is different,

and the opponent is using this to cheat at incredibly boring coin-flipping games.

How to prove it? いかにして証明するか?

●You can’t just flip the coin a hundred times and count the heads.

Even with a perfectly fair coin, you don’t always get fifty heads:

Figure: number of heads vs. probability

●This shows the likelihood of getting different numbers of heads, if you flip a coin a hundred times.

●You can see that 50 heads is the most likely option,

but it’s also reasonably likely to get 45 or 57.

So if you get 57 heads, the coin might be rigged, but you might just be lucky.
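The probabilities described above follow the binomial distribution, so they can be checked directly. This is a small sketch; the function name is my own.

```python
from math import comb

def prob_heads(k, n=100, p=0.5):
    """Probability of exactly k heads in n flips of a coin that lands heads with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fifty heads is the single most likely outcome, yet it is far from certain,
# and 45 or 57 heads are not rare with a fair coin.
for k in (45, 50, 57):
    print(k, round(prob_heads(k), 4))
```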

●Let’s work out the math.  数学を解いてみよう。

Let’s say we look for a p value of 0.05 or less, as scientists typically do.

That is, if I count up the number of heads after 10 or 100 trials and find a deviation from what I’d expect – half heads, half tails –
つまり、もし私が、表の数を数えて|10回か100回試行した後|、ずれを求めて|予想する - 半分表、半分裏 - からの|、

I call the coin unfair if there’s only a 5% chance of getting a deviation that size or larger with a fair coin.

Otherwise, I can conclude nothing:  そうでなければ、私は、何も結論できない。

the coin may be fair, or it may be only a little unfair.

I can’t tell. 私には、わかりません。
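The rule just described can be sketched as an exact two-sided binomial test. This is one reasonable way to formalize "a deviation that size or larger"; the function names are hypothetical and the text does not say which test was used.

```python
from math import comb

def coin_p_value(heads, flips):
    """Chance that a fair coin deviates from half heads by at least as much as observed."""
    expected = flips / 2
    deviation = abs(heads - expected)
    # Count every outcome at least as far from the expected count, in either direction.
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - expected) >= deviation)
    return extreme / 2**flips

def looks_unfair(heads, flips, alpha=0.05):
    """Call the coin unfair only when the deviation would be rare for a fair coin."""
    return coin_p_value(heads, flips) <= alpha

print(coin_p_value(57, 100), looks_unfair(57, 100))
print(coin_p_value(65, 100), looks_unfair(65, 100))
```

Note that a result of `False` does not mean the coin is fair, only that this many flips cannot tell.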

●So, what happens if I flip a coin ten times and apply these criteria?

Figure: true probability of heads vs. statistical power

●This is called a power curve.  これは、統計力曲線と呼ばれます。

Along the horizontal axis, we have the different possibilities for the coin’s true probability of getting heads, corresponding to different levels of unfairness.

On the vertical axis is the probability that I will conclude the coin is rigged after ten tosses, based on the p value of the result.

●You can see that if the coin is rigged to give heads 60% of the time, and I flip the coin 10 times, I only have a 20% chance of concluding that it’s rigged.

There’s just too little data to separate rigging from random variation.

The coin would have to be incredibly biased for me to always notice.

●But what if I flip the coin 100 times? Or 1,000 times? しかし、コインを100回なげるとどうなる?

●With one thousand flips, I can easily tell if the coin is rigged to give heads 60% of the time.

It’s just overwhelmingly unlikely that I could flip a fair coin 1,000 times and get more than 600 heads.
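A power curve like the ones described can be approximated with a short calculation. The sketch below assumes an exact two-sided binomial test at p ≤ 0.05. The exact numbers depend on how the test is constructed, so they need not match values read off the figures. Function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when heads has probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def power(n_flips, true_p, alpha=0.05):
    """Probability of declaring the coin rigged when it really lands heads with probability true_p."""
    fair = [binom_pmf(k, n_flips, 0.5) for k in range(n_flips + 1)]
    half = n_flips / 2
    total = 0.0
    for k in range(n_flips + 1):
        deviation = abs(k - half)
        # p value of seeing k heads: fair-coin chance of a deviation this large or larger.
        p_value = sum(q for j, q in enumerate(fair) if abs(j - half) >= deviation)
        if p_value <= alpha:
            # This outcome would lead us to call the coin rigged.
            total += binom_pmf(k, n_flips, true_p)
    return total

for n in (10, 100, 1000):
    print(n, round(power(n, 0.6), 3))
```

Evaluating `power` over a range of `true_p` values for a fixed number of flips traces out one curve.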

The power of being underpowered 偉力低下された統計力

●After hearing all this, you might think calculations of statistical power are essential to medical trials.

A scientist might want to know how many patients are needed to test if a new medication improves survival by more than 10%,

and a quick calculation of statistical power would provide the answer.

Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of concluding there’s a real effect.

●However, few scientists ever perform this calculation, and few journal articles ever mention the statistical power of their tests.

●Consider a trial testing two different treatments for the same condition.

You might want to know which medicine is safer, but unfortunately, side effects are rare.

You can test each medicine on a hundred patients,  100人の患者に薬品の試験を行いますが、

but only a few in each group suffer serious side effects. ほんの少しにだけ重大な副作用があります。

●Obviously, you won’t have terribly much data to compare side effect rates.

If four people have serious side effects in one group, and three in the other, you can’t tell if that’s the medication’s fault.
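To see how little four-versus-three tells us, one option is Fisher's exact test, sketched below. The text does not name a test, so this choice is my assumption, as are the function name and the second example's counts.

```python
from math import comb

def fisher_exact_two_sided(events_a, events_b, n_a, n_b):
    """Two-sided Fisher exact p value for events_a of n_a patients vs. events_b of n_b patients."""
    total_events = events_a + events_b
    denominator = comb(n_a + n_b, total_events)

    def prob(k):
        # Chance that exactly k of the events land in group A if the groups do not differ.
        return comb(n_a, k) * comb(n_b, total_events - k) / denominator

    observed = prob(events_a)
    lowest = max(0, total_events - n_b)
    highest = min(n_a, total_events)
    # Sum every split that is no more likely than the one observed
    # (the small tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= observed * (1 + 1e-9))

print(fisher_exact_two_sided(4, 3, 100, 100))   # the example from the text
print(fisher_exact_two_sided(20, 3, 100, 100))  # a hypothetical, much larger gap
```

A p value near 1 for the 4-versus-3 split says only that the data cannot separate the two drugs, not that they are equally safe.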

●Unfortunately, many trials conclude with “There was no statistically significant difference in adverse effects between groups” without noting that there was insufficient data to detect any but the largest differences.

And so doctors erroneously think the medications are equally safe, when one could well be much more dangerous than the other.

●You might think this is only a problem when the medication only has a weak effect.

But no:  しかし、違います。

in one sample of studies published between 1975 and 1990 in prestigious medical journals,

27% of randomized controlled trials gave negative results,

but 64% of these didn’t collect enough data to detect a 50% difference in primary outcome between treatment groups.

Fifty percent!  50%です。

Even if one medication decreases symptoms by 50% more than the other medication, there’s insufficient data to conclude it’s more effective.

And 84% of the negative trials didn’t have the power to detect a 25% difference.

●In neuroscience the problem is even worse.  神経科学では、問題は、さらに悪い。

Suppose we aggregate the data collected by numerous neuroscience papers investigating one particular effect and arrive at a strong estimate of the effect’s size.

The median study has only a 20% chance of being able to detect that effect.

Only after many studies were aggregated could the effect be discerned.

Similar problems arise in neuroscience studies using animal models – which raises a significant ethical concern.

If each individual study is underpowered, the true effect will only likely be discovered after many studies using many animals have been completed and analyzed, using far more animal subjects than if the study had been done properly the first time.

●That’s not to say scientists are lying when they state they detected no significant difference between groups.

You’re just misleading yourself when you assume this means there is no real difference.

There may be a difference, but the study was too small to notice it.

●Let’s consider an example we see every day. 毎日見る例を考えてみましょう。

The wrong turn on red 赤信号での誤った右折

●In the 1970s, many parts of the United States began to allow drivers to turn right at a red light.

For many years prior, road designers and civil engineers argued that allowing right turns on a red light would be a safety hazard, causing many additional crashes and pedestrian deaths.

But the 1973 oil crisis and its fallout spurred politicians to consider allowing right turn on red to save fuel wasted by commuters waiting at red lights.

●Several studies were conducted to consider the safety impact of the change.

For example, a consultant for the Virginia Department of Highways and Transportation conducted a before-and-after study of twenty intersections which began to allow right turns on red.

Before the change there were 308 accidents at the intersections;

after, there were 337 in a similar length of time.

However, this difference was not statistically significant,

and so the consultant concluded there was no safety impact.
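The text does not say which test the consultant used. As a rough check, one can ask how often 645 accidents would split at least this unevenly between two equally long periods if the change had no effect. This is a sketch under that assumption, and the function name is my own.

```python
from math import comb

def two_sided_binomial_p(count, total):
    """Chance of a split at least this uneven when each event is equally likely to fall in either period."""
    expected = total / 2
    deviation = abs(count - expected)
    extreme = sum(comb(total, k) for k in range(total + 1)
                  if abs(k - expected) >= deviation)
    return extreme / 2**total

before, after = 308, 337
p = two_sided_binomial_p(after, before + after)
print(round(p, 2))
```

The p value comes out well above 0.05, matching the consultant's "not significant", yet the data are just as compatible with a real increase in crashes of roughly 9%.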

●Several subsequent studies had similar findings:

small increases in the number of crashes, but not enough data to conclude these increases were significant.

As one report concluded, ある報告は、こう結論しています

There is no reason to suspect that pedestrian accidents involving RT operations (right turns) have increased after the adoption of [right turn on red]…

●Based on this data, more cities and states began to allow right turns at red lights.

The problem, of course, is that these studies were underpowered.

More pedestrians were being run over and more cars were involved in collisions,

but nobody collected enough data to show this conclusively

until several years later, when studies arrived clearly showing the results:

significant increases in collisions and pedestrian accidents (sometimes up to 100% increases).
衝突と歩行者事故の顕著な増加 (時には100%に至る)

The misinterpretation of underpowered studies cost lives.


Ch.03 Pseudoreplication: choose your data wisely  擬似反復:データを賢く選べ

●Many studies strive to collect more data through replication:

by repeating their measurements with additional patients or samples,

they can be more certain of their numbers  彼らは、数値により確信が持て、

and discover subtle relationships that aren’t obvious at first glance.

We’ve seen the value of additional data for improving statistical power and detecting small differences.

But what exactly counts as a replication?

●Let’s return to a medical example.  医学の例にもどりましょう。

I have two groups of 100 patients taking different medications,

and I seek to establish which medication lowers blood pressure more.

I have each group take the medication for a month to allow it to take effect,

and then I follow each group for ten days, each day testing their blood pressure.

I now have ten data points per patient and 1,000 data points per group.

●Brilliant! 1,000 data points is quite a lot,  素晴らしい。1000データ点は、実に多い。

and I can fairly easily establish whether one group has lower blood pressure than the other.

When I do calculations for statistical significance I find significant results very easily.

●But wait:  しかし、待ってください。

we expect that taking a patient’s blood pressure ten times will yield ten very similar results.

If one patient is genetically predisposed to low blood pressure, I have counted his genetics ten times.

Had I collected data from 1,000 independent patients instead of repeatedly testing 100,

I would be more confident that differences between groups came from the medicines and not from genetics and luck.

I claimed a large sample size, giving me statistically significant results and high statistical power,

but my claim is unjustified. しかし、私の要求は、正当化されないのです。

●This problem is known as pseudoreplication, and it is quite common.

After testing cells from a culture, a biologist might “replicate” his results by testing more cells from the same culture.

Neuroscientists will test multiple neurons from the same animal, incorrectly claiming they have a large sample size because they tested hundreds of neurons from just two rats.

●In statistical terms, pseudoreplication occurs when individual observations are heavily dependent on each other.

Your measurement of a patient’s blood pressure will be highly related to his blood pressure yesterday,

and your measurement of soil composition here will be highly correlated with your measurement five feet away.

There are several ways to account for this dependence while performing your statistical analysis:

●1. Average the dependent data points.  依存しているデータ点の平均をとる

For example, average all the blood pressure measurements taken from a single person.

This isn’t perfect, though;  しかし、これは、完璧ではありません、

if you measured some patients more frequently than others, this won’t be reflected in the averaged number.

You want a method that somehow counts measurements as more reliable as more are taken.

●2. Analyze each dependent data point separately.  依存しているデータ点を別個に分析する

You could perform an analysis of every patient’s blood pressure on day 5, giving you only one data point per person.

But be careful, because if you do this for every day, you’ll have problems with multiple comparisons, which we will discuss in the next chapter.

●3. Use a statistical model which accounts for the dependence, like a hierarchical model or random effects model.

●It’s important to consider each approach before analyzing your data, as each method is suited to different situations.

Pseudoreplication makes it easy to achieve significance, even though it gives you little additional information on the test subjects.

Researchers must be careful not to artificially inflate their sample sizes when they retest samples.
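A small simulation can show how badly pseudoreplication inflates significance. In this sketch both groups take an equally useless medication, so every "significant" result is a false positive. The blood pressure figures (mean 120, a spread of 10 between patients and 3 from day to day) are invented for illustration, and the z test is a simple stand-in for whatever test a real analysis would use.

```python
import random
from statistics import NormalDist

def mean_and_variance(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)

def z_test_p_value(xs, ys):
    """Two-sided p value from a large-sample z test on the difference of two means."""
    mean_x, var_x = mean_and_variance(xs)
    mean_y, var_y = mean_and_variance(ys)
    standard_error = (var_x / len(xs) + var_y / len(ys)) ** 0.5
    z = abs(mean_x - mean_y) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

def simulate_group(rng, n_patients=100, n_days=10):
    """Each patient has a personal baseline; daily readings cluster around it."""
    group = []
    for _ in range(n_patients):
        baseline = rng.gauss(120, 10)  # assumed between-patient spread
        group.append([rng.gauss(baseline, 3) for _ in range(n_days)])  # assumed daily noise
    return group

def false_positive_rates(n_trials=200, seed=1):
    rng = random.Random(seed)
    pooled_hits = averaged_hits = 0
    for _ in range(n_trials):
        a, b = simulate_group(rng), simulate_group(rng)  # no real difference between groups
        # Pseudoreplicated analysis: treat all 1,000 readings per group as independent.
        pooled_a = [r for patient in a for r in patient]
        pooled_b = [r for patient in b for r in patient]
        if z_test_p_value(pooled_a, pooled_b) < 0.05:
            pooled_hits += 1
        # Approach 1 from the list above: one averaged number per patient.
        means_a = [sum(patient) / len(patient) for patient in a]
        means_b = [sum(patient) / len(patient) for patient in b]
        if z_test_p_value(means_a, means_b) < 0.05:
            averaged_hits += 1
    return pooled_hits / n_trials, averaged_hits / n_trials

pooled_rate, averaged_rate = false_positive_rates()
print(f"pooled readings:  {pooled_rate:.0%} false positives")
print(f"patient averages: {averaged_rate:.0%} false positives")
```

Under these assumptions the pooled analysis should declare a difference far more often than the nominal 5%, while the per-patient averages should stay close to it.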


Ch.04 The p value and the base rate fallacy p値と基礎比率誤謬

●You’ve already seen that p values are hard to interpret.

Getting a statistically insignificant result doesn’t mean there’s no difference.

What about getting a significant result? では、有意な結果をえることはどうでしょうか?

●Let’s try an example.  一つ例を見て見ましょう。

Suppose I am testing a hundred potential cancer medications.

Only ten of these drugs actually work, but I don’t know which;

I must perform experiments to find them.  

In these experiments, I’ll look for p<0.05 gains over a placebo, demonstrating that the drug has a significant benefit.
これらの試験において、私は、探します|利得 p<0.05 を|プラシーボ(偽薬)を越える|、その薬が有意な利点があることを示すために。

●To illustrate, each square in this grid represents one drug.

The blue squares are the drugs that work: 青い四角は、効果がある薬です。

●As we saw, most trials can’t perfectly detect every good medication.

We’ll assume my tests have a statistical power of 0.8.

Of the ten good drugs, I will correctly detect around eight of them, shown in purple:

●Of the ninety ineffectual drugs, I will conclude that about 5 have significant effects.

Why?  何故か?

Remember that p values are calculated under the assumption of no effect,

so p=0.05 means a 5% chance of falsely concluding that an ineffectual drug works.
p=0.05 は、意味します|5%の確率を|効果の無い薬が働くと誤って結論する|。

●So I perform my experiments and conclude there are 13 working drugs:

8 good drugs and 5 I’ve included erroneously, shown in red:

●The chance of any given “working” drug being truly effectual is only 62%.

If I were to randomly select a drug out of the lot of 100, run it through my tests, and discover a p<0.05 statistically significant benefit,
もし、私が、100個のロット(一山)から薬をランダムに選び、私の検定法にかけ、p<0.05 という統計的に有意な利点を発見したとしても、

there is only a 62% chance that the drug is actually effective.

In statistical terms, my false discovery rate – the fraction of statistically significant results which are really false positives – is 38%.
統計学の用語では、偽発見率 - 実際は偽陽性であって統計的に有意な結果が出る割合 - は、38%です。

●Because the base rate of effective cancer drugs is so low – only 10% of our hundred trial drugs actually work – most of the tested drugs do not work,

and we have many opportunities for false positives.

If I had the bad fortune of possessing a truckload of completely ineffective medicines, giving a base rate of 0%, there is a 0% chance that any statistically significant result is true.

Nevertheless, I will get a p<0.05 result for 5% of the drugs in the truck.
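The arithmetic of the drug example can be written as a short function. This sketch works with expected counts, so it uses 4.5 false positives rather than the rounded 5 in the text, and gives 36% where the rounded version gives 5/13, or about 38%. The function name is my own.

```python
def false_discovery_rate(n_hypotheses, base_rate, power, alpha=0.05):
    """Expected fraction of statistically significant results that are false positives."""
    real_effects = n_hypotheses * base_rate
    no_effects = n_hypotheses - real_effects
    true_positives = real_effects * power   # real effects we manage to detect
    false_positives = no_effects * alpha    # useless drugs that pass by luck
    return false_positives / (true_positives + false_positives)

# 100 drugs, 10 of which work, tested with power 0.8 at p < 0.05.
print(false_discovery_rate(100, base_rate=0.10, power=0.8))
# The truckload of useless medicines: every significant result is false.
print(false_discovery_rate(100, base_rate=0.0, power=0.8))
```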

●You often hear people quoting p values as a sign that error is unlikely.

“There’s only a 1 in 10,000 chance this result arose as a statistical fluke,” they say, because they got p=0.0001.

No! This ignores the base rate, and is called the base rate fallacy.

Remember how p values are defined:

The p value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.

●A p value is calculated under the assumption that the medication does not work and tells us the probability of obtaining the data we did, or data more extreme than it.

It does not tell us the chance the medication is effective.

●When someone uses their p values to say they’re probably right, remember this.

Their study’s probability of error is almost certainly much higher.

In fields where most tested hypotheses are false, like early drug trials (most early drugs don’t make it through trials), it’s likely that most “statistically significant” results with p<0.05 are actually flukes.

●One good example is medical diagnostic tests.


The base rate fallacy in medical testing

●There has been some controversy over the use of mammograms in screening breast cancer. Some argue that the dangers of false positive results, such as unnecessary biopsies, surgery and chemotherapy, outweigh the benefits of early cancer detection. This is a statistical question. Let’s evaluate it.

●Suppose 0.8% of women who get mammograms have breast cancer. In 90% of women with breast cancer, the mammogram will correctly detect it. (That’s the statistical power of the test. This is an estimate, since it’s hard to tell how many cancers are missed if we don’t know they’re there.) However, among women with no breast cancer at all, about 7% will get a positive reading on the mammogram, leading to further tests and biopsies and so on. If you get a positive mammogram result, what are the chances you have breast cancer?

●Ignoring the chance that you, the reader, are male,[1] the answer is 9%.[35]

●Despite the test only giving false positives for 7% of cancer-free women, analogous to testing for p<0.07, 91% of positive tests are false positives.

●How did I calculate this? It’s the same method as the cancer drug example. Imagine 1,000 randomly selected women who choose to get mammograms. Eight of them (0.8%) have breast cancer. The mammogram correctly detects 90% of breast cancer cases, so about seven of the eight women will have their cancer discovered. However, there are 992 women without breast cancer, and 7% will get a false positive reading on their mammograms, giving us 70 women incorrectly told they have cancer.

In total, we have 77 women with positive mammograms, 7 of whom actually have breast cancer. Only 9% of women with positive mammograms have breast cancer.
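The same count can be done without rounding to whole women, which is simply Bayes' theorem. This is a minimal sketch using the figures from the text; the function and parameter names are my own.

```python
def chance_of_cancer_given_positive(prevalence, sensitivity, false_positive_rate):
    """Fraction of positive mammograms that belong to women who really have cancer."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

answer = chance_of_cancer_given_positive(
    prevalence=0.008,          # 0.8% of screened women have breast cancer
    sensitivity=0.90,          # 90% of cancers are detected
    false_positive_rate=0.07,  # 7% of cancer-free women test positive
)
print(f"{answer:.1%}")
```

Raising `prevalence` in the call shows how strongly the answer depends on the base rate rather than on the quality of the test alone.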

●If you administer questions like this one to statistics students and scientific methodology instructors, more than a third fail.[35] If you ask doctors, two thirds fail.[10] They erroneously conclude that a p<0.05 result implies a 95% chance that the result is true – but as you can see in these examples, the likelihood of a positive result being true depends on what proportion of hypotheses tested are true. And we are very fortunate that only a small proportion of women have breast cancer at any given time.

●Examine introductory statistical textbooks and you will often find the same error. P values are counterintuitive, and the base rate fallacy is everywhere.


Taking up arms against the base rate fallacy  

●You don’t have to be performing advanced cancer research or early cancer screenings to run into the base rate fallacy. What if you’re doing social research? You’d like to survey Americans to find out how often they use guns in self-defense. Gun control arguments, after all, center on the right to self-defense, so it’s important to determine whether guns are commonly used for defense and whether that use outweighs the downsides, such as homicides.

●One way to gather this data would be through a survey. You could ask a representative sample of Americans whether they own guns and, if so, whether they’ve used the guns to defend their homes in burglaries or defend themselves from being mugged. You could compare these numbers to law enforcement statistics of gun use in homicides and make an informed decision about whether the benefits outweigh the downsides.

●Such surveys have been done, with interesting results. One 1992 telephone survey estimated that American civilians use guns in self-defense up to 2.5 million times every year – that is, about 1% of American adults have defended themselves with firearms. Now, 34% of these cases were in burglaries, giving us 845,000 burglaries stymied by gun owners. But in 1992, there were only 1.3 million burglaries committed while someone was at home. Two thirds of these occurred while the homeowners were asleep and were discovered only after the burglar had left. That leaves 430,000 burglaries involving homeowners who were at home and awake to confront the burglar – 845,000 of which, we are led to believe, were stymied by gun-toting residents.[28]


●What happened? Why did the survey overestimate the use of guns in self-defense? Well, for the same reason that mammograms overestimate the incidence of breast cancer: there are far more opportunities for false positives than false negatives. If 99.9% of people have never used a gun in self-defense, but 1% of those people will answer “yes” to any question for fun, and 1% want to look manlier, and 1% misunderstand the question, then you’ll end up vastly overestimating the use of guns in self-defense.

●What about false negatives? Could this effect be balanced by people who say “no” even though they gunned down a mugger last week? No. If very few people genuinely use a gun in self-defense, then there are very few opportunities for false negatives. They’re overwhelmed by the false positives.

●This is exactly analogous to the cancer drug example earlier. Here, p is the probability that someone will falsely claim they’ve used a gun in self-defense. Even if p is small, your final answer will be wildly wrong.
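A rough sketch of the survey arithmetic, using the illustrative figures from a few paragraphs back: a true rate of 0.1% and three separate 1% groups of mistaken "yes" answers. It assumes those groups do not overlap, so about 3% of non-users say yes. The function name and the 50% false-negative rate in the second call are hypothetical.

```python
def apparent_yes_rate(true_rate, false_yes_rate, false_no_rate=0.0):
    """Fraction of respondents answering "yes", given how often each kind of wrong answer occurs."""
    honest_yes = true_rate * (1 - false_no_rate)
    mistaken_yes = (1 - true_rate) * false_yes_rate
    return honest_yes + mistaken_yes

true_rate = 0.001
observed = apparent_yes_rate(true_rate, false_yes_rate=0.03)
print(f"survey says {observed:.1%}, truth is {true_rate:.1%}")

# Even if half of the genuine defenders wrongly say "no", the estimate barely moves.
print(f"{apparent_yes_rate(true_rate, 0.03, false_no_rate=0.5):.1%}")
```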

●To lower p, criminologists make use of more detailed surveys. The National Crime Victimization surveys, for instance, use detailed sit-down interviews with researchers where respondents are asked for details about crimes and their use of guns in self-defense. With far greater detail in the survey, researchers can better judge whether the incident meets their criteria for self-defense. The results are far smaller – something like 65,000 incidents per year, not millions. There’s a chance that survey respondents underreport such incidents, but a much smaller chance of massive overestimation.


If at first you don’t succeed, try, try again 

●The base rate fallacy shows us that false positives are much more likely than you’d expect from a p<0.05 criterion for significance. Most modern research doesn’t make one significance test, however; modern studies compare the effects of a variety of factors, seeking to find those with the most significant effects.

●For example, imagine testing whether jelly beans cause acne by testing the effect of every single jelly bean color on acne:


●As you can see, making multiple comparisons means multiple chances for a false positive. For example, if I test 20 jelly bean flavors which do not cause acne at all, and look for a correlation at p<0.05 significance, I have a 64% chance of a false positive result.[54] If I test 45 materials, the chance of a false positive is as high as 90%.
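These percentages follow from a one-line formula, provided the comparisons are independent of one another (an assumption of this sketch, and one that real data will not always satisfy). The function name is my own.

```python
def chance_of_any_false_positive(n_comparisons, alpha=0.05):
    """Chance that at least one of n independent tests of a nonexistent effect comes out significant."""
    # Each test avoids a false positive with probability 1 - alpha;
    # all of them must do so for the study to come out clean.
    return 1 - (1 - alpha) ** n_comparisons

for n in (1, 20, 45):
    print(n, round(chance_of_any_false_positive(n), 2))
```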

●It’s easy to make multiple comparisons, and it doesn’t have to be as obvious as testing twenty potential medicines. Track the symptoms of a dozen patients for a dozen weeks and test for significant benefits during any of those weeks: bam, that’s twelve comparisons. Check for the occurrence of twenty-three potential dangerous side effects: alas, you have sinned. Send out a ten-page survey asking about nuclear power plant proximity, milk consumption, age, number of male cousins, favorite pizza topping, current sock color, and a few dozen other factors for good measure, and you’ll find that something causes cancer. Ask enough questions and it’s inevitable.

●A survey of medical trials in the 1980s found that the average trial made 30 therapeutic comparisons. In more than half of the trials, the researchers had made so many comparisons that a false positive was highly likely, and the statistically significant results they did report were cast into doubt: they may have found a statistically significant effect, but it could just as easily have been a false positive.[54]

●There exist techniques to correct for multiple comparisons. For example, the Bonferroni correction method says that if you make n comparisons in the trial, your criterion for significance should be p<0.05/n. This lowers the chances of a false positive to what you’d see from making only one comparison at p<0.05. However, as you can imagine, this reduces statistical power, since you’re demanding much stronger correlations before you conclude they’re statistically significant. It’s a difficult tradeoff, and tragically few papers even consider it.
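●As a rough illustration of the correction, here is a sketch that divides the threshold by the number of comparisons and flags which results survive. The function name and the five p values are hypothetical, chosen only to show the mechanics:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return (threshold, flags). flags[i] is True when p_values[i]
    falls below the Bonferroni-corrected threshold alpha / n."""
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p value")
    threshold = alpha / n
    return threshold, [p < threshold for p in p_values]


# Hypothetical p values from five comparisons made in one trial.
example_p = [0.003, 0.012, 0.04, 0.2, 0.6]
threshold, flags = bonferroni_significant(example_p)
print(threshold, flags)
```

●Three of these five results would pass an uncorrected p<0.05 test, but only the first passes the corrected threshold of 0.01. That is the tradeoff in miniature: fewer false positives, but also fewer detections.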


Red herrings in brain imaging 

●Neuroscientists do massive numbers of comparisons regularly. They often perform fMRI studies, where a three-dimensional image of the brain is taken before and after the subject performs some task. The images show blood flow in the brain, revealing which parts of the brain are most active when a person performs different tasks.

●But how do you decide which regions of the brain are active during the task? A simple method is to divide the brain image into small cubes called voxels. A voxel in the “before” image is compared to the voxel in the “after” image, and if the difference in blood flow is significant, you conclude that part of the brain was involved in the task. Trouble is, there are thousands of voxels to compare and many opportunities for false positives.
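●To get a feel for the scale of the problem, here is a small simulation sketch. It assumes a hypothetical image of 10,000 voxels in which nothing at all is happening, and it assumes the voxels are independent (real neighboring voxels are correlated). Under the null hypothesis a p value is uniformly distributed between 0 and 1, so each voxel's p value is simply drawn at random:

```python
import random


def count_false_positive_voxels(n_voxels, threshold, seed=0):
    """Count voxels declared 'active' when no voxel has any real effect.

    Each null p value is drawn uniformly from [0, 1); a voxel counts as
    active when its p value falls below the threshold."""
    rng = random.Random(seed)
    return sum(rng.random() < threshold for _ in range(n_voxels))


# Hypothetical numbers: 10,000 voxels, each tested at p < 0.001.
n_active = count_false_positive_voxels(10_000, 0.001)
print(n_active, "voxels look active by chance alone")
```

●Even at a strict per-voxel threshold of p<0.001, we expect about ten voxels to appear active in a brain where nothing is going on.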

●One study, for instance, tested the effects of an “open-ended mentalizing task” on participants. Subjects were shown “a series of photographs depicting human individuals in social situations with a specified emotional valence,” and asked to “determine what emotion the individual in the photo must have been experiencing.” You can imagine how various emotional and logical centers of the brain would light up during this test.

●The data was analyzed, and certain brain regions were found to change activity during the task. Comparison of images made before and after the mentalizing task showed a p=0.001 difference in an 81 mm³ cluster in the brain.

●The study participants? Not college undergraduates paid $10 for their time, as is usual. No, the test subject was one 3.8-pound Atlantic salmon, which “was not alive at the time of scanning.”[8]

●Of course, most neuroscience studies are more sophisticated than this; there are methods of looking for clusters of voxels which all change together, along with techniques for controlling the rate of false positives even when thousands of statistical tests are made. These methods are now widespread in the neuroscience literature, and few papers make errors as simple as the one I described. Unfortunately, almost every paper tackles the problem differently; a review of 241 fMRI studies found that they used 223 unique analysis strategies, which, as we will discuss later, gives the researchers great flexibility to achieve statistically significant results.


Controlling the false discovery rate 

●I mentioned earlier that techniques exist to correct for multiple comparisons. The Bonferroni procedure, for instance, says that you can get the right false positive rate by looking for p<0.05/n, where n is the number of statistical tests you’re performing. If you perform a study which makes twenty comparisons, you can use a threshold of p<0.0025 to be assured that there is only a 5% chance you will falsely decide a nonexistent effect is statistically significant.

●This has drawbacks. By lowering the p threshold required to declare a result statistically significant, you decrease your statistical power greatly, and fail to detect true effects as well as false ones. There are more sophisticated procedures than the Bonferroni correction which take advantage of certain statistical properties of the problem to improve the statistical power, but they are not magic solutions.
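●The size of the power loss can be sketched with a simplified model. The code below assumes a two-sided z test whose test statistic is normally distributed with unit standard deviation; the effect size of 2.8 is a hypothetical value picked so that a single test at p<0.05 has about 80% power. None of this comes from a real study:

```python
from statistics import NormalDist


def z_test_power(noncentrality, alpha):
    """Power of a two-sided z test at threshold alpha, when the test
    statistic is normal with mean `noncentrality` and unit variance."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection tail.
    return std.cdf(noncentrality - z_crit) + std.cdf(-noncentrality - z_crit)


noncentrality = 2.8  # hypothetical effect size, in standard errors
for n_tests in (1, 20):
    alpha = 0.05 / n_tests  # Bonferroni-corrected threshold
    print(n_tests, round(z_test_power(noncentrality, alpha), 2))
```

●Under these assumptions, correcting for twenty comparisons cuts the power roughly in half: the same true effect that would be detected about 80% of the time with one test is detected only about 40% of the time.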

●Worse, they don’t spare you from the base rate fallacy. You can still be misled by your p threshold and falsely claim there’s “only a 5% chance I’m wrong” – you just eliminate some of the false positives. A scientist is more interested in the false discovery rate: what fraction of my statistically significant results are false positives? Is there a statistical test that will let me control this fraction?

●For many years the answer was simply “no.” As you saw in the section on the base rate fallacy, we can compute the false discovery rate if we make an assumption about how many of our tested hypotheses are true, but we’d rather find that out from the data than guess.
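●The calculation that depends on such a guess can be sketched as follows. The inputs here (10% of tested hypotheses true, 80% power) are hypothetical values supplied for illustration, and the function name is my own:

```python
def false_discovery_rate(base_rate, power, alpha=0.05):
    """Expected fraction of statistically significant results that are
    false positives, given an ASSUMED fraction of true hypotheses."""
    true_positives = base_rate * power          # real effects detected
    false_positives = (1 - base_rate) * alpha   # null effects passing
    return false_positives / (true_positives + false_positives)


# Hypothetical inputs: 10% of hypotheses are true, tests have 80% power.
print(round(false_discovery_rate(0.10, 0.80), 2))
```

●With these inputs, 36% of the significant results are false positives, far more than the 5% a naive reading of the p threshold suggests. Change the assumed base rate and the answer changes with it, which is exactly why a guess is unsatisfying.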

●In 1995, Benjamini and Hochberg provided a better answer. They devised an exceptionally simple procedure which tells you which p values to consider statistically significant. I’ve been saving you from mathematical details so far, but to illustrate just how simple the procedure is, here it is:

1. Perform your statistical tests and get the p value for each. Make a list and sort it in ascending order.

2. Choose a false-discovery rate and call it q. Call the number of statistical tests m.

3. Find the largest p value such that p≤iq/m, where i is the p value’s place in the sorted list.

4. Call that p value and all smaller than it statistically significant.

●You’re done! The procedure guarantees that, on average, no more than a fraction q of your statistically significant results will be false positives (with q=0.05, no more than 5%).[7]
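●The four steps translate almost line for line into code. This is a minimal sketch rather than a reference implementation; the function name and the example p values are hypothetical:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, aligned with p_values, marking the
    results the procedure calls significant at false discovery rate q."""
    m = len(p_values)
    # Step 1: sort the p values, remembering where each one came from.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 3: find the largest rank i (1-based) with p <= i * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Step 4: that p value and every smaller one are significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p values from five tests, in the order they were run.
example_p = [0.01, 0.045, 0.025, 0.005, 0.5]
print(benjamini_hochberg(example_p))
```

●Here 0.005, 0.01, and 0.025 are declared significant, while 0.045 is not, even though it is below 0.05. Note that step 3 looks for the largest qualifying p value, so a small p value can be declared significant even if it misses its own threshold, as long as a larger one further down the list meets its threshold.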

●The Benjamini-Hochberg procedure is fast and effective, and it has been widely adopted by statisticians and scientists in certain fields. It usually provides better statistical power than the Bonferroni correction and friends while giving more intuitive results. It can be applied in many different situations, and variations on the procedure provide better statistical power when testing certain kinds of data.

●Of course, it’s not perfect. In certain strange situations, the Benjamini-Hochberg procedure gives silly results, and it has been mathematically shown that it is always possible to beat it in controlling the false discovery rate. But it’s a start, and it’s much better than nothing.


Ch.05 When differences in significance aren't significant differences


Ch.06 Stopping rules and regression to the mean


Ch.07 Researcher freedom: Good vibration?


Ch.08 Everybody makes mistakes


Ch.09 Hiding the data


Ch.10 What have we wrought?


Ch.11 What can be done?





 If you have any comments, I would be grateful if you would send them to think0298(at sign)ybb.ne.jp.

 Homepage address: https://think0298.stars.ne.jp