ラインハート「ダメな統計学」(2017)

2020.9.14

　この本は、Alex Reinhart Statistics Done Wrong (2015) の翻訳です。

科学者も、統計を誤用していると指摘する本で、勉強になります。

著者は、出版前から、web版を公開していました。

　　　　https://www.statisticsdonewrong.com/

書籍版は、web版をかなり詳しくしたもので、勉強するなら書籍版をお勧めしますが、

出版後もこのサイトは、維持されていて、web版を読むことができます。

　また、翻訳者の西原史暁さんによる日本語解説が、西原さんのサイトにあります。

　　　　https://id.fnshr.info/2015/04/04/stats-done-wrong-book/

　web版と書籍版の違いについての説明があるだけでなく、

西原さんによるweb版の翻訳のpdf版のダウンロードが可能です。

　著者のサイトに、原典サイトを示すなら、web版の翻訳は自由と書かれていますので、

一部、対訳を掲載して、統計分野の英文読解の能力開発に貢献したいと思います。

書籍版目次

Introduction　はじめに

Ch.01 An introduction to statistical significance　統計的有意性入門

Ch.02 Statistical power and underpowered statistics
　　　　統計力（検定力）と統計力不足の統計

Ch.03 Pseudoreplication: choose your data wisely
　　　　擬似反復：データを賢く選べ

Ch.04 The p value and the base rate fallacy　p値と基礎比率誤謬

Ch.05 Bad judges of significance　有意性の間違った判断

Ch.06 Double-dipping in the data　データの二度漬け

Ch.07 Continuity errors　連続性誤差

Ch.08 Model Abuse　モデルの誤用

Ch.09 Researcher freedom: Good vibration?　研究者の自由：

Ch.10 Everybody makes mistakes　誰もがミスをします

Ch.11 Hiding the data　データの隠蔽

Ch.12 What can be done?　何ができるか？

web版目次

Introduction　はじめに

Ch.01 An introduction to data analysis　データ分析入門

Ch.02 Statistical power and underpowered statistics
　　　　統計力（検定力）と統計力不足の統計

Ch.03 Pseudoreplication: choose your data wisely
　　　　擬似反復：データを賢く選べ

Ch.04 The p value and the base rate fallacy　p値と基礎比率誤謬

Ch.05 When differences in significance aren't significant differences
　　　　有意性の違いが、有意な違いではないとき

Ch.06 Stopping rules and regression to the mean　停止規則と平均への回帰

Ch.07 Researcher freedom: Good vibration?　研究者の自由：

Ch.08 Everybody makes mistakes　誰もがミスをします

Ch.09 Hiding the data　データの隠蔽

Ch.10 What have we wrought?　何をしてきたか？

Ch.11 What can be done?　何ができるか？

●Introduction　はじめに

●In the final chapter of his famous book How to Lie with Statistics,
最後の章で｜彼の有名な本「統計で如何にして嘘をつくか」の｜、

Darrell Huff tells us that “anything smacking of the medical profession” or published by scientific laboratories and universities is worthy of our trust – not unconditional trust, but certainly more trust than we’d afford the media or shifty politicians.
ダレル･ハフは、語る｜「専門のお医者さん」の気配があるものすべて、科学の研究所や大学が出版したものすべては、信頼するに値する、無条件の信頼ではないにしても、メディアやずる賢い政治家に与えるべき信頼よりはずっと大きい信頼に、価すると｜。

After all, Huff filled an entire book with the misleading statistical trickery used in politics and the media,
成程、ハフは、埋めました｜彼の本全体を｜人を惑わす統計のトリックで｜政治やメディアで使われている｜。

but few people complain about statistics done by trained professional scientists.
しかし、殆どの人は｜統計について不満を言いません｜訓練されたプロの科学者による｜。

Scientists seek understanding, not ammunition to use against political opponents.
科学者は、理解を求めるのであって、政敵に対して用いる弾薬を求めているのではありません。

●Statistical data analysis is fundamental to science.
統計によるデータ解析は、科学の基本です。

Open a random page in your favorite medical journal and you’ll be deluged with statistics:
あなたのお気に入りの医学雑誌の任意のページを開いてごらんなさい、あなたは、圧倒されるでしょう｜統計学に｜：

t tests, p values, proportional hazards models, risk ratios, logistic regressions, least-squares fits, and confidence intervals.
ｔ検定、ｐ値、比例ハザードモデル、リスク比、ロジスティック回帰、最小二乗あてはめ、信頼区間。

Statisticians have provided scientists with tools of enormous power to find order and meaning in the most complex of datasets,
統計学者は、与えました｜科学者に｜とてつもない力を持つ道具を｜秩序や意味をみつけるための｜最も複雑なデータセットの中に｜。

and scientists have embraced them with glee.
科学者は、抱擁しました｜それらを｜大喜びで｜。

●They have not, however, embraced statistics education,
しかし、彼らは、統計の教育は抱擁しませんでした。

and many undergraduate programs in the sciences require no statistical training whatsoever.
そして、科学の学部課程の多くは、統計学の授業を何等求めていません。

●Since the 1980s, researchers have described numerous statistical fallacies and misconceptions in the popular peer-reviewed scientific literature,
1980年代以降、研究者は、記述してきました｜沢山の統計の誤謬や誤解を｜評判の良い査読付きの科学文献の中の｜。

and have found that many scientific papers – perhaps more than half – fall prey to these errors.
そして、見出しました｜多くの科学論文 - 恐らく半分以上 - は、これらの誤りの犠牲になっていることを｜。

Inadequate statistical power renders many studies incapable of finding what they’re looking for;
不十分な統計力（検定力）は、状態にします｜多くの研究を｜探しているものを見つけることができない｜；

multiple comparisons and misinterpreted p values cause numerous false positives;
多重比較や、ｐ値の解釈の間違いは、数多くの偽陽性を引き起こします；

flexible data analysis makes it easy to find a correlation where none exists.
柔軟なデータ解析は、容易にします｜相関を見つけることを｜相関などない所に｜。

The problem isn’t fraud but poor statistical education – poor enough that some scientists conclude that most published research findings are probably false.
問題は、詐欺ではなく、貧弱な統計教育なのです - 余りに貧弱なので、何人かの科学者は、結論します｜公表された科学研究の発見の多くは、多分、誤りだろうと｜。

●What follows is a list of the more egregious statistical fallacies regularly committed in the name of science.
以下に続くもの(これから語ること)は、リストです｜特に甚だしい統計の誤謬の｜科学の名の下に日常的に犯されている｜。

It assumes no knowledge of statistical methods, 　統計の手法への知識は必要としません

since many scientists receive no formal statistical training.
多くの科学者は、受けていませんから｜正式な統計学の訓練を｜。

And be warned: once you learn the fallacies, you will see them everywhere.
しかし、注意しておきます：いったん、この誤りを学んだら、あなたは、あちこちで、それを見つけるでしょう。

Don’t be alarmed. 　驚かないでください。

This isn’t an excuse to reject all modern science and return to bloodletting and leeches
これは、言い訳ではありません｜すべての近代科学を拒否し、瀉血やヒルに戻るための｜。

– it’s a call to improve the science we rely on.
これは、呼びかけです｜改良するための｜私達が頼りにする科学を｜。

●Ch.01 An introduction to data analysis　データ分析入門

●Much of experimental science comes down to measuring changes.
多くの実験科学は、帰着する｜変化の測定に｜。

Does one medicine work better than another?
ある薬は、他の薬よりもよく効くか？

Do cells with one version of a gene synthesize more of an enzyme than cells with another version?
あるバージョンの遺伝子を持つ細胞は、合成するか｜より沢山の酵素を｜別のバージョンの遺伝子を持つ細胞よりも｜？

Does one kind of signal processing algorithm detect pulsars better than another?
或る種の信号処理アルゴリズムは、パルサーを検出するか｜別のよりもよりよく｜？

Is one catalyst more effective at speeding a chemical reaction than another?
ある触媒は、より効果的か｜化学反応の高速化に｜別のよりも｜？

●Much of statistics, then, comes down to making judgments about these kinds of differences.
ですから、多くの統計は、帰着します｜判断をすることに｜こういった種類の違いについて｜。

We talk about “statistically significant differences”
私達は、「統計的に有意な違い」について語ります

because statisticians have devised ways of telling if the difference between two measurements is really big enough to ascribe to anything but chance.
何故なら、統計家は、開発したから｜告げる方法を｜二つの測定の間の違いが、実際、偶然以外の何かに帰着できるほど十分大きいかを｜。

●Suppose you’re testing cold medicines. 　あなたが、風邪薬を試験しているとしよう。

Your new medicine promises to cut the duration of cold symptoms by a day.
新薬は、約束する｜風邪の症状の持続を１日だけ短くすることを｜。

To prove this, you find twenty patients with colds and give half of them your new medicine and half a placebo.
これを証明するために、あなたは、風邪の患者を20人見つけてきて、その半分に新薬を与え、半分にプラシーボ(偽薬)を与える。

Then you track the length of their colds and find out what the average cold length was with and without the medicine.
それから、あなたは、追跡します｜風邪の長さを｜、そして、見出します｜風邪の平均の長さがいくつであったかを｜薬ありと薬なしの場合で｜。

●But all colds aren’t identical. 　しかし、全ての風邪は、同一ではない。

Perhaps the average cold lasts a week, 　多分、平均の風邪は、１週間続きます

but some last only a few days, and others drag on for two weeks or more, straining the household Kleenex supply.
しかし、あるものは、二三日しか続かず、あるものは、２週間かそれ以上引っ張って、その家のクリネックスの供給を逼迫させます。

It’s possible that the group of ten patients receiving genuine medicine will be the unlucky types to get two-week colds, and so you’ll falsely conclude that the medicine makes things worse.
可能です｜本当の薬を受け取る患者グループの10人が、２週間の風邪をひく不運なタイプであり、この薬は、物事を悪くすると誤って結論することは｜。

How can you tell if you’ve proven your medicine works, rather than just proving that some patients are unlucky?
どうすれば言えるのでしょう｜あなたが、薬は効くと証明したと｜ただ、ある患者は不運でしたと証明するのではなく｜？

●The power of p values　ｐ値の偉力

●Statistics provides the answer. 　統計学は、答えを与えます。

If we know the distribution of typical cold cases – roughly how many patients tend to have short colds, or long colds, or average colds –
もし、解っていれば｜典型的な風邪の症例の分布 - おおまかにどれだけの患者が短い風邪、長い風邪、平均的な風邪をひく傾向にあるか - が｜、

we can tell how likely it is for a random sample of cold patients to have cold lengths all shorter than average, or longer than average, or exactly average.
言う事ができます｜どれだけあり得るかを｜無作為抽出した風邪の患者の風邪の長さが、全員平均より短いか、平均より長いか、丁度平均であることが｜。

By performing a statistical test, we can answer the question “If my medication were completely ineffective, what are the chances I’d see data like what I saw?”
統計的検定を実施することにより、私達は、質問に答える事ができます｜「もし私の薬剤が完全に効果なしの場合、私が見たようなデータを見る確率はどれくらいですか？」という｜。

●That’s a bit tricky, so read it again.　　ここはいささかトリッキーなので、もう一度読んでください。

●Intuitively, we can see how this might work. 　直観的には、これがいかに働くか理解出来ます。

If I only test the medication on one person, it’s unsurprising if he has a shorter cold than average – about half of patients have colds shorter than average.
もし私が、一人にしか薬剤を試験しなかったなら、驚くべきことではありません｜もしその彼が平均より短い風邪をひいても｜、約半分の患者は、平均より短い風邪をひくのですから。

If I test the medication on ten million patients, it’s pretty damn unlikely that all of them will have shorter colds than average, unless my medication works.
もし私が、薬剤を1000万人の患者に試験したのなら、ものすごくあり得ないことです｜全員が平均より短い風邪をひくことは、薬剤が機能している場合を除いて｜。

●The common statistical tests used by scientists produce a number called the p value that quantifies this.
科学者達が用いる通常の統計的検定では、このことを定量化するｐ値と呼ばれる数値が出されます。

Here’s how it’s defined:　いかにそれが定義されるかが、これです：

●The p value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.
p値は、定義されます｜確率と｜効果がないか違いがない(帰無仮説)という仮定の下で、実際に観察されたものと同じか、それよりも極端な結果が得られる｜。

So if I give my medication to 100 patients and find that their colds are a day shorter on average,
そこで、もし私が、患者100人に薬剤を与えて、彼らの風邪が平均して１日短いことを見出したとき、

the p value of this result is the chance that, if my medication didn’t do anything at all, my 100 patients would randomly have, on average, day-or-more-shorter colds.
この結果のｐ値は、確率です｜もし私の薬剤が何もしなかったときに、患者100人が、平均して、1日以上短い風邪をランダムにひくところの｜。

Obviously, the p value depends on the size of the effect – colds shorter by four days are less likely than colds shorter by one day – and the number of patients I test the medication on.
明らかに、ｐ値は、依存します｜効果の大きさ - ４日短い風邪は１日短い風邪よりもよりありそうでない - と、薬剤試験を行った患者の数に｜。
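The definition above can be made concrete with a small simulation. This sketch is not part of Reinhart's text. It assumes, purely for illustration, that cold lengths follow a normal distribution with a mean of 7 days and a standard deviation of 2 days, and that the medicated patients are compared against that known 7-day average. The helper name `simulated_p_value` and all of its parameters are hypothetical.

```python
import random
from statistics import fmean

def simulated_p_value(n_patients, observed_gain=1.0, mean_days=7.0,
                      sd_days=2.0, n_simulations=5000, seed=0):
    """Estimate the chance that n_patients on a useless drug average
    colds at least observed_gain days shorter than usual."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_simulations):
        # Null hypothesis: the drug does nothing, so every cold length is
        # drawn from the ordinary (assumed) distribution of colds.
        colds = [rng.gauss(mean_days, sd_days) for _ in range(n_patients)]
        if mean_days - fmean(colds) >= observed_gain:
            hits += 1
    return hits / n_simulations

p_with_10_patients = simulated_p_value(10)
p_with_100_patients = simulated_p_value(100)
print(p_with_10_patients, p_with_100_patients)
```

Under these assumptions the same one-day gain is unremarkable with 10 patients but very surprising with 100, which is the dependence on sample size that the text describes.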

●That’s a tricky concept to wrap your head around.
これは、やっかいな概念です｜あなたが自在に操るには｜。

A p value is not a measure of how right you are, or how significant the difference is;
ｐ値は、測度ではありません｜あなたが如何に正しいか、または、違いが如何に有意であるかの｜；

it’s a measure of how surprised you should be if there is no actual difference between the groups, but you got data suggesting there is.
それは、測度なのです｜いかにあなたが驚くべきかという｜もし、グループ間に実際の違いがないのに、違いがあるというデータが得られた場合に｜。

A bigger difference, or one backed up by more data, suggests more surprise and a smaller p value.
より大きな違い、または、より多くのデータに支えられた違いは、示唆します｜より大きな驚きと、より小さいｐ値を｜。

●It’s not easy to translate that into an answer to the question “is there really a difference?”
そのことを、「本当の違いはあるの？」という問いへの答えに翻訳することは、やさしいことではありません。

Most scientists use a simple rule of thumb: 　多くの科学者は、簡単なルールを使います：

if p is less than 0.05, there’s only a 5% chance of obtaining this data unless the medication really works,
もし、ｐ値が0.05未満なら、5％の確率しかありません｜このデータを得る｜薬剤が本当に効くのでない限り｜。

so we will call the difference between medication and placebo “significant.”
そこで、私達は、呼びます｜薬剤と偽薬の間の違いを「有意である」と｜。

If p is larger, we’ll call the difference insignificant.　もしｐ値がより大きいと、私達は、違いは有意ではないと呼びます。

●But there are limitations. 　しかし、限界があります。

The p value is a measure of surprise, not a measure of the size of the effect.
ｐ値は、驚きの測度であって、効果の大きさの測度ではありません。

I can get a tiny p value by either measuring a huge effect – “this medicine makes people live four times longer” – or by measuring a tiny effect with great certainty.
私は、得ることができます｜小さいｐ値を｜巨大な効果 - 「この薬剤は４倍長生きさせます」 - を測定するか、または、確実性の高い小さな効果を測定することにより｜。

Statistical significance does not mean your result has any practical significance.
統計的に有意であることは、意味しません｜結果が実用上の重要性を持つことを｜。
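A short calculation shows how a tiny effect can still yield a tiny p value. This is an illustrative sketch, not the author's method: it assumes a two-sided z test on two group means with a known standard deviation of 2 days, and the function name `two_group_p_value` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_group_p_value(mean_difference, sd, n_per_group):
    """Two-sided p value of a z test comparing two group means."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = abs(mean_difference) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(z))

# Large effect, small trial: colds 3 days shorter, 10 patients per group.
p_large_effect = two_group_p_value(3.0, sd=2.0, n_per_group=10)

# Tiny effect, huge trial: colds 0.05 days (about an hour) shorter,
# 100,000 patients per group.
p_tiny_effect = two_group_p_value(0.05, sd=2.0, n_per_group=100_000)

print(p_large_effect, p_tiny_effect)
```

Both results fall below 0.05 under these assumptions, yet only the first describes a difference a patient would notice.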

●Similarly, statistical insignificance is hard to interpret.
同様に、統計的な非有意性も、解釈が難しい。

I could have a perfectly good medicine, 　私が完全に良い薬を持っているとして、

but if I test it on ten people, I’d be hard-pressed to tell the difference between a real improvement in the patients and plain good luck.
それを10人にしか試験しなかったら、私は、せっぱ詰まるでしょう｜患者への本当の改良なのか単なる幸運なのか違いを言えといわれても｜。

Alternately, I might test it on thousands of people,
あるいは、私が、何千人もの人にテストしたとしても、

but the medication only shortens colds by three minutes, and so I’m simply incapable of detecting the difference.
薬が、風邪を３分しか短くしないなら、私は、単に、違いを検出することができません。

A statistically insignificant difference does not mean there is no difference at all.
統計的に有意でない差は、違いが全くないことを意味しません。

●There’s no mathematical tool to tell you if your hypothesis is true;
数学的道具はありません｜あなたの仮説が本当かどうかを判断する｜；

you can only see whether it is consistent with the data,
あなたは、みることができるだけです｜仮説がデータと矛盾しないかどうかを｜、

and if the data is sparse or unclear, your conclusions are uncertain.
そして、データが少ないか明確でなければ、あなたの結論も不明確になります。

●But we can’t let that stop us.　しかし、それで止めるわけにはいかない。

●Ch.02 Statistical power and underpowered statistics
　　　　統計力（検定力）と統計力不足の統計

●We’ve seen that it’s possible to miss a real effect simply by not taking enough data.
私達は、見てきました｜真の効果を見失う可能性があることを｜単に十分なデータをとらないだけで｜。

In most cases, this is a problem: 　殆どの場合、これが問題です。

we might miss a viable medicine or fail to notice an important side-effect.
私達は、成功しそうな薬を見逃すかもしれないし、重大な副作用に気づくことに失敗するかもしれない。

How do we know how much data to collect?
いかにして知ることができるのか｜データをどれだけ集めるべきか｜？

●Statisticians provide the answer in the form of “statistical power.”
統計学者は、提供します｜解答を｜「統計力」という形で｜。

The power of a study is the likelihood that it will distinguish an effect of a certain size from pure luck.
研究の力とは、可能性のことです｜ある程度の大きさの効果を単なる幸運から分別する｜。

A study might easily detect a huge benefit from a medication,
研究は、簡単に検出できるかもしれません｜薬剤から得られる巨大な利益なら｜。

but detecting a subtle difference is much less likely.
しかし、微妙な違いを検出することは、ずっと可能性が低いのです。

Let’s try a simple example.　簡単な例を見てみよう。

●Suppose a gambler is convinced that an opponent has an unfair coin:
ギャンブラーが、相手が不正なコインを持っていると確信しているとしよう：

rather than getting heads half the time and tails half the time, the proportion is different,
半分の回数で表が出て、半分の回数で裏が出るのではなく、比率が異なっているのです、

and the opponent is using this to cheat at incredibly boring coin-flipping games.
相手は、これを使って、信じられないほど退屈なコイン投げゲームで、胡麻化しているのです。

How to prove it?　いかにして証明するか？

●You can’t just flip the coin a hundred times and count the heads.
あなたは、できません｜単にコインを100回投げて、表を数えるなんてことは｜。

Even with a perfectly fair coin, you don’t always get fifty heads:
完全に公平なコインですら、常に50回表がでるとは限りません。

図　表の数　対　確率

●This shows the likelihood of getting different numbers of heads, if you flip a coin a hundred times.
この図は、示します｜表が出る数の確率を｜コインを100回投げた時｜。

●You can see that 50 heads is the most likely option,
わかります｜50回表が、最もあり得るオプションであることが｜。

but it’s also reasonably likely to get 45 or 57.
しかし、45回や57回を得ることも、かなりあり得ます。

So if you get 57 heads, the coin might be rigged, but you might just be lucky.
そこで、57回表がでたとして、コインは、不正操作されていたかもしれないし、単に幸運だったのかもしれない。
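The probabilities in the figure can be recomputed from the binomial distribution. The sketch below is an illustration rather than the author's code; `prob_heads` is a hypothetical helper name.

```python
from math import comb

def prob_heads(k, n=100):
    """Probability of exactly k heads in n flips of a fair coin."""
    return comb(n, k) / 2**n

for k in (45, 50, 57):
    print(k, round(prob_heads(k), 4))

# Chance that a fair coin gives 57 or more heads in 100 flips.
prob_57_or_more = sum(prob_heads(k) for k in range(57, 101))
print(round(prob_57_or_more, 3))
```

Even the most likely outcome, exactly 50 heads, occurs in fewer than one run in ten, and a fair coin gives 57 or more heads often enough that such a result alone cannot show rigging.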

●Let’s work out the math. 　数学を解いてみよう。

Let’s say we look for a p value of 0.05 or less, as scientists typically do.
例えば、ｐ値が0.05以下になるものを探すことにしよう、科学者が典型的にするように。

That is, if I count up the number of heads after 10 or 100 trials and find a deviation from what I’d expect – half heads, half tails –
つまり、もし私が、表の数を数えて｜10回か100回試行した後｜、ずれを求めて｜予想する - 半分表、半分裏 - からの｜、

I call the coin unfair if there’s only a 5% chance of getting a deviation that size or larger with a fair coin.
私は、そのコインは不正と呼びます｜もし5％しかなければ｜公正なコインで、その大きさかそれ以上のずれが生じる確率が｜。

Otherwise, I can conclude nothing: 　そうでなければ、私は、何も結論できない。

the coin may be fair, or it may be only a little unfair.
コインは、公正かもしれない、または、ちょっと不正かもしれない。

I can’t tell.　私には、わかりません。

●So, what happens if I flip a coin ten times and apply these criteria?
すると、何が起きますか｜もしコインを10回投げて、この基準を適用すると｜？

図　表がでる確率　対　統計力

●This is called a power curve. 　これは、統計力曲線と呼ばれます。

Along the horizontal axis, we have the different possibilities for the coin’s true probability of getting heads, corresponding to different levels of unfairness.
横軸には、様々な可能性をとります｜コインが表になる真の確率の｜不正の様々なレベルに応じて｜。

On the vertical axis is the probability that I will conclude the coin is rigged after ten tosses, based on the p value of the result.
縦軸は、確率です｜私が、コインが不正操作を受けていると結論する｜10回のトスの後、その結果のｐ値に基いて｜。

●You can see that if the coin is rigged to give heads 60% of the time, and I flip the coin 10 times, I only have a 20% chance of concluding that it’s rigged.
わかります｜もし、コインが不正操作されていて60％の割合で表がでるとして、私が10回コインを投げると、それが不正操作されていると結論できる確率は、20％しかないことが｜。

There’s just too little data to separate rigging from random variation.
データが少なすぎるのです｜ランダム変動から不正操作を分離するためには｜。

The coin would have to be incredibly biased for me to always notice.
コインは、信じられないくらい偏っていないといけないのです｜私が常に気づくためには｜。

●But what if I flip the coin 100 times?　しかし、コインを100回なげるとどうなる？

●With one thousand flips, I can easily tell if the coin is rigged to give heads 60% of the time.
1000回投げると、私は、簡単にわかります｜コインが不正操作されていて、60％の割合で表がでるかどうかが｜。

It’s just overwhelmingly unlikely that I could flip a fair coin 1,000 times and get more than 600 heads.
圧倒的にもっともらしくないのです｜公正なコインを1000回投げて、600回以上表がでることは｜。
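How power grows with the number of flips can be sketched with an exact binomial calculation. This is an illustration under stated assumptions, not the author's computation: it uses a two-sided test with a symmetric rejection region at the 0.05 level, and the helper names are hypothetical. The exact values depend on how the test is set up, so they need not match numbers read off the author's power curves.

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """P(exactly k heads in n flips), computed in log space to avoid overflow.
    Requires 0 < p < 1."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1.0 - p))
    return exp(log_pmf)

def rejection_cutoff(n, alpha=0.05):
    """Largest c with P(heads <= c or heads >= n - c | fair coin) <= alpha.
    Returns -1 when no outcome is extreme enough to reject."""
    cutoff, lower_tail = -1, 0.0
    for c in range(n // 2):
        lower_tail += binom_pmf(c, n, 0.5)
        if 2.0 * lower_tail > alpha:
            break
        cutoff = c
    return cutoff

def power(n, true_p, alpha=0.05):
    """Chance of calling the coin unfair when it lands heads with probability true_p."""
    c = rejection_cutoff(n, alpha)
    return sum(binom_pmf(k, n, true_p)
               for k in range(n + 1) if k <= c or k >= n - c)

for flips in (10, 100, 1000):
    print(flips, round(power(flips, 0.6), 3))
```

With this particular test, ten flips almost never expose a coin that lands heads 60% of the time, while a thousand flips almost always do.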

●The power of being underpowered　統計力不足の威力

●After hearing all this, you might think calculations of statistical power are essential to medical trials.
このことを全部聞いて、あなたは、思うでしょう｜統計力の計算は、医学のトライアルに欠かせないと｜。

A scientist might want to know how many patients are needed to test if a new medication improves survival by more than 10%,
科学者は、知りたいと思うかもしれません｜何人の患者が必要かを｜新しい薬剤が生存率を10％以上改善するかどうかテストするために｜。

and a quick calculation of statistical power would provide the answer.
統計力の簡単な計算で、その答えが提供されるかもしれません。

Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of concluding there’s a real effect.
科学者達は、通常、満足しています｜統計力が0.8かそれ以上であれば｜実際の効果があると結論づける80％の確率に対応する｜。
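The "quick calculation" can be sketched with a standard textbook approximation. The example is an assumption-laden illustration, not taken from the text: it compares two group means (as in the cold example) rather than survival rates, uses a two-sided z test with known standard deviation, and the function name `patients_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def patients_per_group(difference, sd, alpha=0.05, power=0.8):
    """Approximate patients needed per group for a two-sided z test of two means."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)              # about 0.84 for power = 0.8
    return ceil(2.0 * ((z_alpha + z_power) * sd / difference) ** 2)

# Detect a one-day difference when cold lengths vary with sd = 2 days (assumed).
print(patients_per_group(difference=1.0, sd=2.0))
# Halving the difference to detect roughly quadruples the patients needed.
print(patients_per_group(difference=0.5, sd=2.0))
```

The required sample grows with the inverse square of the difference to be detected, which is why subtle effects need far larger trials.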

●However, few scientists ever perform this calculation, and few journal articles ever mention the statistical power of their tests.
しかし、殆どの科学者は、この計算を実施しないし、殆どの学術誌記事は、このテストの統計力について言及しません。

●Consider a trial testing two different treatments for the same condition.
考えましょう｜あるトライアルを｜同じ病気に対する二つの異なる治療法をテストする｜。

You might want to know which medicine is safer, but unfortunately, side effects are rare.
あなたは、どちらの薬がより安全かを知りたい、しかし、不幸にも、副作用は、希です。

You can test each medicine on a hundred patients, 　100人の患者に薬品の試験を行いますが、

but only a few in each group suffer serious side effects.　ほんの少しにだけ重大な副作用があります。

●Obviously, you won’t have terribly much data to compare side effect rates.
明らかに、非常に多くのデータは得られません｜副作用の率を比較するための｜。

If four people have serious side effects in one group, and three in the other, you can’t tell if that’s the medication’s fault.
もし、一方のグループで４人、他方のグループで３人に重大な副作用が出たとしても、それが薬のせいかどうかは判断できません。

●Unfortunately, many trials conclude with “There was no statistically significant difference in adverse effects between groups” without noting that there was insufficient data to detect any but the largest differences.
不幸にも、多くのトライアルは、結論します｜「グループ間の副作用において統計的に有意な違いはありませんでした」と｜データ量が不十分で最大の違い以外は検出できないことは指摘しないで｜。

And so doctors erroneously think the medications are equally safe, when one could well be much more dangerous than the other.
そこで医者は、誤って考えます｜薬剤は等しく安全だと｜一方が他方よりもずっと危険でありえるのに｜。

●You might think this is only a problem when the medication only has a weak effect.
あなたは、思うかも知れない｜これは薬剤が弱い効果しかもたないときの問題でしかないと｜。

But no: 　しかし、違います。

in one sample of studies published between 1975 and 1990 in prestigious medical journals,
ある標本では｜1975年から1990年の間に出版された研究の｜権威ある医学雑誌に｜、

27% of randomized controlled trials gave negative results,
ランダム化比較試験の27％が、否定的な結果を示していましたが、

but 64% of these didn’t collect enough data to detect a 50% difference in primary outcome between treatment groups.
このうち64％は、十分な量のデータを集めていませんでした｜治療グループ間の主要出力(評価項目)の50％差異を検出するのに｜。

Fifty percent! 　50％です。

Even if one medication decreases symptoms by 50% more than the other medication, there’s insufficient data to conclude it’s more effective.
たとえ、ある薬が、他の薬より、50％より多く症状を減らすとしても、それがもっと効果的であると結論できるだけのデータが不十分なのです。

And 84% of the negative trials didn’t have the power to detect a 25% difference.
否定的なトライアルの84％は、25％の差異を検出する統計力が無かったのです。

●In neuroscience the problem is even worse. 神経科学では、問題は、さらに悪い。

Suppose we aggregate the data collected by numerous neuroscience papers investigating one particular effect and arrive at a strong estimate of the effect’s size.
集約するとしよう｜データを｜ある特定の効果を調査した数多くの神経科学の論文が集めた｜、そして、その効果の大きさについて確かな推定値が得られたとしよう。

The median study has only a 20% chance of being able to detect that effect.
中央値の研究には、その効果を検出できる確率が20％しかない。

Only after many studies were aggregated could the effect be discerned.
多くの研究が集約された後で初めて、その効果は識別できたのです。

Similar problems arise in neuroscience studies using animal models – which raises a significant ethical concern.
同様の問題は、起こります｜神経科学の研究において｜動物モデルを用いた｜、このことは、重大な倫理的懸念を引き起こします。

If each individual study is underpowered, the true effect will only likely be discovered after many studies using many animals have been completed and analyzed, using far more animal subjects than if the study had been done properly the first time.
もし、個々の研究が、統計力において弱められているなら、真の効果は、発見されそうにありません｜多くの研究が、多くの動物を使って、遂行し分析された後でしか｜、もし、研究が、最初に、適切になされていたときよりも、ずっと多くの実験動物を使いながら。

●That’s not to say scientists are lying when they state they detected no significant difference between groups.
これは、科学者達が嘘をついているといっているのではありません｜彼らがグループ間に有意な違いは検出できなかったと言うときに｜。

You’re just misleading yourself when you assume this means there is no real difference.
あなたは、単に、判断を誤ったのです｜このことが真の違いはないことを意味すると思い込んだときに｜。

There may be a difference, but the study was too small to notice it.
差異はあるかもしれません、研究が、それに気付くには小さすぎたのです。

●Let’s consider an example we see every day.　毎日見る例を考えてみましょう。

●The wrong turn on red　赤信号での誤った右折

●In the 1970s, many parts of the United States began to allow drivers to turn right at a red light.
1970年代に、米国の多くの地域で、運転者が赤信号で右折することが許可され始めました。

For many years prior, road designers and civil engineers argued that allowing right turns on a red light would be a safety hazard, causing many additional crashes and pedestrian deaths.
先立つ長年の間、道路設計者や土木技師達は、議論しました｜赤信号で右折を許すことは、安全上の危険であって、事故の増加や歩行者の死亡を引き起こすと｜。

But the 1973 oil crisis and its fallout spurred politicians to consider allowing right turn on red to save fuel wasted by commuters waiting at red lights.
しかし、1973年の石油危機とその副産物は、政治家に拍車をかけました｜赤信号で右折を許可することは、赤信号で待っている通勤者が浪費する燃料を節約することになると考えるように｜。

●Several studies were conducted to consider the safety impact of the change.
いくつかの研究が実施されました｜この変化が安全性に与えるインパクトを考察するために｜。

For example, a consultant for the Virginia Department of Highways and Transportation conducted a before-and-after study of twenty intersections which began to allow right turns on red.
例えば、バージニアの高速道路・運輸部門のコンサルタントは、実施しました｜20箇所の交差点での事前事後研究を｜赤信号で右折の許可を始めた｜。

Before the change there were 308 accidents at the intersections;
変更前は、これらの交差点で308件の事故がありました。

after, there were 337 in a similar length of time.
変更後は、同程度の期間に、337件の事故がありました。

However, this difference was not statistically significant,
しかし、この差異は、統計的に有意ではありませんでした。

and so the consultant concluded there was no safety impact.
そこで、コンサルタントは、安全性のインパクトは無いと結論しました。
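Why 308 versus 337 accidents fails to reach significance can be checked with a simple exact test. The sketch below is not the consultant's analysis, whose method the text does not describe. It assumes the two observation periods are equally long, so that under "no change in accident rate" each of the 645 accidents is equally likely to fall in either period. The function name is hypothetical.

```python
from math import comb

def two_sided_count_p_value(before, after):
    """Exact two-sided p value for 'both periods have the same accident rate'.

    If the rates are equal, each accident lands in either period with
    probability 1/2, like a fair coin flip.
    """
    total = before + after
    larger = max(before, after)
    upper_tail = sum(comb(total, k) for k in range(larger, total + 1)) / 2**total
    return min(1.0, 2.0 * upper_tail)

p = two_sided_count_p_value(308, 337)
print(round(p, 2))
```

Under these assumptions the p value is well above 0.05, so a study of this size cannot distinguish a real increase of about 9% from chance. That is a limitation of the data, not evidence of safety.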

●Several subsequent studies had similar findings:
これに続くいくつかの研究も、同様の結果でした：

small increases in the number of crashes, but not enough data to conclude these increases were significant.
衝突事故の数は少し増加、しかし、この増加が有意であると結論するにはデータが不十分。

As one report concluded,　ある報告は、こう結論しています

There is no reason to suspect that pedestrian accidents involving RT operations (right turns) have increased after the adoption of [right turn on red]…
疑う理由はありません｜RT操作(右折)を伴う歩行者事故が増加したと｜[赤信号での右折]の採用後に｜…

●Based on this data, more cities and states began to allow right turns at red lights.
このデータに基づいて、さらに多くの都市や州で、赤信号での右折が認められるようになりました。

The problem, of course, is that these studies were underpowered.
問題は、勿論、これらの研究が統計力不足だったことです。

More pedestrians were being run over and more cars were involved in collisions,
より多くの歩行者が轢かれ、より多くの車が衝突に巻き込まれていたのです、

but nobody collected enough data to show this conclusively
しかし、誰も、このことを決定的に示すに十分なデータを集めなかったのです。

until several years later, when studies arrived clearly showing the results:
数年後までは｜研究がもたらされて以下の結果を明確に示す｜、

significant increases in collisions and pedestrian accidents (sometimes up to 100% increases).
衝突と歩行者事故の顕著な増加 (時には100％に至る)

The misinterpretation of underpowered studies cost lives.
統計力の弱化した研究の誤解釈が、命を奪ったのです。

●Ch.03 Pseudoreplication: choose your data wisely 　擬似反復：データを賢く選べ

●Many studies strive to collect more data through replication:
多くの研究は、努めています｜より多くのデータを集める事を｜反復を通して｜。

by repeating their measurements with additional patients or samples,
測定を繰り返すことによって｜患者や標本を追加して｜、

they can be more certain of their numbers 　彼らは、数値により確信が持て、

and discover subtle relationships that aren’t obvious at first glance.
発見できる｜微妙な関係を｜初見でははっきりしない｜。

We’ve seen the value of additional data for improving statistical power and detecting small differences.
私達は、見てきました｜追加データの価値を｜統計力を改良し、小さな違いを検出するために｜。

But what exactly counts as a replication?
しかし、厳密には、何が反復と見なされるのでしょうか？

●Let’s return to a medical example. 　医学の例にもどりましょう。

I have two groups of 100 patients taking different medications,
あります｜患者100人の二つのグループが｜異なる薬物治療を受けている｜、

and I seek to establish which medication lowers blood pressure more.
私は、求めます｜はっきりさせようと｜どちらの薬物治療がより血圧を下げるか｜。

I have each group take the medication for a month to allow it to take effect,
私は、各グループに受けさせます｜薬物治療を｜１ヶ月間｜その効果がでるように｜、

and then I follow each group for ten days, each day testing their blood pressure.
そして、フォローします｜各グループを10日間｜毎日血圧を測定して｜。

I now have ten data points per patient and 1,000 data points per group.
あります｜各患者に10データ点、各グループに1000データ点が｜。

●Brilliant! 1,000 data points is quite a lot, 素晴らしい。1000データ点は、実に多い。

and I can fairly easily establish whether one group has lower blood pressure than the other.
私は、できます｜実に容易に｜確証が｜あるグループが他のグループより低血圧であるかどうかの｜。

When I do calculations for statistical significance I find significant results very easily.
私が、統計的有意性を計算するとき、私は、見出します｜有意な結果を非常に容易に｜。

●But wait: 　しかし、待ってください。

we expect that taking a patient’s blood pressure ten times will yield ten very similar results.
私達は、予想します｜患者の血圧を10回測ることは、10個の非常によく似た結果を生むことを｜。

If one patient is genetically predisposed to low blood pressure, I have counted his genetics ten times.
もしある患者が、遺伝的に低血圧になりやすいとすると、彼の遺伝的特質を10回数えたのです。

Had I collected data from 1,000 independent patients instead of repeatedly testing 100,
もし、私が、集めていたら｜データを｜1000人の独立した患者から｜100人テストを繰り返すのではなく｜。

I would be more confident that differences between groups came from the medicines and not from genetics and luck.
私は、より確信しているでしょう｜グループ間の違いは、薬物治療によるもので、遺伝や運によるものではないことを｜。

I claimed a large sample size, giving me statistically significant results and high statistical power,
私は、主張しました｜大きな標本サイズを｜統計的に有意な結果と高い統計力を与えてくれる｜、

but my claim is unjustified.　しかし、私の主張は、正当化されないのです。

●This problem is known as pseudoreplication, and it is quite common.
この問題は、擬似反復として知られていて、極めてありふれたものです。

After testing cells from a culture, a biologist might “replicate” his results by testing more cells from the same culture.
ある培養からの細胞をテストした後、生物学者は、「反復」するかもしれません｜彼の結果を同じ培養からのより多くの細胞をテストすることによって｜。

Neuroscientists will test multiple neurons from the same animal, incorrectly claiming they have a large sample size because they tested hundreds of neurons from just two rats.
神経科学者は、テストするでしょう｜同じ動物からの複数のニューロンを｜誤って主張しながら｜大きな標本サイズを持っていると｜たった２匹のラットから何百ものニューロンをテストしたという理由で｜。

●In statistical terms, pseudoreplication occurs when individual observations are heavily dependent on each other.
統計学の用語では、擬似反復は、発生します｜個々の観測値が互いに強く依存しているときに｜。

Your measurement of a patient’s blood pressure will be highly related to his blood pressure yesterday,
患者の血圧測定は、前日の血圧に大きく関係しているでしょう、

and your measurement of soil composition here will be highly correlated with your measurement five feet away.
あなたのここでの土壌成分の測定は、５フィート離れた測定と強く相関しているでしょう。

There are several ways to account for this dependence while performing your statistical analysis:
いくつかの方法があります｜この依存性を考慮に入れる｜統計分析を行う際に｜：

●1. Average the dependent data points. 　依存しているデータ点の平均をとる

For example, average all the blood pressure measurements taken from a single person.
例えば、平均をとる｜すべての血圧測定の｜一人の人から得られた｜。

This isn’t perfect, though; 　しかし、これは、完璧ではありません、

if you measured some patients more frequently than others, this won’t be reflected in the averaged number.
もしあなたが、ある患者を別の患者よりも頻繁に測定しても、そのことは、平均の数値には反映されません。

You want a method that somehow counts measurements as more reliable as more are taken.
あなたは、欲します｜ある方法を｜どうにかして測定をみなす｜より多くとられればより信頼が増すように｜。

●2. Analyze each dependent data point separately. 　依存しているデータ点を別個に分析する

You could perform an analysis of every patient’s blood pressure on day 5, giving you only one data point per person.
あなたは、実施できます｜各患者の５日目の血圧の解析を｜、各患者のデータ点は、一つだけとなります。

But be careful, because if you do this for every day, you’ll have problems with multiple comparisons, which we will discuss in the next chapter.
しかし、注意してください、何故なら、あなたは、これを毎日やると、多重比較の問題があるからです、このことは、次の章で議論します。

●3. Use a statistical model which accounts for the dependence, like a hierarchical model or random effects model.
依存性を考慮に入れた統計モデルを使用する、例えば、階層モデルやランダム効果モデル。

●It’s important to consider each approach before analyzing your data, as each method is suited to different situations.
重要です｜考察することが｜各方法を｜あなたのデータを解析する前に｜各方法は、異なる状況に適合しているので｜。

Pseudoreplication makes it easy to achieve significance, even though it gives you little additional information on the test subjects.
擬似反復は、有意さを達成することを容易にします、それは、被験者の情報を殆ど追加しないにもかかわらず。

Researchers must be careful not to artificially inflate their sample sizes when they retest samples.
研究者は、注意深くあらねばならない｜標本サイズを人為的に増大させないように｜標本を再テストするときに｜。
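The inflation caused by pseudoreplication can be seen in a simulation of the blood pressure example. This sketch is an illustration, not from Reinhart's text. It assumes two groups of 100 patients with no real difference between the drugs, patient baselines spread with a standard deviation of 10, day-to-day noise with a standard deviation of 3, and a simple z test. All names and numbers are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_test_p_value(a, b):
    """Two-sided p value of a z test comparing the means of two samples."""
    def mean_and_var(xs):
        m = fmean(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    mean_a, var_a = mean_and_var(a)
    mean_b, var_b = mean_and_var(b)
    z = abs(mean_a - mean_b) / sqrt(var_a / len(a) + var_b / len(b))
    return 2.0 * (1.0 - NormalDist().cdf(z))

def simulate_group(rng, n_patients=100, n_days=10):
    """Ten daily readings per patient; each patient has an own baseline."""
    group = []
    for _ in range(n_patients):
        baseline = rng.gauss(120.0, 10.0)  # differs between patients
        group.append([baseline + rng.gauss(0.0, 3.0) for _ in range(n_days)])
    return group

def false_positive_rates(n_trials=200, seed=1):
    rng = random.Random(seed)
    naive_hits = averaged_hits = 0
    for _ in range(n_trials):
        # Both groups come from the same population: neither drug is better.
        group_a, group_b = simulate_group(rng), simulate_group(rng)
        # Naive analysis: treat all 1,000 readings per group as independent.
        flat_a = [x for patient in group_a for x in patient]
        flat_b = [x for patient in group_b for x in patient]
        if z_test_p_value(flat_a, flat_b) < 0.05:
            naive_hits += 1
        # Approach 1 from the list: average each patient's readings first.
        if z_test_p_value([fmean(p) for p in group_a],
                          [fmean(p) for p in group_b]) < 0.05:
            averaged_hits += 1
    return naive_hits / n_trials, averaged_hits / n_trials

naive_rate, averaged_rate = false_positive_rates()
print(naive_rate, averaged_rate)
```

Under these assumptions the naive analysis declares a "significant" difference far more often than the nominal 5%, while averaging per patient keeps the false positive rate close to it.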

●Ch.04 The p value and the base rate fallacy　p値と基礎比率誤謬

●You’ve already seen that p values are hard to interpret.
あなたは、すでに見てきました｜ｐ値は解釈が難しいことを｜。

Getting a statistically insignificant result doesn’t mean there’s no difference.
統計的に有意でない結果を得る事は、違いがないことを意味するのではありません。

What about getting a significant result? では、有意な結果をえることはどうでしょうか？

●Let’s try an example. 一つ例を見て見ましょう。

Suppose I am testing a hundred potential cancer medications.
私が、見込みのある100種のガンの治療薬をテストしているとしましょう。

Only ten of these drugs actually work, but I don’t know which;
この薬のうち、10個だけが効果がありますが、私は、どれか知りません。

I must perform experiments to find them. 　
私は、実験をしなければなりません｜それを見つけるために｜。

In these experiments, I’ll look for p<0.05 gains over a placebo, demonstrating that the drug has a significant benefit.
これらの試験において、私は、探します｜p<0.05 の改善を｜プラシーボ(偽薬)を上回る｜、その薬に有意な利点があることを示すために。

●To illustrate, each square in this grid represents one drug.
図示のため、グリッドの各四角は、薬を示すとします。

The blue squares are the drugs that work:　青い四角は、効果がある薬です。

図

●As we saw, most trials can’t perfectly detect every good medication.
見て来たように、殆どの試験は、完璧には、すべての良い薬を検出することはできません。

We’ll assume my tests have a statistical power of 0.8.
私のテストは、統計力が0.8あると仮定しましょう。

Of the ten good drugs, I will correctly detect around eight of them, shown in purple:
10種の良薬のうち、私は、正しく検出できるでしょう｜８つぐらいは｜紫色で示される｜。

図

●Of the ninety ineffectual drugs, I will conclude that about 5 have significant effects.
90種の効果の無い薬のうち、私は、およそ５つは、有意な効果があると結論するでしょう。

Why? 　何故か？

Remember that p values are calculated under the assumption of no effect,
思い出して欲しい、ｐ値は計算されています｜効果が無いという仮定のもとに｜、

so p=0.05 means a 5% chance of falsely concluding that an ineffectual drug works.
p=0.05 は、意味します｜５％の確率を｜効果の無い薬が働くと誤って結論する｜。

●So I perform my experiments and conclude there are 13 working drugs:
そこで、私は、実施します｜実験を｜、そして、結論します｜13種の効く薬があると｜。

8 good drugs and 5 I’ve included erroneously, shown in red:
８種の良い薬と、５種｜私が誤って含めた(赤で示した)｜：

図

●The chance of any given “working” drug being truly effectual is only 62%.
得られた「効く」薬が、本当に効く確率は、たったの62％です。

If I were to randomly select a drug out of the lot of 100, run it through my tests, and discover a p<0.05 statistically significant benefit,
もし、私が、100個のロット(一山)から薬をランダムに選び、私の検定法にかけ、p<0.05 という統計的に有意な利点を発見したとしても、

there is only a 62% chance that the drug is actually effective.
62％しかありません、薬が実際に効果的であるという確率は。

In statistical terms, my false discovery rate – the fraction of statistically significant results which are really false positives – is 38%.
統計学の用語では、私の偽発見率 - 統計的に有意な結果のうち、実際には偽陽性であるものの割合 - は、38％です。
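The arithmetic behind the false discovery rate can be written out directly. This is an illustrative sketch with a hypothetical function name. It works with expected counts, so it uses 4.5 false positives where the text rounds to 5.

```python
def false_discovery_rate(n_drugs, n_effective, power, alpha):
    """Expected share of 'significant' results that are false positives."""
    true_positives = n_effective * power               # 10 * 0.8 = 8
    false_positives = (n_drugs - n_effective) * alpha  # 90 * 0.05 = 4.5
    return false_positives / (true_positives + false_positives)

fdr = false_discovery_rate(n_drugs=100, n_effective=10, power=0.8, alpha=0.05)
print(round(fdr, 2))          # 4.5 / 12.5 = 0.36
print(round(5 / (8 + 5), 2))  # with the text's rounding to 5 drugs: 0.38
```

Either way, more than a third of the drugs that pass the test do not work, even though every test used the conventional p<0.05 threshold.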

●Because the base rate of effective cancer drugs is so low – only 10% of our hundred trial drugs actually work – most of the tested drugs do not work,

and we have many opportunities for false positives.

If I had the bad fortune of possessing a truckload of completely ineffective medicines, giving a base rate of 0%, there is a 0% chance that any statistically significant result is true.

Nevertheless, I will get a p<0.05 result for 5% of the drugs in the truck.
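The arithmetic of the drug example can be sketched in a few lines. The function name and parameters below are illustrative, not from the text. Note that the text rounds 4.5 expected false positives up to 5 whole drugs, which gives 38%; the unrounded expectation is slightly lower.

```python
def false_discovery_rate(n_hypotheses, base_rate, power, alpha):
    """Expected share of significant results that are false positives."""
    n_true = n_hypotheses * base_rate        # drugs that really work
    n_null = n_hypotheses - n_true           # drugs that do nothing
    true_positives = power * n_true          # working drugs we detect
    false_positives = alpha * n_null         # useless drugs that pass by chance
    return false_positives / (true_positives + false_positives)

fdr_exact = false_discovery_rate(100, 0.10, 0.8, 0.05)    # 4.5 / (8 + 4.5)
fdr_rounded = 5 / (8 + 5)                                  # the text's whole-drug counts
fdr_truckload = false_discovery_rate(100, 0.0, 0.8, 0.05)  # base rate of 0%

print(f"unrounded: {fdr_exact:.0%}, rounded counts: {fdr_rounded:.0%}, "
      f"truckload of duds: {fdr_truckload:.0%}")
```

With a base rate of 0% every significant result is false, which is the truckload case described above.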

●You often hear people quoting p values as a sign that error is unlikely.

“There’s only a 1 in 10,000 chance this result arose as a statistical fluke,” they say, because they got p=0.0001.

No! This ignores the base rate, and is called the base rate fallacy.

Remember how p values are defined:

The P value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.

●A p value is calculated under the assumption that the medication does not work and tells us the probability of obtaining the data we did, or data more extreme than it.

It does not tell us the chance the medication is effective.

●When someone uses their p values to say they’re probably right, remember this.

Their study’s probability of error is almost certainly much higher.

In fields where most tested hypotheses are false, like early drug trials (most early drugs don’t make it through trials), it’s likely that most “statistically significant” results with p<0.05 are actually flukes.

●One good example is medical diagnostic tests.

●The base rate fallacy in medical testing

●There has been some controversy over the use of mammograms in screening breast cancer. Some argue that the dangers of false positive results, such as unnecessary biopsies, surgery and chemotherapy, outweigh the benefits of early cancer detection. This is a statistical question. Let’s evaluate it.

●Suppose 0.8% of women who get mammograms have breast cancer. In 90% of women with breast cancer, the mammogram will correctly detect it. (That’s the statistical power of the test. This is an estimate, since it’s hard to tell how many cancers are missed if we don’t know they’re there.) However, among women with no breast cancer at all, about 7% will get a positive reading on the mammogram, leading to further tests and biopsies and so on. If you get a positive mammogram result, what are the chances you have breast cancer?

●Ignoring the chance that you, the reader, are male,[1] the answer is 9%.35

●Despite the test only giving false positives for 7% of cancer-free women, analogous to testing for p<0.07, 91% of positive tests are false positives.

●How did I calculate this? It’s the same method as the cancer drug example. Imagine 1,000 randomly selected women who choose to get mammograms. Eight of them (0.8%) have breast cancer. The mammogram correctly detects 90% of breast cancer cases, so about seven of the eight women will have their cancer discovered. However, there are 992 women without breast cancer, and 7% will get a false positive reading on their mammograms, giving us 70 women incorrectly told they have cancer.

In total, we have 77 women with positive mammograms, 7 of whom actually have breast cancer. Only 9% of women with positive mammograms have breast cancer.
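The same count can be written as a short calculation. This is a minimal sketch using only the three percentages given in the text; the function and parameter names are my own.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability of having the disease given a positive test."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# 0.8% prevalence, 90% detection rate, 7% false positives among healthy women.
ppv = positive_predictive_value(0.008, 0.90, 0.07)
print(f"chance of cancer given a positive mammogram: {ppv:.1%}")
```

Without rounding to whole women the result is a little over 9%, in line with the figure quoted in the text.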

●If you administer questions like this one to statistics students and scientific methodology instructors, more than a third fail.35 If you ask doctors, two thirds fail.10 They erroneously conclude that a p<0.05 result implies a 95% chance that the result is true – but as you can see in these examples, the likelihood of a positive result being true depends on what proportion of hypotheses tested are true. And we are very fortunate that only a small proportion of women have breast cancer at any given time.

●Examine introductory statistical textbooks and you will often find the same error. P values are counterintuitive, and the base rate fallacy is everywhere.

●Taking up arms against the base rate fallacy　　

●You don’t have to be performing advanced cancer research or early cancer screenings to run into the base rate fallacy. What if you’re doing social research? You’d like to survey Americans to find out how often they use guns in self-defense. Gun control arguments, after all, center on the right to self-defense, so it’s important to determine whether guns are commonly used for defense and whether that use outweighs the downsides, such as homicides.

●One way to gather this data would be through a survey. You could ask a representative sample of Americans whether they own guns and, if so, whether they’ve used the guns to defend their homes in burglaries or defend themselves from being mugged. You could compare these numbers to law enforcement statistics of gun use in homicides and make an informed decision about whether the benefits outweigh the downsides.

●Such surveys have been done, with interesting results. One 1992 telephone survey estimated that American civilians use guns in self-defense up to 2.5 million times every year – that is, about 1% of American adults have defended themselves with firearms. Now, 34% of these cases were in burglaries, giving us 845,000 burglaries stymied by gun owners. But in 1992, there were only 1.3 million burglaries committed while someone was at home. Two thirds of these occurred while the homeowners were asleep and were discovered only after the burglar had left. That leaves 430,000 burglaries involving homeowners who were at home and awake to confront the burglar – 845,000 of which, we are led to believe, were stymied by gun-toting residents.28

●Whoops.
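The mismatch can be checked from the rounded figures quoted above. The variable names are illustrative, and because the inputs are rounded the totals differ slightly from the 845,000 and 430,000 given in the text.

```python
# Rounded 1992 figures as quoted in the text.
claimed_defensive_uses = 2_500_000        # survey estimate of defensive gun uses per year
share_in_burglaries_pct = 34              # percent of those uses reported in burglaries
burglaries_with_someone_home = 1_300_000  # burglaries committed while someone was home

# Defensive uses in burglaries implied by the survey.
claimed_stymied = claimed_defensive_uses * share_in_burglaries_pct // 100

# Two thirds happened while the occupants slept, so one third remain.
awake_burglaries = burglaries_with_someone_home // 3

ratio = claimed_stymied / awake_burglaries
print(f"claimed: {claimed_stymied:,}  possible: {awake_burglaries:,}  ratio: {ratio:.2f}")
```

The survey implies roughly twice as many thwarted burglaries as there were burglaries in which anyone was awake to confront the burglar.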

●What happened? Why did the survey overestimate the use of guns in self-defense? Well, for the same reason that mammograms overestimate the incidence of breast cancer: there are far more opportunities for false positives than false negatives. If 99.9% of people have never used a gun in self-defense, but 1% of those people will answer “yes” to any question for fun, and 1% want to look manlier, and 1% misunderstand the question, then you’ll end up vastly overestimating the use of guns in self-defense.

●What about false negatives? Could this effect be balanced by people who say “no” even though they gunned down a mugger last week? No. If very few people genuinely use a gun in self-defense, then there are very few opportunities for false negatives. They’re overwhelmed by the false positives.

●This is exactly analogous to the cancer drug example earlier. Here, p is the probability that someone will falsely claim they’ve used a gun in self-defense. Even if p is small, your final answer will be wildly wrong.
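A rough sketch of the survey arithmetic follows. It assumes the hypothetical numbers from the paragraphs above: a true rate of 0.1%, and three non-overlapping groups of 1% each who wrongly answer "yes". The 50% false negative rate in the second call is my own deliberately extreme assumption.

```python
def observed_yes_rate(true_rate, false_yes_rate, false_no_rate=0.0):
    """Share of respondents answering 'yes', given both kinds of misreporting."""
    honest_yes = true_rate * (1 - false_no_rate)     # real users who admit it
    false_yes = (1 - true_rate) * false_yes_rate     # non-users who say yes anyway
    return honest_yes + false_yes

true_rate = 0.001                      # 0.1% genuinely used a gun in self-defense
observed = observed_yes_rate(true_rate, false_yes_rate=0.03)
observed_with_denials = observed_yes_rate(true_rate, 0.03, false_no_rate=0.5)

print(f"observed: {observed:.2%}, overestimate factor: {observed / true_rate:.0f}x")
print(f"even if half of real users deny it: {observed_with_denials:.2%}")
```

Even when half of the genuine users answer "no", the estimate barely moves, because there are so few genuine users to begin with.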

●To lower p, criminologists make use of more detailed surveys. The National Crime Victimization Survey, for instance, uses detailed sit-down interviews with researchers where respondents are asked for details about crimes and their use of guns in self-defense. With far greater detail in the survey, researchers can better judge whether the incident meets their criteria for self-defense. The resulting estimates are far smaller – something like 65,000 incidents per year, not millions. There’s a chance that survey respondents underreport such incidents, but a much smaller chance of massive overestimation.

●If at first you don’t succeed, try, try again　

●The base rate fallacy shows us that false positives are much more likely than you’d expect from a p<0.05 criterion for significance. Most modern research doesn’t make one significance test, however; modern studies compare the effects of a variety of factors, seeking to find those with the most significant effects.

●For example, imagine testing whether jelly beans cause acne by testing the effect of every single jelly bean color on acne:

図

●As you can see, making multiple comparisons means multiple chances for a false positive. For example, if I test 20 jelly bean colors which do not cause acne at all, and look for a correlation at p<0.05 significance, I have a 64% chance of at least one false positive result.54 If I test 45 materials, the chance of at least one false positive is as high as 90%.
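These percentages follow from a one-line formula, sketched here under the assumption that the tests are independent of one another and that none of the tested effects is real. The function name is illustrative.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_tests

chance_20 = chance_of_any_false_positive(20)
chance_45 = chance_of_any_false_positive(45)
print(f"20 tests: {chance_20:.0%}, 45 tests: {chance_45:.0%}")
```

Each test has a 95% chance of correctly finding nothing, and the chance that all of them do so shrinks quickly as the number of tests grows.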

●It’s easy to make multiple comparisons, and it doesn’t have to be as obvious as testing twenty potential medicines. Track the symptoms of a dozen patients for a dozen weeks and test for significant benefits during any of those weeks: bam, that’s twelve comparisons. Check for the occurrence of twenty-three potential dangerous side effects: alas, you have sinned. Send out a ten-page survey asking about nuclear power plant proximity, milk consumption, age, number of male cousins, favorite pizza topping, current sock color, and a few dozen other factors for good measure, and you’ll find that something causes cancer. Ask enough questions and it’s inevitable.

●A survey of medical trials in the 1980s found that the average trial made 30 therapeutic comparisons. In more than half of the trials, the researchers had made so many comparisons that a false positive was highly likely, and the statistically significant results they did report were cast into doubt: they may have found a statistically significant effect, but it could just as easily have been a false positive.54

●There exist techniques to correct for multiple comparisons. For example, the Bonferroni correction method says that if you make n comparisons in the trial, your criterion for significance should be p<0.05/n. This lowers the chances of a false positive to what you’d see from making only one comparison at p<0.05. However, as you can imagine, this reduces statistical power, since you’re demanding much stronger correlations before you conclude they’re statistically significant. It’s a difficult tradeoff, and tragically few papers even consider it.

●Red herrings in brain imaging　

●Neuroscientists do massive numbers of comparisons regularly. They often perform fMRI studies, where a three-dimensional image of the brain is taken before and after the subject performs some task. The images show blood flow in the brain, revealing which parts of the brain are most active when a person performs different tasks.

●But how do you decide which regions of the brain are active during the task? A simple method is to divide the brain image into small cubes called voxels. A voxel in the “before” image is compared to the voxel in the “after” image, and if the difference in blood flow is significant, you conclude that part of the brain was involved in the task. Trouble is, there are thousands of voxels to compare and many opportunities for false positives.
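To get a feel for the scale of the problem, here is a rough sketch. The voxel count of 50,000 and the per-voxel threshold of p<0.001 are hypothetical values chosen for illustration, and the voxels are treated as independent, which real brain images are not.

```python
def expected_false_positives(n_voxels, threshold):
    """Average number of inactive voxels that pass the threshold by chance."""
    return n_voxels * threshold

def chance_of_any_false_voxel(n_voxels, threshold):
    """Probability that at least one inactive voxel passes by chance."""
    return 1 - (1 - threshold) ** n_voxels

n_voxels = 50_000      # hypothetical number of voxels in one brain image
threshold = 0.001      # hypothetical per-voxel significance threshold

expected = expected_false_positives(n_voxels, threshold)
chance = chance_of_any_false_voxel(n_voxels, threshold)
print(f"expected false voxels: {expected:.0f}, chance of at least one: {chance:.4f}")
```

Under these assumptions a brain with no real activity at all would still show dozens of "active" voxels.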

●One study, for instance, tested the effects of an “open-ended mentalizing task” on participants. Subjects were shown “a series of photographs depicting human individuals in social situations with a specified emotional valence,” and asked to “determine what emotion the individual in the photo must have been experiencing.” You can imagine how various emotional and logical centers of the brain would light up during this test.

●The data was analyzed, and certain brain regions were found to change activity during the task. Comparison of images made before and after the mentalizing task showed a p=0.001 difference in an 81 mm³ cluster in the brain.

●The study participants? Not college undergraduates paid $10 for their time, as is usual. No, the test subject was one 3.8-pound Atlantic salmon, which “was not alive at the time of scanning.”8

●Of course, most neuroscience studies are more sophisticated than this; there are methods of looking for clusters of voxels which all change together, along with techniques for controlling the rate of false positives even when thousands of statistical tests are made. These methods are now widespread in the neuroscience literature, and few papers make such simple errors as I described. Unfortunately, almost every paper tackles the problem differently; a review of 241 fMRI studies found that they performed 223 unique analysis strategies, which, as we will discuss later, gives the researchers great flexibility to achieve statistically significant results.

●Controlling the false discovery rate　

●I mentioned earlier that techniques exist to correct for multiple comparisons. The Bonferroni procedure, for instance, says that you can get the right false positive rate by looking for p<0.05/n, where n is the number of statistical tests you’re performing. If you perform a study which makes twenty comparisons, you can use a threshold of p<0.0025 to be assured that there is only a 5% chance you will falsely decide a nonexistent effect is statistically significant.
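The twenty-comparison example can be verified directly. This sketch assumes the twenty tests are independent and that no real effect exists; the function names are my own.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p threshold under the Bonferroni correction."""
    return alpha / n_tests

def familywise_error_rate(n_tests, per_test_threshold):
    """Chance of at least one false positive among n independent null tests."""
    return 1 - (1 - per_test_threshold) ** n_tests

threshold = bonferroni_threshold(20)
corrected = familywise_error_rate(20, threshold)
uncorrected = familywise_error_rate(20, 0.05)
print(f"threshold: {threshold}, corrected: {corrected:.3f}, uncorrected: {uncorrected:.3f}")
```

The corrected rate comes out just under 5%, compared with about 64% when each of the twenty tests uses p<0.05.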

●This has drawbacks. By lowering the p threshold required to declare a result statistically significant, you decrease your statistical power greatly, and fail to detect true effects as well as false ones. There are more sophisticated procedures than the Bonferroni correction which take advantage of certain statistical properties of the problem to improve the statistical power, but they are not magic solutions.
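The loss of power can be illustrated with a simple one-sided z-test. The test and the effect size of 2.5 standard errors are assumptions chosen for illustration, picked so that the uncorrected power is close to the 0.8 used earlier in the chapter.

```python
from statistics import NormalDist

def one_sided_power(effect_in_standard_errors, threshold):
    """Power of a one-sided z-test for a true effect of the given size."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - threshold)   # z needed to reach p < threshold
    return 1 - std_normal.cdf(critical_z - effect_in_standard_errors)

effect = 2.5                                        # hypothetical true effect size
power_single = one_sided_power(effect, 0.05)        # one test at p < 0.05
power_bonferroni = one_sided_power(effect, 0.05 / 20)   # 20 tests, Bonferroni

print(f"power at p<0.05: {power_single:.2f}, at p<0.0025: {power_bonferroni:.2f}")
```

In this setting the correction cuts the power roughly in half: the same real effect that would be detected about 80% of the time is now detected less than 40% of the time.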

●Worse, they don’t spare you from the base rate fallacy. You can still be misled by your p threshold and falsely claim there’s “only a 5% chance I’m wrong” – you just eliminate some of the false positives. A scientist is more interested in the false discovery rate: what fraction of my statistically significant results are false positives? Is there a statistical test that will let me control this fraction?

●For many years the answer was simply “no.” As you saw in the section on the base rate fallacy, we can compute the false discovery rate if we make an assumption about how many of our tested hypotheses are true – but we would rather find that out from the data than guess.

●In 1995, Benjamini and Hochberg provided a better answer. They devised an exceptionally simple procedure which tells you which p values to consider statistically significant. I’ve been saving you from mathematical details so far, but to illustrate just how simple the procedure is, here it is:

1. Perform your statistical tests and get the p value for each. Make a list and sort it in ascending order.

2. Choose a false-discovery rate and call it q. Call the number of statistical tests m.

3. Find the largest p value such that p≤iq/m, where i is the p value’s place in the sorted list.

4. Call that p value and all smaller than it statistically significant.

●You’re done! The procedure guarantees that, on average, no more than a fraction q of all statistically significant results will be false positives.7
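The four steps translate almost line for line into code. The following is a minimal sketch of the procedure as described in the list, not a replacement for a tested statistics library; the function name and the example p values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where a p value is declared significant."""
    m = len(p_values)
    # Step 1: sort ascending, remembering each p value's original position.
    order = sorted(range(m), key=lambda idx: p_values[idx])
    # Step 3: find the largest rank i (counting from 1) with p <= i*q/m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Step 4: that p value and every smaller one are significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

p_values = [0.035, 0.001, 0.5, 0.028, 0.025]   # made-up results of five tests
result = benjamini_hochberg(p_values, q=0.05)
print(result)
```

In this example 0.025 is declared significant even though it misses its own cutoff of 2q/m = 0.02, because a larger p value further down the list (0.035, against a cutoff of 0.04) passes. A Bonferroni threshold of 0.05/5 = 0.01 would have kept only the 0.001 result.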

●The Benjamini-Hochberg procedure is fast and effective, and it has been widely adopted by statisticians and scientists in certain fields. It usually provides better statistical power than the Bonferroni correction and friends while giving more intuitive results. It can be applied in many different situations, and variations on the procedure provide better statistical power when testing certain kinds of data.

●Of course, it’s not perfect. In certain strange situations, the Benjamini-Hochberg procedure gives silly results, and it has been mathematically shown that it is always possible to beat it in controlling the false discovery rate. But it’s a start, and it’s much better than nothing.