Weighted false discovery rate controlling procedures for clinical trials

Yoav Benjamini, Rami Cohen


Having identified that the lack of replicability of results in earlier phases of clinical medical research stems largely from unattended selective inference, we offer a new hierarchical weighted false discovery rate controlling testing procedure alongside the single-level weighted procedure. These address the special structure of clinical research, where the comparisons of treatments involve both primary and secondary endpoints, by assigning weights that reflect the relative importance of the endpoints in the error being controlled. In the hierarchical method, the primary endpoints and a properly weighted intersection hypothesis that represents all secondary endpoints are tested. Should the intersection hypothesis be among the rejected, individual secondary endpoints are tested. We identify configurations where each of the two procedures has the advantage. Both offer higher power than competing hierarchical (gatekeeper) familywise error-rate controlling procedures being used for drug approval. By their design, the advantage of the proposed methods is the increased power to discover effects on secondary endpoints, without giving up the rigor of addressing their multiplicity.



Replicability analysis for genome-wide association studies

R. Heller, D. Yekutieli
The Annals of Applied Statistics, 8(1): 481–498.


The paramount importance of replicating associations is well recognized in the genome-wide association (GWA) research community, yet methods for assessing the replicability of associations are scarce. Published GWA studies often report the meta-analysis of the primary studies separately from that of the follow-up studies; informally, the two separate meta-analyses give a sense of the replicability of the results.
We suggest a formal empirical Bayes approach for discovering whether results have been replicated across studies, in which we estimate the optimal rejection region for discovering replicated results. We demonstrate, using realistic simulations, that the average false discovery proportion of our method remains small. We apply our method to six type 2 diabetes (T2D) GWA studies. Out of 803 SNPs discovered to be associated with T2D using a typical meta-analysis, we discovered 219 SNPs with replicated associations with T2D. We recommend complementing a meta-analysis with a replicability analysis for GWA studies.

Selective inference on multiple families of hypotheses

Y. Benjamini, M. Bogomolov
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
Volume 76, Issue 1, pages 297–318, January 2014


In many complex multiple-testing problems the hypotheses are divided into families. Given the data, families with evidence for true discoveries are selected, and hypotheses within them are tested. Neither controlling the error rate in each family separately nor controlling the error rate over all hypotheses together can assure some level of confidence about the filtration of errors within the selected families. We formulate this concern about selective inference in its generality, for a very wide class of error rates and for any selection criterion, and present an adjustment of the testing level inside the selected families that retains control of the expected average error over the selected families.

Revisiting multi-subject random effects in fMRI: Advocating prevalence estimation

J.D. Rosenblatt, M. Vink, Y. Benjamini
NeuroImage. Volume 84, 1 January 2014, Pages 113–121


Random effect analysis has been introduced into fMRI research in order to generalize findings from the study group to the whole population. Generalizing findings is obviously harder than detecting activation within the study group since in order to be significant, an activation has to be larger than the inter-subject variability. Indeed, detected regions are smaller when using random effect analysis versus fixed effects. The statistical assumptions behind the classic random effect model are that the effect in each location is normally distributed over subjects, and “activation” refers to a non-null mean effect. We argue that this model is unrealistic compared to the true population variability, where due to function–anatomy inconsistencies and registration anomalies, some of the subjects are active and some are not at each brain location. 
We propose a Gaussian-mixture-random-effect that amortizes between-subject spatial disagreement and quantifies it using the prevalence of activation at each location. We present a formal definition and an estimation procedure of this prevalence. The end result of the proposed analysis is a map of the prevalence at locations with significant activation, highlighting activation regions that are common over many brains.
Prevalence estimation has several desirable properties: (a) It is more informative than the typical active/inactive paradigm. (b) In contrast to the usual display of p-values in activated regions – which trivially converge to 0 for large sample sizes – prevalence estimates converge to the true prevalence.

Discovering findings that replicate from a primary study of high dimension to a follow-up study

M. Bogomolov, R. Heller
Journal of the American Statistical Association, Volume 108, Issue 504, December 2013, pages 1480-1492


We consider the problem of identifying whether findings replicate from one study of high dimension to another, when the primary study guides the selection of hypotheses to be examined in the follow-up study as well as when there is no division of roles into the primary and the follow-up study. We show that existing meta-analysis methods are not appropriate for this problem, and suggest novel methods instead. We prove that our multiple testing procedures control for appropriate error-rates.
The suggested FWER controlling procedure is valid for arbitrary dependence among the test statistics within each study. A more powerful procedure is suggested for FDR control. We prove that this procedure controls the FDR if the test statistics are independent within the primary study, and independent or have dependence of type PRDS in the follow-up study. For arbitrary dependence within the primary study, and either arbitrary dependence or dependence of type PRDS in the follow-up study, simple conservative modifications of the procedure control the FDR. We demonstrate the usefulness of these procedures via simulations and real data examples.

Selection Adjusted Confidence Intervals With More Power to Determine the Sign

A. Weinstein, W. Fithian, Y. Benjamini
Journal of the American Statistical Association
Volume 108, Issue 501, March 2013, pages 165-176


In many current large-scale problems, confidence intervals (CIs) are constructed only for the parameters that are large, as indicated by their estimators, ignoring the smaller parameters. Such selective inference poses a problem to the usual marginal CIs that no longer offer the right level of coverage, not even on the average over the selected parameters. We address this problem by developing three methods to construct short and valid CIs for the location parameter of a symmetric unimodal distribution, while conditioning on its estimator being larger than some constant threshold.
In two of these methods, the CI is further required to offer early sign determination, that is, to avoid including parameters of both signs for relatively small values of the estimator. One of the two, the Conditional Quasi-Conventional CI, offers a good balance between length and sign determination while protecting from the effect of selection. The CI is not symmetric, extending more toward 0 than away from it, nor is it of constant shape. However, when the estimator is far away from the threshold, the proposed CI tends to the usual marginal one. In spite of its complexity, it is specified by closed form expressions, up to a small set of constants that are each the solution of a single variable equation.
When multiple testing procedures are used to control the false discovery rate or other error rates, the resulting threshold for selecting may be data dependent. We show that conditioning the above CIs on the data-dependent threshold still offers false coverage-statement rate (FCR) for many widely used testing procedures. For these reasons, the conditional CIs for the parameters selected this way are an attractive alternative to the available general FCR adjusted intervals. We demonstrate the use of the method in the analysis of some 14,000 correlations between hormone change and brain activity change in response to the subjects being exposed to stressful movie clips. Supplementary materials for this article are available online.

Measuring behavior of animal models: faults and remedies

E. Fonio, I. Golani, Y. Benjamini
Nature Methods. 2012 Dec;9(12):1167-70. doi: 10.1038/nmeth.2252.


Widely used behavioral assays need re-evaluation and validation against their intended use. We focus here on measures of chronic anxiety in mouse models and posit that widely used assays such as the open-field test are performed at the wrong time, for inadequate durations and using inappropriate mouse strains. We propose that behavioral assays be screened for usefulness on the basis of their replicability across laboratories.

Ten ways to improve the quality of descriptions of whole-animal movement

Y. Benjamini, D. Lipkind, G. Horev, E. Fonio, N. Kafkafi, I. Golani
Neuroscience & Biobehavioral Reviews: Volume 34, Issue 8, July 2010, Pages 1351–1365


The demand for replicability of behavioral results across laboratories is viewed as a burden in behavior genetics. We demonstrate how it can become an asset offering a quantitative criterion that guides the design of better ways to describe behavior.
Passing the high benchmark dictated by the replicability demand requires less stressful and less restraining experimental setups, less noisy data, individually customized cutoff points between the building blocks of movement, and less variable yet discriminative dynamic representations that would capture more faithfully the nature of the behavior, unmasking similarities and differences and revealing novel animal-centered measures.
Here we review ten tools that enhance replicability without compromising discrimination. While we demonstrate the usefulness of these tools in the context of inbred mouse exploratory behavior, they can readily be used in any study involving a high-resolution analysis of spatial behavior. Viewing replicability as a design concept and using the ten methodological improvements may prove useful in many fields not necessarily related to spatial behavior.

Screening for Partial Conjunction Hypotheses

Y. Benjamini, R. Heller
Biometrics, 64(4): 1215–1222, December 2008. Published online February 6, 2008. doi: 10.1111/j.1541-0420.2007.00984.x


We consider the problem of testing for partial conjunction of hypotheses, which argues that at least u out of n tested hypotheses are false. It offers an in-between approach to the testing of the conjunction of null hypotheses against the alternative that at least one is not, and the testing of the disjunction of null hypotheses against the alternative that all hypotheses are not null. We suggest powerful test statistics for testing such a partial conjunction hypothesis that are valid under dependence between the test statistics as well as under independence.
We then address the problem of testing many partial conjunction hypotheses simultaneously using the false discovery rate (FDR) approach. We prove that if the FDR controlling procedure in Benjamini and Hochberg (1995, Journal of the Royal Statistical Society, Series B 57, 289-300) is used for this purpose the FDR is controlled under various dependency structures. Moreover, we can screen at all levels simultaneously in order to display the findings on a superimposed map and still control an appropriate FDR measure. We apply the method to examples from microarray analysis and functional magnetic resonance imaging (fMRI), two application areas where the need for partial conjunction analysis has been identified.
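One family of test statistics suggested for this problem applies Simes' test to the n − u + 1 largest p-values. The following is an illustrative Python sketch of that idea, under the assumption of a Simes-based combining rule; the function name and interface are ours, not the paper's:

```python
def partial_conjunction_pvalue(pvals, u):
    """Simes-based partial conjunction p-value: apply Simes' test to
    the n-u+1 largest of the n p-values, testing the null hypothesis
    that fewer than u of the n tested hypotheses are false."""
    p = sorted(pvals)
    n = len(p)
    tail = p[u - 1:]                      # p_(u), ..., p_(n)
    k = len(tail)                         # k = n - u + 1
    # Simes p-value for the tail set: min over i of k * q_(i) / i
    return min(k * q / (i + 1) for i, q in enumerate(tail))
```

With u = 1 this reduces to the ordinary Simes global null test over all n p-values; larger u makes the statistic depend only on the weaker evidence, as the partial conjunction null requires.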

Hierarchical False Discovery Rate–Controlling Methodology

D. Yekutieli
Journal of the American Statistical Association
Volume 103, Issue 481, March 2008, pages 309-316


We discuss methodology for controlling the false discovery rate (FDR) in complex large-scale studies that involve testing multiple families of hypotheses; the tested hypotheses are arranged in a tree of disjoint subfamilies, and the subfamilies of hypotheses are hierarchically tested by the Benjamini and Hochberg FDR-controlling (BH) procedure. We derive an approximation for the multiple family FDR for independently distributed test statistics: q, the level at which the BH procedure is applied, times the number of families tested plus the number of discoveries, divided by the number of discoveries plus 1. We provide a universal bound for the FDR of the discoveries in the new hierarchical testing approach, 2 × 1.44 × q, and demonstrate in simulations that when the data has an hierarchical structure the new testing approach can be considerably more powerful than the BH procedure.
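The approximation and the universal bound stated in the abstract are simple closed forms; a minimal Python sketch (function names are ours, for illustration only):

```python
def multiple_family_fdr_approx(q, n_families_tested, n_discoveries):
    """Approximation of the multiple-family FDR for hierarchical BH
    testing with independent test statistics, as described above:
    q * (#families tested + #discoveries) / (#discoveries + 1)."""
    return q * (n_families_tested + n_discoveries) / (n_discoveries + 1)

def universal_fdr_bound(q):
    """Universal bound on the FDR of the hierarchical testing
    approach: 2 * 1.44 * q."""
    return 2 * 1.44 * q
```

For example, testing 3 families at q = 0.05 and making 9 discoveries gives an approximate multiple-family FDR of 0.05 × 12/10 = 0.06, still well under the universal bound of 0.144.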

False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters

Y. Benjamini, D. Yekutieli
Journal of the American Statistical Association
Volume 100, Issue 469, 2005, pages 71-81


Often in applied research, confidence intervals (CIs) are constructed or reported only for parameters selected after viewing the data. We show that such selected intervals fail to provide the assumed coverage probability. By generalizing the false discovery rate (FDR) approach from multiple testing to selected multiple CIs, we suggest the false coverage-statement rate (FCR) as a measure of interval coverage following selection. A general procedure is then introduced, offering FCR control at level q under any selection rule.
The procedure constructs a marginal CI for each selected parameter, but instead of the confidence level 1 − q being used marginally, q is divided by the number of parameters considered and multiplied by the number selected. If we further use the FDR controlling testing procedure of Benjamini and Hochberg for selecting the parameters, the newly suggested procedure offers CIs that are dual to the testing procedure and are shown to be optimal in the independent case. Under the positive regression dependency condition of Benjamini and Yekutieli, the FCR is controlled for one-sided tests and CIs, as well as for a modification for two-sided testing. Results for general dependency are also given. Finally, using the equivalence of the CIs to testing, we prove that the procedure of Benjamini and Hochberg offers directional FDR control as conjectured.
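As a concrete sketch of the adjusted-level construction, assuming Gaussian estimators with known standard errors (the function and variable names are ours, for illustration):

```python
from statistics import NormalDist

def fcr_adjusted_cis(estimates, std_errs, selected, q=0.05):
    """FCR-adjusted marginal CIs: for each of the R selected
    parameters out of the m considered, build the usual two-sided
    Gaussian CI, but at miscoverage level q * R / m rather than q."""
    m = len(estimates)
    R = sum(selected)
    adj = q * R / m                        # adjusted miscoverage level
    z = NormalDist().inv_cdf(1 - adj / 2)  # wider than the marginal z
    return [(est - z * se, est + z * se)
            for est, se, sel in zip(estimates, std_errs, selected)
            if sel]
```

Note that the fewer parameters are selected, the smaller q·R/m is and the wider each interval becomes, which is what pays for the selection.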

Genotype–environment interactions in mouse behavior: A way out of the problem

N. Kafkafi, Y. Benjamini, A. Sakov, G. I. Elmer, I. Golani
Biological Sciences - Neuroscience
PNAS 2005, 102(12): 4619-4624; published ahead of print March 11, 2005. doi: 10.1073/pnas.0409554102


In behavior genetics, behavioral patterns of mouse genotypes, such as inbred strains, crosses, and knockouts, are characterized and compared to associate them with particular gene loci. Such genotype differences, however, are usually established in single-laboratory experiments, and questions have been raised regarding the replicability of the results in other laboratories. A recent multilaboratory experiment found significant laboratory effects and genotype × laboratory interactions even after rigorous standardization, raising the concern that results are idiosyncratic to a particular laboratory. This finding may be regarded by some critics as a serious shortcoming in behavior genetics. 
A different strategy is offered here: (i) Recognize that even after investing much effort in identifying and eliminating causes for laboratory differences, genotype × laboratory interaction is an unavoidable fact of life. (ii) Incorporate this understanding into the statistical analysis of multilaboratory experiments using the mixed model. Such a statistical approach sets a higher benchmark for finding significant genotype differences. (iii) Develop behavioral assays and endpoints that are able to discriminate genetic differences even over the background of the interaction. (iv) Use the publicly available multilaboratory results in single-laboratory experiments. We use the software-based strategy for exploring exploration (SEE) to analyze the open-field behavior of eight genotypes across three laboratories. Our results demonstrate that replicable behavioral measures can be practically established. Even though we address the replicability problem in behavioral genetics, our strategy is also applicable in other areas where concern about replicability has been raised.

Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Y. Benjamini, Y. Hochberg
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
Vol. 57, No. 1 (1995), pp. 289-300


The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses: the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
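The step-up procedure itself is short: sort the m p-values, find the largest i with p_(i) ≤ (i/m)·q, and reject the i hypotheses with the smallest p-values. A minimal NumPy sketch (illustrative code, not the authors'):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up procedure: reject the hypotheses with the k smallest
    p-values, where k = max{ i : p_(i) <= (i/m) * q }.  Controls the
    FDR at level q for independent test statistics."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    thresholds = q * np.arange(1, m + 1) / m
    below = np.nonzero(sorted_p <= thresholds)[0]
    rejected = np.zeros(m, dtype=bool)
    if below.size:
        k = below[-1]                    # largest i with p_(i) <= i*q/m
        rejected[order[:k + 1]] = True   # reject all smaller p-values too
    return rejected
```

Note the step-up character: a p-value above its own threshold can still be rejected if some larger p-value falls below its threshold, which is where the power gain over Bonferroni-style step-down rules comes from.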