Review article
Recognizing Statistical Problems in Reports of Clinical Trials: a Readers’ Aid
Part 33 of a series on evaluation of scientific publications
Background: Readers of clinical trial reports should be able to critically evaluate the design, results, and conclusions of the trial. There are internationally accepted guidelines that define methodological standards for trial planning, statistical methods, and the display and interpretation of the results. Publications may nonetheless contain erroneous findings and interpretations.
Methods: Statistical errors can arise in the planning of a trial, in the analysis and display of the results, and in the interpretation of p-values and treatment effects, both in experimental and in observational clinical trials. A useful aid for readers of medical publications should describe the potential statistical problems without complex theoretical background information. With this aim, we discuss certain major types of statistical error that the reader should be familiar with in order to interpret the conclusions of such publications more easily.
Results: Statistical errors can arise at an early stage, through the choice of the wrong question to be addressed or the wrong population to be analyzed; such errors inevitably have consequences. Before the start of any clinical trial, a primary endpoint must be defined, the sample size must be calculated, and the trial must be appropriately registered (among other requirements). With regard to the analysis, readers should, for example, take into account whether a statistical analysis plan with an intention-to-treat analysis existed for the study in question. They must be able to recognize erroneous methods of displaying and comparing data, as well as confounding and incorrect interpretations of p-values, and should take these problems into account when interpreting the findings. The problem of invalid causal inferences is not restricted to observational studies.
Conclusions: Statistical errors do, indeed, arise. They should be detected as early as possible in various test instances. Nonetheless, readers should be able to judge independently whether the published clinical trial reflects meticulous and correct trial planning, appropriate display of the trial’s results, and a proper, reasoned interpretation of the findings. The published checklists are a good aid for this purpose.


In clinical research, both experimental and observational studies play a decisive role in the development of new drugs and treatment methods. Methodological standards have been established for the conduct of studies and the creation of study protocols; these include, for example, ICH-E9 (1) and SPIRIT (2) for randomized clinical trials (RCTs) and SPIROS (3) for observational studies. In addition, there are standards for reporting clinical trials, such as CONSORT for RCTs (4) and STROBE for observational studies (5). In spite of this, many clinical studies are of poor quality and contain statistical errors that significantly affect the interpretation of the results and make it difficult for readers to evaluate these publications.
Some errors arise as early as the planning stage of a clinical trial. Readers cannot change this, but it is still important that they are able to recognize such errors or missing quality features, as this enables them to evaluate the validity of the study results appropriately. Furthermore, reports of clinical trials are difficult to evaluate if statistical methods are misunderstood or findings are displayed incompletely, so that wrong conclusions are drawn. To enable readers of clinical trial reports to recognize good statistical study quality, we summarize in this article some typical statistical errors, along with important aspects of studies that require special attention when reading.
Methods
Our article focuses on clinical intervention studies, which evaluate specific treatments in comparison with a control group according to a pre-specified study protocol. In this category, we include RCTs and observational intervention studies (6), as most statistical principles apply to both study types. We address important errors and aspects to enable readers to better evaluate the conclusions drawn in publications. This article does not provide a comprehensive overview of all relevant aspects of clinical studies, as these are described in detail in the literature; for example, we do not re-emphasize the importance of randomization and blinding or of a precise endpoint definition. Rather, our aim is to raise awareness of pitfalls in the planning, evaluation, and interpretation of clinical trials that we encounter in our daily work as biometricians.
Common study planning errors
The research question and hypotheses do not match the study’s aim
Underlying each study is a clinical question, which needs to be pre-specified to prevent data-driven analyses. It is particularly important to define a primary endpoint at a clinically relevant point in time for which a confirmatory conclusion, i.e. a decision regarding the hypothesis to be tested, can be drawn. If a clinical trial is not registered (e.g. at https://clinicaltrials.gov), the study's quality is questionable, as the pre-specification of the study design, the endpoints to be evaluated, and the questions to be addressed cannot be verified. Furthermore, the statistical test hypothesis must match the clinical question under study. For example, it is a well-known error to conclude from an unsuccessful superiority study that the treatments are equally effective. If the aim is to demonstrate equal efficacy, an equivalence trial needs to be conducted; it tests whether the confidence interval of the treatment effect lies within pre-defined, clinically relevant equivalence limits. Likewise, it may be of interest whether one treatment is not inferior to another, i.e. not considerably less effective than the control; for this purpose, a non-inferiority trial is required. Readers should pay attention to clearly defined research questions with suitable hypotheses in order to be able to adequately evaluate conclusions (1, 7, 8).
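To make the distinction concrete, the three questions correspond to different pairs of test hypotheses. The following is a schematic sketch, writing δ for the treatment difference and Δ > 0 for a pre-specified, clinically relevant margin; the exact formulation depends on the endpoint and the effect measure chosen:

```latex
\begin{align*}
&\text{Superiority:}     & H_0&: \delta \le 0,        & H_1&: \delta > 0\\
&\text{Equivalence:}     & H_0&: |\delta| \ge \Delta, & H_1&: |\delta| < \Delta\\
&\text{Non-inferiority:} & H_0&: \delta \le -\Delta,  & H_1&: \delta > -\Delta
\end{align*}
```

A "negative" superiority test merely fails to reject the null hypothesis δ ≤ 0; it says nothing about whether |δ| < Δ holds, which is exactly what an equivalence trial is designed to test.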
Analysis populations are not correctly defined
Defining the analysis population is another important aspect of study planning. For this step, suitable inclusion and exclusion criteria have to be defined that are neither too narrow nor too broad and that reflect the characteristics of the target population. Readers should check whether the sample size is fully justified and methodologically consistent with the analysis planned for the primary outcome. Sample size planning is possible, and useful, for clinical observational studies too. On ethical grounds, the number of patients should be limited to the minimum required; only then is it possible to interpret the study results with regard to the power achieved (9).
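As an illustration of what such a pre-specified calculation involves, the following minimal sketch computes the per-group sample size for a two-arm comparison of means; the effect size, significance level, and power are illustrative assumptions, not values from any particular trial:

```python
# Minimal sample size sketch for a two-arm trial with a continuous
# endpoint (illustrative planning values, not from the article).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Assumed standardized effect size (Cohen's d) of 0.5, two-sided
# alpha of 5%, desired power of 80% -- all pre-specified in the protocol.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05,
                                   power=0.80, alternative='two-sided')
print(f"Required patients per group: {n_per_group:.1f}")  # approx. 63.8
```

Every input to this calculation (effect size, alpha, power, allocation ratio) belongs in the study protocol; readers should be suspicious if a publication reports a sample size without these ingredients.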
In RCTs, the intention-to-treat (ITT) principle is essential for the primary analysis. Within the ITT approach, all randomized patients are included and analyzed as part of the group they were assigned to. In superiority studies, the use of the ITT principle enables an unbiased, conservative estimation of the treatment effect, which is protected by randomization (10). An ideal study in which all patients provide complete data is rare. For this reason, the ICH-E9 guideline defines the principle of a full analysis set (FAS) that is intended to come as close to the ITT principle as possible (1). Excluding patients retrospectively is permissible only in a limited number of cases and should be considered thoroughly (1, 11, 12). Such a decision should be pre-specified in the study protocol and well-founded; it should be made objectively, without knowledge of the randomization group. The distinction between the FAS and the commonly used modified ITT (mITT) population is difficult to define. As a general rule, readers should be wary of analyses using the mITT population as the primary analysis: some studies use this approach to justify arbitrary exclusions after randomization. In such cases, readers should critically scrutinize the flow diagram (according to CONSORT [4]).
Non-randomized clinical trials are generally biased by selection effects, owing to the lack of randomization or to inclusion and exclusion criteria that were not defined prospectively. Nevertheless, the aim should be to perform an ITT analysis in such trials too. In this situation, however, it would be a mistake to perform an unadjusted group comparison: an adjustment is needed to minimize the selection effects. This is discussed in detail in the section titled “Statistical associations are often not causal”.
Common errors in the analysis
Unclear display of results in the description
It is more difficult for readers to evaluate the results of clinical trials if these are displayed incorrectly or selectively. In many cases, such an unclear (descriptive) presentation of data is unintentional, resulting, for example, from a lack of knowledge; in other cases, however, it may be deliberate, in order to manipulate. For example, descriptive analyses are at times inadequately displayed. For qualitative data, a comprehensive description should include absolute and relative frequencies; for quantitative data, measures of central tendency (mean, median) and of dispersion (standard deviation, range). It is not possible to evaluate variability if no measure of dispersion is provided, and the correct measure of variability should be displayed. Some authors prefer to present the standard error because it is generally smaller than the standard deviation; however, this value reflects the precision of the estimate and should not be interpreted as a measure of variability within the sample. Readers should check which statistical measures are reported and whether these are adequate (13). Furthermore, information on missing values is absolutely essential for a comprehensive evaluation.
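The reason the standard error is always the smaller value is a simple identity; for a sample of size n:

```latex
% Standard error of the mean versus standard deviation of the sample:
\mathrm{SE} = \frac{\mathrm{SD}}{\sqrt{n}}
% With SD = 10 and n = 100, the SE is 1: it shrinks as n grows,
% whereas the variability of the patients themselves (SD) does not.
```

Error bars showing the SE of a large study can therefore suggest far less patient-to-patient variability than actually exists.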
Errors in the reporting of endpoints
Likewise, results for the study endpoints can only be evaluated if they are presented comprehensively. Thus, apart from the number of patients included in the analysis, it is essential to report a description of the endpoint at baseline and an estimate of the treatment effect with its corresponding confidence interval. In addition to the evaluation of statistical significance, a confidence interval can give an indication of the relevance of a treatment (14, 15, 16). It is common practice to report only the p-value instead of the confidence interval and to regard small p-values as indicators of large treatment effects. This is wrong, because a p-value depends not only on the size of the underlying treatment effect but also on the sample size; the size of a treatment effect cannot be evaluated on the basis of the p-value alone. If the sample size is very large, very small treatment effects can achieve statistical significance. Consequently, it is not useful to categorize p-values into specific value ranges (e.g. “asterisk groups”) (14, 15).
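A minimal simulation sketch (with arbitrary illustrative numbers) makes the dependence on sample size tangible: the true effect is fixed at a clinically negligible 0.05 standard deviations, and only the number of patients changes.

```python
# Illustrative simulation: the same negligible true effect (0.05 SD)
# becomes "statistically significant" once the sample is large enough.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)
for n in (50, 500, 50_000):
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    treated = rng.normal(loc=0.05, scale=1.0, size=n)  # tiny true effect
    _, p = ttest_ind(treated, control)
    print(f"n per group = {n:>6}: p = {p:.4f}")
# With 50 000 patients per group the p-value is far below 0.05,
# although the size of the effect has not changed at all.
```

This is precisely why an effect estimate with a confidence interval, not a p-value, is the appropriate basis for judging clinical relevance.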
Analysis according to study protocol and statistical analysis plan
Readers should check whether the presented analyses were pre-specified in the study protocol, if this can be verified. Furthermore, it should be reported whether there was a statistical analysis plan (SAP) that was finalized prior to unblinding. If analyses of endpoints or subgroups are presented that yielded interesting results only during the analysis process, such analyses must be properly flagged and should be interpreted with caution (17).
The evaluation of the treatment effect should be based on the measure of effect (e.g. risk difference, relative risk) specified in the study protocol, with the corresponding confidence interval. For the evaluation of relative treatment effects, absolute numbers should be reported in addition. Even though the relative risk often facilitates the interpretation of results by summarizing two variables in one value, it is essential for a definite interpretation of study findings that relative risks are presented together with the absolute risks (18).
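A worked example with illustrative numbers shows why the absolute risks are indispensable: the same relative risk is compatible with very different absolute benefits.

```latex
% Identical relative risk, very different absolute effects:
\mathrm{RR} = \frac{2\,\%}{4\,\%} = 0.5
\qquad\text{versus}\qquad
\mathrm{RR} = \frac{20\,\%}{40\,\%} = 0.5
% The absolute risk reduction is 2 percentage points in the first case
% but 20 percentage points in the second, although both halve the risk.
```

A reported "50% risk reduction" is therefore uninterpretable without the baseline risk to which it applies.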
The analysis models used and the evaluation of their assumptions should be described clearly and comprehensively. A common error, which can easily be identified if a proper description is provided, is a failure to observe the “analyzed as randomized” principle, which implies that all stratification factors used in the randomization are included as independent variables in the analysis model. Furthermore, adjustments should be made only for baseline variables, not for variables assessed during the course of the study. Sensitivity analyses for the primary analysis help readers to evaluate the robustness of the treatment effect and should be available.
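As a sketch of what such a model might look like in practice (all variable names and the dataset are hypothetical, and the model form is only one of several reasonable choices):

```python
# Hypothetical sketch of an "analyzed as randomized" model: the factor
# used to stratify the randomization (here: centre) and the baseline
# value of the endpoint enter as covariates; post-baseline variables do not.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial_data.csv")  # hypothetical dataset
model = smf.ols("outcome ~ treatment + baseline + C(centre)", data=df).fit()
print(model.summary())  # coefficient of 'treatment' = adjusted effect estimate
```

If a publication reports a stratified randomization but an analysis model without the stratification factors, that discrepancy is worth noting.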
Furthermore, in the case of continuous endpoints, it may occur that data are categorized or transformed contrary to the pre-specification in the study protocol or SAP, or that, when the change from baseline is of interest, the treatment groups are evaluated separately. A significant change in the intervention group but not in the control group is a frequently reported finding; however, only the between-group difference is of interest (19).
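The following minimal simulation (illustrative values only) shows how separate within-group tests can mislead: both groups have exactly the same true improvement, yet one within-group test may be significant and the other not, while the correct between-group comparison shows no difference.

```python
# Illustrative simulation of the "significant in one group only" fallacy.
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind

rng = np.random.default_rng(seed=7)
n = 20
change_treated = rng.normal(0.4, 1.0, n)  # true mean change identical ...
change_control = rng.normal(0.4, 1.0, n)  # ... in both groups
print("within treated :", ttest_1samp(change_treated, 0.0).pvalue)
print("within control :", ttest_1samp(change_control, 0.0).pvalue)
# The only question of interest: do the groups differ from each other?
print("between groups :", ttest_ind(change_treated, change_control).pvalue)
```

With moderate sample sizes the two within-group p-values can easily fall on opposite sides of 0.05 even though the groups are identical by construction; the between-group test does not share this artifact.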
Common errors in the interpretation of results
Incorrect conclusions from non-significant results
Many errors in the conclusions of clinical trials occur due to incorrect interpretation of p-values (e.g. in the case of non-significant results), although this issue has been highlighted repeatedly over the last 40 years (e.g. in [14–16, 20]). Altman and Bland have consistently highlighted the underlying methodological problem that “absence of evidence is not evidence of absence” and have explicitly emphasized in this context that it is not useful to describe non-significant studies as “negative” (7). The use of the term “negative study” indirectly suggests that there is no difference between the treatments studied. If the aim of a study is to demonstrate equivalence, an equivalence trial should be planned, as mentioned in the section “The research question and hypotheses do not match the study’s aim”.
Incorrect conclusions due to multiple testing
Another error is related to multiple testing on the same data set (21, 22). This can occur, for example, whenever confirmatory testing is to be performed on more than two groups, on more than one primary endpoint, or at various points in time (e.g. in the final analysis and in one or more additional interim analyses). If multiple testing is not taken into account and corrected for in the analysis, the probability of study results occurring by chance will be underestimated. This results in an overinterpretation of statistically significant p-values, as the following example of a superiority study shows: as long as superiority is evaluated based on one endpoint at one point in time, the type 1 error is 5%, as specified in the study protocol. When two endpoints are evaluated at the same time, the type 1 error already amounts to 9.75%, i.e. it is almost twice as large. Multiple testing increases the likelihood that, because of random variation, a test result becomes statistically significant even though there is, in reality, no treatment effect. In order to prevent incorrect conclusions arising from multiple testing, an appropriate correction procedure must be pre-specified. A brief and easily understandable overview of statistical correction procedures can be found, for example, in Bender et al. (21) and Victor et al. (22). Specifically for RCTs, the handling of multiple primary endpoints is summarized, for example, in the guidance of the European regulatory authorities (23).
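The 9.75% in the example follows from a short calculation, assuming two independent tests each performed at the 5% level:

```latex
% Family-wise type 1 error for k independent tests at level \alpha:
P(\text{at least one false-positive result}) = 1 - (1 - \alpha)^k
% For k = 2 and \alpha = 0.05: 1 - 0.95^2 = 0.0975, i.e. 9.75%.
% A simple, conservative remedy is the Bonferroni correction, which
% performs each of the k tests at level \alpha / k.
```

With ten such tests, the probability of at least one chance finding already exceeds 40%, which is why a pre-specified correction procedure is indispensable.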
Overinterpretation of exploratory analyses in comparison to confirmatory analyses
Reports of clinical trials often do not distinguish between exploratory and confirmatory analyses. As shown in the Table, the objectives of the analyses and the generalizability of the conclusions are substantially different. According to Hirschauer et al., terms such as “hypothesis testing” and “statistically significant” should be avoided entirely in the context of exploratory analyses (15). Readers should carefully examine whether the terms used and the interpretation are appropriate.
Statistical associations are often not causal
While causality problems are frequently linked to observational studies, false causal inferences are by no means limited to this study design. As soon as there is a deviation from the ITT principle, readers have to expect bias resulting from selection effects. Analyses which are no longer protected by randomization and may provide biased results include analyses in the mITT population, analyses in the per-protocol (PP) population (restricted to patients who complete the study without major protocol deviations), and analyses based on subgroups (24). The conclusion that only the treatment effect from the PP analysis is the “real” one, describing the true potential of the intervention, is a classic error, since this analysis can also be biased by selection effects. In superiority trials, the PP analysis should serve only as a supplementary analysis (12).
In general, statistical correlation or association does not necessarily demonstrate causality. Associations can be influenced by confounders. In observational studies, it is therefore necessary to adjust for confounders determined at baseline. Ideally, the selection of these confounders should be justified on the basis of medical theory and critically evaluated by the reader. Data-driven variable selection methods, on the other hand, should be used only on a supplementary basis, as a relevant confounder does not have to be significant in the analysis model. Nevertheless, there may be confounders that cannot be measured; for this reason, one can speak only of associations, not of causality. Likewise, a temporal sequence of events does not imply causality (the “post hoc ergo propter hoc” fallacy) (25).
Discussion
In this article, we have described some errors and missing quality features that can lead readers of clinical publications astray. Even though standards and checklists are available (26) and numerous scientists have highlighted potential pitfalls and proposed solutions in the past (for example, [27, 28]), erroneous displays of results and incorrect interpretations still recur, even in high-ranking journals.
While such errors are made by researchers, they can be identified by various parties; for example, co-authors, reviewers, and readers should critically read and evaluate the results. But what might such an assessment look like? The published checklists are the most important tool. If authors were required to provide such lists upon submission, missing quality features could be identified before publication. In addition, these checklists in the publications’ appendices can serve as a readers’ aid. Furthermore, study protocols and SAPs should be included in the appendix to allow verification of whether the study was conducted and analyzed according to protocol.
This article focuses solely on clinical trials, even though there are other types of studies (e.g. diagnostic studies). Nevertheless, we would like to highlight another pitfall inherent in before-and-after comparisons, which are widely used because they are easy to make. Upon re-measurement, very extreme values tend to move towards the mean (“regression towards the mean”). Readers should keep this phenomenon in mind if, in a trial, a before-and-after comparison is interpreted as a treatment effect of the intervention, because a comparable change could just as well have occurred in a control group. This underscores the importance of a control group and should raise skepticism if a control group is missing in a study (29, 30).
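A short simulation sketch (with arbitrary illustrative numbers) shows the phenomenon in isolation, with no treatment at all: patients selected for extreme baseline values appear "improved" at re-measurement simply because measurement error does not repeat itself.

```python
# Illustrative simulation of regression towards the mean: no treatment
# is applied, yet the selected extreme group "improves" at follow-up.
import numpy as np

rng = np.random.default_rng(seed=3)
size = 10_000
true_value = rng.normal(100, 10, size)            # stable underlying values
baseline = true_value + rng.normal(0, 10, size)   # baseline measurement error
followup = true_value + rng.normal(0, 10, size)   # independent error at follow-up

extreme = baseline > 120                          # select "high" patients only
print("mean at baseline :", baseline[extreme].mean())  # well above 120
print("mean at follow-up:", followup[extreme].mean())  # noticeably lower
```

Without a control group, this purely statistical drop would be indistinguishable from a treatment effect.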
Apart from these errors, there are further problems that complicate interpretation. For example, publication bias may arise because studies with a significant result are more likely to be published (31). By registering a study (for example, at https://clinicaltrials.gov) prior to its start, such biases can at least be exposed. Moreover, authors should be reminded during the publication process to name and discuss the limitations of their study results in the manuscript.
In summary, statistical methods are only tools: they must be applied correctly to a robust dataset and interpreted with care. It is therefore very important for readers of RCTs and clinical intervention studies to be familiar with these tools and to put them to use.
Conflict of interest statement
AG is editor at the journal Deutsche Medizinische Wochenschrift and receives a fee for providing her expert opinion.
AS declares that no conflict of interest exists.
Manuscript received on 14 December 2023, revised version accepted on 17 May 2024.
Translated from the original German by Ralf Thoene, M.D.
Corresponding author
Dr. rer. hum. biol. Anika Großhennig
Medizinische Hochschule Hannover
Institut für Biometrie
Carl-Neuberg-Straße 1
30625 Hannover, Germany
grosshennig.anika@mh-hannover.de
Cite this as:
Suling A, Grosshennig A: Recognizing statistical problems in reports of clinical trials: a readers’ aid. Part 33 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2024; 121: 634–8.
DOI: 10.3238/arztebl.m2024.0113
Institute of Biostatistics, Hannover Medical School, Hannover, Germany: Dr. rer. hum. biol. Anika Großhennig
References
1. European Medicines Agency: ICH E9 statistical principles for clinical trials. www.ema.europa.eu/en/ich-e9-statistical-principles-clinical-trials-scientific-guideline (last accessed on 14 December 2023).
2. Chan AW, Tetzlaff JM, Altman DG, et al.: SPIRIT 2013 statement: defining standard protocol items for clinical trials. Ann Intern Med 2013; 158: 200–7.
3. Mahajan R, Burza S, Bouter LM, et al.: Standardized Protocol Items Recommendations for Observational Studies (SPIROS) for observational study protocol reporting guidelines: protocol for a Delphi study. JMIR Res Protoc 2020; 9: e17864.
4. Schulz KF, Altman DG, Moher D, CONSORT Group: CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010; 340: c332.
5. Ghaferi AA, Schwartz TA, Pawlik TM: STROBE reporting guidelines for observational studies. JAMA Surg 2021; 156: 577–8.
6. Röhrig B, du Prel JB, Wachtlin D, Blettner M: Types of study in medical research—part 3 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2009; 106: 262–8.
7. Altman DG, Bland JM: Absence of evidence is not evidence of absence. Aust Vet J 1996; 74: 311.
8. Wellek S, Blettner M: Establishing equivalence or non-inferiority in clinical trials—part 20 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2012; 109: 674–9.
9. Röhrig B, du Prel JB, Wachtlin D, Kwiecien R, Blettner M: Sample size calculation in clinical trials—part 13 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2010; 107: 552–6.
10. Gupta SK: Intention-to-treat concept: a review. Perspect Clin Res 2011; 2: 109–12.
11. Yelland LN, Sullivan TR, Voysey M, Lee KJ, Cook JA, Forbes AB: Applying the intention-to-treat principle in practice: guidance on handling randomisation errors. Clin Trials 2015; 12: 418–23.
12. Ranganathan P, Pramesh CS, Aggarwal R: Common pitfalls in statistical analysis: intention-to-treat versus per-protocol analysis. Perspect Clin Res 2016; 7: 144–6.
13. Altman DG, Bland JM: Standard deviations and standard errors. BMJ 2005; 331: 903.
14. Gardner MJ, Altman DG: Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed) 1986; 292: 746–50.
15. Hirschauer N, Mußhoff O, Grüner S, Frey U, Theesfeld I, Wagner P: Die Interpretation des p-Wertes – Grundsätzliche Missverständnisse [The interpretation of the p-value: fundamental misunderstandings]. Journal of Economics and Statistics 2016; 236: 557–75.
16. Matthews JN, Altman DG: Statistics notes. Interaction 2: compare effect sizes not P values. BMJ 1996; 313: 808.
17. Greenberg L, Jairath V, Pearse R, Kahan BC: Pre-specification of statistical analysis approaches in published clinical trial protocols was inadequate. J Clin Epidemiol 2018; 101: 53–60.
18. Noordzij M, van Diepen M, Caskey FC, Jager KJ: Relative risk versus absolute risk: one cannot be interpreted without the other. Nephrol Dial Transplant 2017; 32 (Suppl 2): ii13–8.
19. Bland JM, Altman DG: Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials 2011; 12: 264.
20. Greenland S, Senn SJ, Rothman KJ, et al.: Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 2016; 31: 337–50.
21. Bender R, Lange S, Ziegler A: [Multiple testing]. Dtsch Med Wochenschr 2007; 132 (Suppl 1): e26–9.
22. Victor A, Elsäßer A, Hommel G, Blettner M: Judging a plethora of p-values: how to contend with the problem of multiple testing—part 10 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2010; 107: 50–6.
23. European Medicines Agency: Multiplicity issues in clinical trials—scientific guideline. www.ema.europa.eu/en/multiplicity-issues-clinical-trials-scientific-guideline (last accessed on 14 December 2023).
24. Desai M, Pieper KS, Mahaffey K: Challenges and solutions to pre- and post-randomization subgroup analyses. Curr Cardiol Rep 2014; 16: 531.
25. Lederer DJ, Bell SC, Branson RD, et al.: Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals. Ann Am Thorac Soc 2019; 16: 22–8.
26. EQUATOR Network: Enhancing the QUAlity and Transparency Of health Research. www.equator-network.org/ (last accessed on 14 December 2023).
27. Clark GT, Mulligan R: Fifteen common mistakes encountered in clinical research. J Prosthodont Res 2011; 55: 1–6.
28. Evans SR: Common statistical concerns in clinical trials. J Exp Stroke Transl Med 2010; 3: 1–7.
29. Bland JM, Altman DG: Regression towards the mean. BMJ 1994; 308: 1499.
30. Bland JM, Altman DG: Some examples of regression towards the mean. BMJ 1994; 309: 780.
31. Thornton A, Lee P: Publication bias in meta-analysis: its causes and consequences. J Clin Epidemiol 2000; 53: 207–16.