Review article
One Question, Many Results—Why Epidemiological Studies Yield Heterogeneous Findings
Part 34 of a series on evaluation of scientific publications
Background: Observational epidemiological studies often yield different results on the same research question. In this article, we explain how this comes about.
Methods: In this review, which is based on publications retrieved by a selective search in PubMed and the Web of Science, we use information from international publications, simulation studies on sampling error, and a quantitative bias analysis on fictitious data to demonstrate why the results of epidemiological studies are often uncertain, and why it is, therefore, generally necessary to perform more than one study on any particular question.
Results: Sampling errors, imprecise measurements, alternative but equally appropriate methods of data analysis, and features of the populations being studied are common reasons why studies on the same question can yield different results. Simulation studies are used to illustrate the fact that effect estimates such as relative risks or odds ratios can deviate markedly from the true value because of sampling error, i.e., by chance alone. Quantitative bias analysis is used to show how strongly effect estimates can be distorted by misclassification of exposures or outcomes. Finally, it is shown through illustrative examples that different but equally appropriate methods of data analysis can lead to divergent study results.
Conclusion: The above reasons why epidemiological study results can be heterogeneous are explained in this review. Quantitative bias analyses and sensitivity analyses with alternative data evaluation strategies can help explain divergent results on one and the same question.


Epidemiological observational studies on very similar questions often come to heterogeneous conclusions. For example, several studies were conducted to determine whether persons whose spouse had type 2 diabetes were themselves more likely to contract this disease. The odds ratios (ORs) for this association ranged from 1.0 (95% confidence interval [0.6; 1.6]) to 7.2 [2.9; 18.0] in cross-sectional studies and from 1.2 [0.9; 1.7] to 8.7 [7.4; 10.2] in cohort studies (1). The findings thus cover the spectrum from no association at all, through weak associations, to very strong associations. There is a further reason why it is inadvisable to rely on the first study to appear on a given topic: the first published study on a particular research question not infrequently overestimates the effect (2, 3). Such early results showing excessive effects are often published in journals with high impact factors, whereas subsequent replication studies investigating the same question and reporting weaker effects tend to appear in journals with lower impact factors (4).
The reason why the findings of epidemiological studies on very similar questions vary so widely has little to do with dishonesty or scientific misconduct. Rather, the differences result from random and systematic errors, which according to Rothman represent the principal sources of error in epidemiological studies (5). Systematic errors (bias) may be caused by confounders. Bias can also relate to how study participants are chosen (selection bias) or how study variables are quantified (information bias) (6). In this article we begin by describing the impact of chance and the effect of imprecise measurement on study findings. For reasons of space, we do not explore the effect of selection bias. We then go on to show how the choice of strategy for data evaluation, including the handling of confounding, also contributes to the heterogeneity of study results. Finally, we briefly describe how the choice of the population from which the sample is drawn influences the findings. We concentrate here on epidemiological observational studies, but the phenomenon of heterogeneous findings on the same question also plays a role in clinical trials and in non-medical disciplines such as biology and psychology.
Method
A selective search of PubMed and the Web of Science was carried out to establish the most important reasons why a single epidemiological study is usually insufficient. Simulations were conducted with the software SAS 9.4 (SAS Institute, Cary, USA) to draw repeated samples of the same size from the same population. Freely available Excel sheets were used to perform quantitative bias analyses on fictitious data (7).
Results
The impact of chance
Even if one assumes that an epidemiological study contains no other errors, a single study cannot be expected to show the true effect, owing to chance alone. This influence of chance is inherent in the fact that, on grounds of practicality if nothing else, epidemiologists study samples rather than whole populations.
The influence of chance sampling error is illustrated below with the aid of three simulations. In all three, the OR for the association between an exposure with two states (“present” versus “absent”) and an illness (“present” versus “absent”) is to be estimated.
Simulation 1:
The true odds ratio is 1.00 and the sample size is n = 2000
In the first simulation, we created an artificial data set comprising 10 000 cases (i.e., ill persons), 3000 of whom are exposed, and 20 000 controls (i.e., healthy persons), of whom 6000 are exposed. This data set represents the population from which samples are drawn to estimate the association between exposure and illness. The true OR is 1.00, because the proportion of exposed persons is identical in cases and controls, namely 30%. One hundred random samples are now drawn from the population, each comprising 1000 cases and 1000 controls. In each sample, a chance deviation of the proportion of exposed persons from the true value (30%) can be expected. The OR is estimated for each of the 100 samples. Figure 1a shows that the 100 ORs range from 0.74 to 1.28. In five samples the confidence interval does not include the true OR of 1.00. By chance alone, therefore, the estimated ORs deviate, in some cases considerably, from the true null effect of OR = 1.00, even in the absence of confounding, imprecise measurement, and selection bias.
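A minimal Python sketch of this repeated sampling (not the original SAS program; the random seed and the use of NumPy are illustrative assumptions) might look as follows:

import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, chosen only for reproducibility

# Population of simulation 1: 10 000 cases (3000 exposed) and 20 000 controls (6000 exposed)
cases = np.array([1] * 3000 + [0] * 7000)        # 1 = exposed, 0 = unexposed
controls = np.array([1] * 6000 + [0] * 14000)

odds_ratios = []
for _ in range(100):
    s_cases = rng.choice(cases, size=1000, replace=False)        # 1000 cases per sample
    s_controls = rng.choice(controls, size=1000, replace=False)  # 1000 controls per sample
    a, b = s_cases.sum(), 1000 - s_cases.sum()                   # exposed/unexposed cases
    c, d = s_controls.sum(), 1000 - s_controls.sum()             # exposed/unexposed controls
    odds_ratios.append((a * d) / (b * c))

print(min(odds_ratios), max(odds_ratios))  # spread of the 100 estimated ORs around the true OR of 1.00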
Simulations 2 and 3:
The true odds ratio is 1.26 and the sample sizes are n = 400 and n = 2000
In the other two simulations, the population comprised 10 000 cases (including 3500 exposed persons) and 20 000 controls (including 6000 exposed persons). The equation OR = (3500/6500)/(6000/14 000) now yields a true OR of 1.26, i.e., exposed persons have 1.26 times the odds of disease of unexposed persons.
First, 100 random samples of 400 (200 cases, 200 controls) are drawn. The ORs of these samples range from 0.82 to 2.40, in some cases deviating massively from the true value of 1.26 (Figure 1b). When 100 larger random samples are drawn, this time of 2000 (1000 cases, 1000 controls), the ORs unsurprisingly vary less widely, ranging from 1.03 to 1.54 (Figure 1c). Nevertheless, even with the larger sample size the ORs of some samples deviate considerably from the true value.
This random sampling error, which leads to large differences between the estimated ORs and the true OR, can be reduced by using larger samples but can never be completely eliminated.
Comparison of simulations 2 and 3 shows how sample size affects the width of the confidence interval. In smaller samples the role of chance is greater and the confidence intervals are wider. With wide confidence intervals, a study is compatible with a large range of estimates. Pooling of studies in meta-analyses yields narrower confidence intervals and therefore results that are more precise, although not necessarily more valid.
The tendency towards preferential publication of statistically significant results still persists (8). In simulation 2, 12 ORs are statistically significant at the 5% level (p < 0.05), as seen from the fact that their confidence intervals do not include the null value of 1 (shown in red in Figure 1b). In these 12 samples the median OR is 1.66, i.e., the statistically significant results greatly exceed the true effect (OR = 1.26). This phenomenon is termed truth inflation (9). If statistically significant results are preferentially published (publication bias), this is another reason not to rely on one single epidemiological study.
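The truth-inflation phenomenon of simulation 2 can be reproduced with a similar sketch, again in Python with an arbitrary seed; the Wald confidence interval for the log OR is an assumption about the interval type:

import numpy as np

rng = np.random.default_rng(seed=2)
cases = np.array([1] * 3500 + [0] * 6500)      # population of simulations 2 and 3
controls = np.array([1] * 6000 + [0] * 14000)  # true OR = (3500/6500)/(6000/14000) ≈ 1.26

significant_ors = []
for _ in range(100):
    sc = rng.choice(cases, size=200, replace=False)     # 200 cases (simulation 2)
    sk = rng.choice(controls, size=200, replace=False)  # 200 controls
    a, b = sc.sum(), 200 - sc.sum()
    c, d = sk.sum(), 200 - sk.sum()
    or_hat = (a * d) / (b * c)
    se_log = np.sqrt(1/a + 1/b + 1/c + 1/d)             # standard error of log(OR)
    lo, hi = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se_log)  # 95% Wald CI
    if lo > 1 or hi < 1:                                 # CI excludes the null value 1
        significant_ors.append(or_hat)

print(np.median(significant_ors))  # median of the "significant" ORs, typically above the true 1.26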
The impact of misclassifications on the result
Although chance sampling errors can be reduced by increasing sample size, this is not the case for systematic errors (bias). We will now describe how to quantify the impact of misclassification, a form of information bias, on study results.
One means of assessing the influence of misclassification on effect estimates is quantitative bias analysis. It is assumed that a certain proportion of the participants have been incorrectly classified with regard to exposure or outcome. Using plausible assumptions about the extent of misclassification, a corrected two-by-two table is calculated on the basis of the observed two-by-two table. Provided the assumptions are correct, the corrected table yields the true effect estimates. This will now be illustrated by means of a fictitious example.
A case–control study yields the two-by-two table shown in Table 1a, which produces an OR of 2.25 [1.74; 2.91]. In an initial scenario it is assumed that actually exposed persons were classified correctly in 98% of the cases and 90% of the controls (cf. scenario 1 in Table 2), i.e., that 2% and 10%, respectively, of those actually exposed were incorrectly categorized as not exposed. Moreover, it is assumed that those who were not exposed were classified correctly in 90% of the cases and 98% of the controls. With these four assumptions—which are none other than sensitivity and specificity for exposure (separately for cases and controls)—simple equations or freely available Excel sheets can be used to calculate the two-by-two table correcting for the assumed misclassification (7).
For scenario 1, the correction yields the two-by-two table shown in Table 1b, with a true OR of 1.28. Five more scenarios are displayed in Table 2. In scenarios 1–4, sensitivity and specificity are always quite high, at least 0.9. Nevertheless, the corrected ORs deviate, in some cases considerably, from the uncorrected value (OR = 2.25; calculated under the unrealistic assumption of no misclassification). In scenarios 5 and 6, where a sensitivity of 0.8 is assumed for the controls, the corrected ORs are 1.11 and 1.04, respectively. This means that with the misclassifications assumed in scenarios 5 and 6, the true effects are close to the null effect.
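A minimal sketch of this correction in Python is given below. The cell counts used here are placeholders chosen only so that the observed OR equals 2.25; they are not the counts of Table 1a, so the corrected OR will not exactly reproduce the 1.28 of Table 1b. The formula follows the standard bias-analysis correction for exposure misclassification (7).

def correct_2x2(a, b, c, d, se_case, sp_case, se_ctrl, sp_ctrl):
    # a/b = observed exposed/unexposed cases, c/d = observed exposed/unexposed controls;
    # se/sp = sensitivity and specificity of exposure classification, separately for cases and controls
    A = (a - (a + b) * (1 - sp_case)) / (se_case + sp_case - 1)  # corrected exposed cases
    B = (a + b) - A                                              # corrected unexposed cases
    C = (c - (c + d) * (1 - sp_ctrl)) / (se_ctrl + sp_ctrl - 1)  # corrected exposed controls
    D = (c + d) - C                                              # corrected unexposed controls
    return (A * D) / (B * C)                                     # corrected odds ratio

# Placeholder table with an observed OR of (300*840)/(700*160) = 2.25
print(correct_2x2(300, 700, 160, 840,
                  se_case=0.98, sp_case=0.90,   # scenario 1: classification of cases
                  se_ctrl=0.90, sp_ctrl=0.98))  # scenario 1: classification of controls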
These examples show how even modest misclassification can lead to marked differences between corrected and uncorrected effect estimates.
Alternative interpretations of the data
Even with an identical study question and an identical data set, analysis of the data by different persons may lead to widely varying findings. Research into this phenomenon in the field of epidemiology is sparse, in contrast to disciplines such as the neurosciences, psychology, the social sciences, and economics (10, 11, 12, 13, 14, 15, 16, 17). One well-known example is the study in which 29 groups of researchers investigated whether soccer referees showed more yellow cards to players with darker skin (14). Using exactly the same data set, the different groups came up with ORs ranging from 0.89 to 2.93. Silberzahn et al. attributed the heterogeneity of the results to the numerous subjective, yet generally justified, decisions that had to be made in the process of analysis. For instance, the authors discovered that the 29 research groups used no fewer than 21 different combinations of covariates.
The evaluation of epidemiological data entails numerous legitimate choices that may lead to profoundly differing results. A list of such alternative interpretations can be found in the eTable. In contrast to math problems at school, epidemiological analyses have no indisputably correct conclusion, as the following examples make clear.
Example 1: Use of continuous variables in regression models
There are various ways of fitting continuous variables such as age into a regression model, e.g., linearly, quadratically, categorically, or by means of statistically more sophisticated methods such as fractional polynomials. The precise transformation, often not reported in detail in published articles, may well have a relevant impact on the effect estimate. Groenwold et al. explored the age-adjusted association between cigarette smoking and cardiovascular death and found ORs of 1.49 [1.03; 2.17] with age as a linear variable, 1.40 [0.97; 2.02] with age as a dichotomous (two-level categorical) variable, 1.57 [1.08; 2.31] with age as a 5-level categorical variable, and 1.65 [1.13; 2.43] with fractional polynomials (18).
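The following Python sketch (using statsmodels and simulated data, not the data or coefficients of Groenwold et al.) shows how the same adjustment variable can be entered in different codings and how the adjusted OR for the exposure shifts accordingly:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=3)
n = 5000
age = rng.uniform(40, 80, n)
smoking = rng.binomial(1, 1 / (1 + np.exp(-(-2 + 0.03 * age))))            # smoking depends on age (confounding)
death = rng.binomial(1, 1 / (1 + np.exp(-(-6 + 0.06 * age + 0.4 * smoking))))  # illustrative coefficients
df = pd.DataFrame({"age": age, "smoking": smoking, "death": death})
df["age_cat5"] = pd.cut(df["age"], 5).astype(str)  # age recoded as a 5-level categorical variable

m_linear = smf.logit("death ~ smoking + age", data=df).fit(disp=0)       # age entered linearly
m_cat = smf.logit("death ~ smoking + C(age_cat5)", data=df).fit(disp=0)  # age entered categorically
print(np.exp(m_linear.params["smoking"]), np.exp(m_cat.params["smoking"]))  # two adjusted ORs for smoking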
Example 2: Use of different adjustment sets for confounding
Simply formulated, confounders are variables that represent risk factors for the outcome of interest, are additionally associated with the exposure, and are not mediators between exposure and outcome (6). Which confounders should ideally be adjusted for in a given analysis can rarely be stated with certainty, and the number of candidate confounders is often large. Patel et al. studied the association between exposure to α-tocopherol and death (19). For the 8192 (= 2¹³) different adjustment sets that can be formed from 13 potential confounders, Patel et al. determined the corresponding 8192 hazard ratios for the association between α-tocopherol and death; 98% of them lay between 0.88 and 1.06. Although the choice of adjustment set in this example exerted no extreme influence on the effect estimate, it did affect whether a protective effect of α-tocopherol (hazard ratio < 1.00) was reported or not.
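A sketch of this "vibration of effects" idea in Python is shown below, with simulated data, logistic regression instead of the Cox models of Patel et al. (so ORs replace hazard ratios), and only three hypothetical candidate confounders:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from itertools import combinations

rng = np.random.default_rng(seed=4)
n = 3000
df = pd.DataFrame({"c1": rng.normal(size=n),
                   "c2": rng.normal(size=n),
                   "c3": rng.binomial(1, 0.5, n)})
df["exposure"] = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * df["c1"] + 0.3 * df["c3"]))))
df["outcome"] = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 0.2 * df["exposure"]
                                                  + 0.4 * df["c1"] + 0.3 * df["c2"]))))

candidates = ["c1", "c2", "c3"]
for k in range(len(candidates) + 1):
    for subset in combinations(candidates, k):   # all 2^3 = 8 possible adjustment sets
        formula = "outcome ~ exposure" + "".join(" + " + c for c in subset)
        fit = smf.logit(formula, data=df).fit(disp=0)
        print(sorted(subset), round(np.exp(fit.params["exposure"]), 2))  # adjusted OR for each adjustment set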
A third example, this time of the calculation of excess mortality, can be found in the eBox.
Insofar as the heterogeneity of study results is attributable to different evaluation strategies, it can be reduced somewhat by formulating the research question more precisely. If, for instance, the definition of the population, the location and time of the study, precise definitions of the exposure and the outcome, and the nature of the hypothesis (one- or two-sided) all formed part of the question, the number of legitimate analytic options would be reduced.
Impact of the underlying population
Study results can also vary if the question is applied to different populations or to the same population at different times. This heterogeneity results not from errors but, strictly speaking, from non-identical study questions. For example, the association between smoking and cancer in a given population can vary over time with, for instance, changes in tobacco composition or inhalation style. A further example is provided by Rothman et al. (20): A disease occurs only if factors A, B, and C, which together form a sufficient causal complex, are all present at the same time. All persons in populations 1 and 2 are exposed to C, while B—say, a particular form of nutrition—is ubiquitous in population 1 but non-existent in population 2. Although the disease-triggering mechanism is identical in the two populations, one would observe that factor A always causes disease in population 1, but never in population 2. A single study focusing on population 2 would be misleading.
Conclusion
Although epidemiological observational studies on the same research question often yield heterogeneous results, it is legitimate to ask whether one single large, well-planned, well-conducted, and well-analyzed study might not be just as good as a large number of individual studies. Theoretically, this may be the case. In practice, however, epidemiological studies are subject to a large variety of sources of error that are not always easy to trace, so errors can never be excluded with certainty in a single study. Moreover, the results obtained in one population cannot simply be assumed to be valid for another population. Finally, data analysis involves a multitude of legitimate choices that may lead to different results; it is therefore not sufficient to pay attention to the result of just one decision pathway in a single data analysis.
The heterogeneity of study results can be reduced, for example, by the following measures: chance errors can, to a certain extent, be decreased by using larger samples; quantitative bias analysis is available for quantifying the impact of misclassification; and, with regard to data analysis, recommendations exist on how to decrease the heterogeneity of study results (e.g., multiple imputation to deal with missing values, derivation of adjustment sets for confounding with the aid of directed acyclic graphs [DAGs]) (21, 22). Multiverse analyses, in which more than just one subjective evaluation strategy is applied, can also help to narrow the range of results.
Conflict of interest statement
The authors declare that no conflict of interest exists.
Submitted on 6 November 2023, revised version accepted on 20 June 2024
Translated from the original German by David Roseveare
Corresponding author
Prof. Dr. rer. nat. Dr. rer. san Bernd Kowall
Institut für Medizinische Informatik,
Biometrie und Epidemiologie (IMIBE)
Universitätsklinik Essen
Hufelandstr. 55
45147 Essen, Germany
bernd.kowall@uk-essen.de
Cite this as:
Kowall B, Stolpe S, Galetzka W, Nonnemacher M, Stang A: One question, many answers—why epidemiological studies yield heterogeneous findings. Part 34 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2024; 121: 740–5.
DOI: 10.3238/arztebl.m2024.0135
School of Public Health, Department of Epidemiology, Boston University, Boston, USA: Prof. Dr. med. Andreas Stang
1. Appiah D, Schreiner PJ, Selvin E, Demerath EW, Pankow JS: Spousal diabetes status as a risk factor for incident type 2 diabetes: a prospective cohort study and meta-analysis. Acta Diabetol 2019; 56: 619–29.
2. Fanelli D, Costas R, Ioannidis JP: Meta-assessment of bias in science. Proc Natl Acad Sci USA 2017; 114: 3714–9.
3. Schooler J: Unpublished results hide the decline effect. Nature 2011; 470: 437.
4. Amrhein V, Korner-Nievergelt F, Roth T: The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ 2017; 5: e3544.
5. Rothman K: Epidemiology. An Introduction. 2nd edition. New York: Oxford University Press 2012; 124.
6. Hammer GP, du Prel JB, Blettner M: Avoiding bias in observational studies: part 8 in a series of articles on evaluation of scientific publications. Dtsch Arztebl Int 2009; 106: 664–8.
7. Fox MP: Applying quantitative bias analysis to epidemiologic data. Misclassification spreadsheet. https://sites.google.com/site/biasanalysis (last accessed on 4 October 2023).
8. Hopewell S, Loudon K, Clarke MJ, Oxman AD, Dickersin K: Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database Syst Rev 2009; 2009: MR000006.
9. Ioannidis JPA: Why most discovered true associations are inflated. Epidemiology 2008; 19: 640–8.
10. Huntington-Klein N, Arenas A, Beam E, et al.: The influence of hidden researcher decisions in microeconomics. Economic Inquiry 2021; 59: 944–60.
11. Botvinik-Nezer R, Holzmeister F, Camerer CF, et al.: Variability in the analysis of a single neuroimaging dataset by many teams. Nature 2020; 582: 84–8.
12. Veronese M, Rizzo G, Belzunce M, et al.: Reproducibility of findings in modern PET neuroimaging: insight from the NRM2018 grand challenge. J Cereb Blood Flow Metab 2021; 41: 2778–96.
13. Fillard P, Descoteaux M, Goh A, et al.: Quantitative evaluation of 10 tractography algorithms on a realistic diffusion MR phantom. Neuroimage 2011; 56: 220–34.
14. Silberzahn R, Uhlmann EL, Martin DP, et al.: Many analysts, one data set: making transparent how variations in analytic choices affect results. Adv Methods Pract Psychol Sci 2018; 1: 337–56.
15. Breznau N, Rinke EM, Wuttke A, et al.: Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc Natl Acad Sci USA 2022; 119: e2203150119.
16. Salganik MJ, Lundberg I, Kindel AT, et al.: Measuring the predictability of life outcomes with a scientific mass collaboration. Proc Natl Acad Sci USA 2020; 117: 8398–403.
17. Hoogeveen S, Sarafoglou A, Aczel B, et al.: A many-analysts approach to the relation between religion and well-being. Religion, Brain & Behavior 2022; 13: 237–83.
18. Groenwold RHH, Klungel OH, Altman DG, van der Graaf Y, Hoes AW, Moons KGM: Adjustment for continuous confounders: an example of how to prevent residual confounding. CMAJ 2013; 185: 401–6.
19. Patel CJ, Burford B, Ioannidis JPA: Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. J Clin Epidemiol 2015; 68: 1046–58.
20. Rothman K, Greenland S, Lash TL: Modern Epidemiology. 3rd edition. Philadelphia: Wolters Kluwer 2008.
21. Hayati Rezvan P, Lee KJ, Simpson JA: The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol 2015; 15: 30.
22. Schipf S, Knüppel S, Hardt J, Stang A: Directed Acyclic Graphs (DAGs)—Die Anwendung kausaler Graphen in der Epidemiologie. Gesundheitswesen 2011; 73: 888–92.
e1. Kowall B, Stang A: Estimates of excess mortality during the COVID-19 pandemic strongly depend on subjective methodological choices. Herz 2023; 48: 180–3.
e2. Levitt M, Zonta F, Ioannidis JPA: Excess death estimates from multiverse analysis in 2009–2021. Eur J Epidemiol 2023; 38: 1129–39.