
Review article

Studies for the Evaluation of Diagnostic Tests

Part 28 of a Series on Evaluation of Scientific Publications

Dtsch Arztebl Int 2021; 118: 555-60. DOI: 10.3238/arztebl.m2021.0224

Hoyer, A; Zapf, A

Background: The accurate diagnosis of a disease is a prerequisite for its appropriate treatment. How well a medical test is able to correctly identify or rule out a target disease can be assessed by diagnostic accuracy studies.

Methods: The main statistical parameters that are derived from diagnostic accuracy studies, and their proper interpretation, will be presented here in the light of publications retrieved by a selective literature search, supplemented by the authors’ own experience. Aspects of study planning and the analysis of complex studies on diagnostic tests will also be discussed.

Results: In the usual case, the findings of a diagnostic accuracy study are presented in a 2 × 2 contingency table containing the number of true-positive, true-negative, false-positive, and false-negative test results. This information allows the calculation of various statistical parameters, of which the most important are the two pairs sensitivity/specificity and positive/negative predictive value. All of these parameters are quotients, with the number of true positive (resp. true negative) test results in the numerator; the denominator is, in the first pair, the total number of ill (resp. healthy) patients, and in the second pair, the total number of patients with a positive (resp. negative) test. The predictive values are the parameters of greatest interest to physicians and patients, but their main disadvantage is that they can easily be misinterpreted. We will also present the receiver operating characteristic (ROC) curve and the area under the curve (AUC) as additional important measures for the assessment of diagnostic tests. Further topics are discussed in the supplementary materials.

Conclusion: The statistical parameters used to assess diagnostic tests are primarily based on 2 × 2 contingency tables. These parameters must be interpreted with care in order to draw correct conclusions for use in medical practice.


The diagnosis of a disease is the first step on the road to its treatment. The evaluation of the underlying diagnostic procedures is performed in what are referred to as diagnostic studies, which determine how well a diagnostic instrument, for example, a laboratory test, detects the presence of a disease.

The correct determination of the results of diagnostic tests is of central importance, since a positive result impacts not only the affected person, but—as in the SARS-CoV-2 pandemic—potentially also the social environment (1). In this context, the probability of the true presence of SARS-CoV-2 infection in patients that have tested positive is of particular importance—a probability that is also influenced by the increasing number of tests carried out in the population and by current infection rates (1, 2). Against this backdrop, it is crucial that physicians are able to correctly assess diagnostic parameters. However, misinterpretation of measured values of this kind is not new, irrespective of the test or disease, and the situation has not improved significantly over the years (3, 4, 5, 6).

Therefore, the aim of this paper is to present the various measures of accuracy of a diagnostic test and to describe the relationship between the measures in order that, after reading the article, the reader will be able to correctly interpret an individual test result.

Measures of diagnostic accuracy

In a first step, we present the diagnostic 2 × 2 contingency table and prevalence, followed by the most important parameters, sensitivity and specificity, as well as predictive values and accuracy. The equations of the empirical estimators are given in the Box, while in the text they are directly applied to an example. For diagnostic tests that yield a metric value or score rather than a binary result, we present the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC). As a general rule, confidence intervals (CI) should also be given for all diagnostic parameters. For sensitivity, specificity, predictive values, and accuracy, we recommend logit confidence intervals, since these yield plausible results, most notably even when case numbers are small, and guarantee that the limits do not lie outside the [0;1] interval. For details, the reader is referred to the relevant literature (7, 8).

Box
The equations of the empirical estimators

Diagnostic 2 × 2 contingency table

When the test result is binary (positive versus negative), the results of a diagnostic study can be mapped in diagnostic 2 × 2 contingency tables (Table 1). Since a diagnostic test generally yields a metric value or score as its result, a cut-off value needs to be defined in order to obtain a binary coding. The diagnostic test to be evaluated will be referred to hereafter as the index test. This contrasts with the so-called gold or reference standard, which defines the “true” disease state. These two terms are often used synonymously. However, since “gold standard” is often associated with an assumption of a perfect definition of the “true” disease status, a status that is not necessarily present in practice, we have chosen to use the term reference standard below. The most reliable method to determine true disease status should be chosen as the reference standard. This is often not feasible in routine practice, for example because it is too invasive, expensive, or time-consuming, or because it can only be used after death. Based on the results of the index test (T+ [positive] versus T− [negative]) and the reference standard (D1 [with disease] versus D0 [without disease]), each participant is classified as true-positive (TP), true-negative (TN), false-positive (FP), or false-negative (FN). The respective row and column sums are given as n1 and n0 for the number of people with and without the disease, respectively, and as n+ and n− for the number of people that tested positive and negative, respectively. N denotes the total number of study participants.
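As an illustration of this classification (the paired results below are hypothetical, not taken from the study), the four cell counts of the 2 × 2 table can be tallied from (index test, reference standard) pairs:

```python
# Tally a diagnostic 2x2 contingency table from paired binary results.
# Each record is (test_positive, diseased); the data below are hypothetical.
def contingency_table(records):
    tp = sum(1 for t, d in records if t and d)          # true positives
    fp = sum(1 for t, d in records if t and not d)      # false positives
    fn = sum(1 for t, d in records if not t and d)      # false negatives
    tn = sum(1 for t, d in records if not t and not d)  # true negatives
    return tp, fp, fn, tn

records = [(True, True), (True, False), (False, True), (False, False), (True, True)]
tp, fp, fn, tn = contingency_table(records)  # (2, 1, 1, 1)
```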

Table 1
Diagnostic 2 × 2 contingency table as the result of a diagnostic study

Example study

For illustrative purposes, this article uses the study conducted by Papoz et al., who evaluated HbA1c as a screening marker for the diagnosis of type 2 diabetes (9). The oral glucose tolerance test (OGTT) was used as the corresponding reference standard procedure. An HbA1c of 6.5 (among other parameters), which is currently used to diagnose type 2 diabetes, was used as the diagnostic cut-off value for the index test (10). This means that study participants in whom an HbA1c of 6.5 or higher was measured were classified as positive. Table 2 shows the corresponding diagnostic 2 × 2 contingency table.

Table 2
Results of the study by Papoz et al. (9) on the HbA1c cut-off value of 6.5

Prevalence

Prevalence plays a crucial role in the correct interpretation of test results. It denotes the proportion of individuals with disease in the studied collective and is calculated as the number of individuals with disease divided by the total number of study participants.

If we consider the study by Papoz et al. (9), we obtain the following estimated prevalence:

Prevalence = 112 / 601 ≈ 0.186 = 18.6%

The 95% logit confidence interval (CI) is [15.7%; 21.9%].
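As a sketch (assuming the standard Wald interval on the logit scale with back-transformation, as described in the literature cited above), the interval can be reproduced from the study counts:

```python
import math

def logit_ci(successes, n, z=1.96):
    """Approximate 95% logit confidence interval for a proportion."""
    p = successes / n
    logit = math.log(p / (1 - p))            # estimate on the logit scale
    se = 1 / math.sqrt(n * p * (1 - p))      # standard error on the logit scale

    def expit(x):                            # back-transform into [0, 1]
        return 1 / (1 + math.exp(-x))

    return expit(logit - z * se), expit(logit + z * se)

# Prevalence in Papoz et al.: 112 of 601 participants
lo, hi = logit_ci(112, 601)  # approx. (0.157, 0.219)
```

By construction the back-transformed limits cannot leave the [0; 1] interval, which is the property recommended in the text.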

Accuracy

Accuracy is calculated from the proportion of correct results (TN and TP) out of all test results:

Accuracy = (78 + 465) / 601 ≈ 0.903 = 90.3%

The 95% logit CI is [87.7%; 92.4%].

From this it follows that 90.3% of test results were correct. However, it is not possible to assess the proportion of incorrect results among individuals who did or did not have disease, which is why this parameter is generally not recommended.

Sensitivity and specificity

Sensitivity and specificity are the most important parameters in the development of tests. These two measures indicate the proportion of individuals with or without disease in whom a correct diagnosis was made. Sensitivity is calculated as the number of true-positive test results divided by the number of individuals with disease, while specificity is calculated as the number of true-negative test results divided by the number of persons without disease.

The following values are derived for the sensitivity and specificity of the example:

Sensitivity = 78 / 112 ≈ 0.696 = 69.6%

Specificity = 465 / 489 ≈ 0.951 = 95.1%

Thus, there is a 69.6% probability that the HbA1c test will be positive if the investigated subject has type 2 diabetes (sensitivity). Conversely, the probability that the HbA1c test will be negative is 95.1% if a study participant does not have type 2 diabetes (specificity). The 95% logit CIs for sensitivity and specificity are [60.5%; 77.4%] and [92.8%; 96.7%], respectively.
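Using the cell counts from Table 2 (TP = 78, FN = 34, FP = 24, TN = 465, as implied by the quotients above), the point estimates can be reproduced directly:

```python
# Sensitivity, specificity, and accuracy from the 2x2 counts in Table 2
tp, fp, fn, tn = 78, 24, 34, 465   # Papoz et al., cut-off 6.5

sensitivity = tp / (tp + fn)                   # 78 / 112  -> approx. 0.696
specificity = tn / (tn + fp)                   # 465 / 489 -> approx. 0.951
accuracy = (tp + tn) / (tp + fp + fn + tn)     # 543 / 601 -> approx. 0.903
```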

Predictive values

While sensitivity and specificity are the recommended parameters for diagnostic test development (11), they are not informative for the patient and physician in routine practice. The true disease status is not known outside the study since the reference standard is not determined. The interesting information here is the probability that the disease is present in the case of a positive test result and absent in the case of a negative test result. These conclusions can be drawn with the help of the predictive values. These are calculated as the number of true-positive test results divided by the number of positive test results (positive predictive value, PPV) and as the number of true-negative test results divided by the number of negative test results (negative predictive value, NPV). Therefore, these values are conditional probabilities. The PPV indicates the probability of the disease being present in the case of a positive test result, whereas the NPV indicates the probability of the disease not being present in the case of a negative test result.

The following values are obtained for the example:

PPV = 78 / 102 ≈ 0.765 = 76.5%

NPV = 465 / 499 ≈ 0.932 = 93.2%

Thus, in the case of a positive HbA1c test result, the risk of suffering from type 2 diabetes is 76.5%. On the other hand, there is a 93.2% probability that type 2 diabetes is not present if the HbA1c test result is negative. The corresponding 95% logit CIs are [67.3%; 83.7%] for the PPV and [90.6%; 95.1%] for the NPV. However, these results should be viewed with caution since predictive values, unlike sensitivity and specificity, depend on prevalence.
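The predictive values follow in the same way from the column sums of Table 2 (positives n+ = 102, negatives n− = 499):

```python
# Predictive values from the column sums of Table 2
tp, fp, fn, tn = 78, 24, 34, 465   # Papoz et al., cut-off 6.5

ppv = tp / (tp + fp)   # 78 / 102  -> approx. 0.765
npv = tn / (tn + fn)   # 465 / 499 -> approx. 0.932
```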

Receiver operating characteristic curve

Diagnostic studies often evaluate not just one cut-off value to classify test positives and negatives, but several, in order to determine an optimal diagnostic threshold for clinical practice. Each threshold under evaluation is associated with its own pair of sensitivity and specificity. Papoz et al. (9) investigated a total of five different HbA1c cut-off values between 5.0 and 7.0. The corresponding sensitivity and specificity were determined for each of these values (Table 3).

Table 3
Cut-off values evaluated by Papoz et al.

The ROC curve is used to better depict the results of the study: for each cut-off value investigated, sensitivity is plotted on the y-axis and 1 − specificity on the x-axis of a graph (Figure).

Figure
ROC curve in the study by Papoz et al.

One criterion for selecting a cut-off value is the Youden index. This is calculated as the sum of sensitivity and specificity minus 1 (when both are expressed as proportions). The cut-off value with the highest Youden index is often considered to be optimal. In the example study, this would be 6.0 (underlined value in the Figure, with a Youden index of 0.672; Table 3).
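The selection rule can be sketched as follows; apart from the 6.5 pair taken from the example above, the (cut-off, sensitivity, specificity) triple for 6.0 is an illustrative stand-in, not the actual Table 3 entry:

```python
def youden(se, sp):
    """Youden index J = sensitivity + specificity - 1 (proportions)."""
    return se + sp - 1

# Cut-off 6.5 from the example study:
j_65 = youden(0.696, 0.951)  # approx. 0.647, below the 0.672 reported for 6.0

# Picking the cut-off with the largest J from candidate triples
# (the 6.0 values here are hypothetical placeholders):
candidates = [(6.0, 0.77, 0.90), (6.5, 0.696, 0.951)]
best = max(candidates, key=lambda c: youden(c[1], c[2]))  # selects cut-off 6.0
```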

In its classical form, the Youden index assumes an equal weighting of sensitivity and specificity and, thus, also an equal weighting of false-positive and false-negative test results. However, for a screening test, sensitivity in particular should be high, whereas for a confirmatory test, specificity should be high. In order to determine optimal cut-off values for these types of diagnostic tests, it is recommended that a minimum required sensitivity and specificity be determined prior to starting the study. Alternatively, a weighted Youden index can be used, whereby sensitivity or specificity are given a higher weight.

In particular, sensitivity and specificity depend on the selected cut-off value (Table 3). The higher the HbA1c cut-off value, the greater the specificity, but the lower the sensitivity. This means, conversely, that any sensitivity can be achieved if a correspondingly low specificity is accepted and vice versa. For this reason, the European and US guidelines on diagnostic agents (European Medicines Agency, EMA [11], Food and Drug Administration, FDA [12]) recommend using sensitivity and specificity as primary endpoints.

Area under the curve

The area under the curve, the AUC, is suited to comparing the overall accuracy of one or more diagnostic tests. It indicates the probability that a person with disease has a higher test value than a person without disease, assuming high values indicate the presence of the disease.

For the example study, we obtain an AUC of 91.4%, meaning that there is a 91.4% probability that individuals with type 2 diabetes will have a higher HbA1c than individuals without type 2 diabetes. The higher the AUC, the better the new diagnostic test discriminates between individuals with and individuals without disease. The maximum value for the AUC is 100%. If the AUC is 50%, the test is useless and comparable to the toss of a coin. AUC values below 50% mean that low rather than high values suggest that the disease is present.
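The probabilistic interpretation of the AUC described above can be computed empirically from individual test values as the proportion of diseased/healthy pairs in which the diseased person has the higher value (ties counting one half). The HbA1c values below are hypothetical, since the article reports only the aggregated AUC:

```python
from itertools import product

def empirical_auc(diseased, healthy):
    """Probability that a diseased person's value exceeds a healthy person's,
    with ties counting one half (Mann-Whitney interpretation of the AUC)."""
    pairs = list(product(diseased, healthy))
    wins = sum(1 for d, h in pairs if d > h)
    ties = sum(1 for d, h in pairs if d == h)
    return (wins + 0.5 * ties) / len(pairs)

# Hypothetical HbA1c values for illustration (not the study data):
auc = empirical_auc([7.0, 6.8, 5.9], [5.5, 6.0, 5.8])  # 8 of 9 pairs -> approx. 0.889
```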

Dependence of predictive values on prevalence

Predictive values, in contrast to sensitivity and specificity, depend on prevalence. This becomes apparent if we artificially modify the study results obtained by Papoz et al. (9), as in Table 4. These results might be obtained if the test is not used as a screening test in an at-risk population, but rather as a confirmatory test in individuals with suspected type 2 diabetes. To do this, we multiplied the number of individuals with type 2 diabetes (TP, FN, and n1, respectively) by 10, but left the number of individuals without type 2 diabetes unchanged. This yields a prevalence of 69.9% and the following values:

Table 4
Artificially modified result of the study by Papoz et al. (9) on the HbA1c cut-off value of 6.5

Sensitivity = 780 / 1120 ≈ 0.696 = 69.6%

Specificity = 465 / 489 ≈ 0.951 = 95.1%

PPV = 780 / 804 ≈ 0.979 = 97.9%

NPV = 465 / 805 ≈ 0.578 = 57.8%

Even after increasing the number of individuals with the disease, the sensitivity remains unchanged. However, the positive predictive value increases from 76.5% [67.3%; 83.7%] to 97.9% [96.6%; 98.7%], while the negative predictive value drops from 93.2% [90.6%; 95.1%] to 57.8% [54.4%; 61.2%]. The generally valid result becomes evident: sensitivity and specificity are independent of prevalence, but the predictive values are not. Therefore, when interpreting predictive values, the prevalence of the disease in the target population for which a new diagnostic test is intended to be used must be taken into consideration. If the study population is a representative sample of the target population and the study participants are appropriately selected, this is assured and the predictive values are interpretable. If study prevalence and target population prevalence do not match, predictive values can be determined by using Bayes’ theorem:

PPV = (Pr × Se) / (Pr × Se + (1 − Pr) × (1 − Sp))

NPV = ((1 − Pr) × Sp) / ((1 − Pr) × Sp + Pr × (1 − Se))

Here, Se and Sp denote the sensitivity and specificity of the diagnostic test under evaluation, while Pr denotes the prevalence of the disease in the target population. Assuming a prevalence of type 2 diabetes of 18.6%, as found in the study by Papoz et al. (9), we obtain the following results:

PPV = (0.186 × 0.696) / (0.186 × 0.696 + (1 − 0.186) × (1 − 0.951)) ≈ 0.765

NPV = ((1 − 0.186) × 0.951) / ((1 − 0.186) × 0.951 + 0.186 × (1 − 0.696)) ≈ 0.932

These are in agreement with the results determined on the basis of the 2 × 2 contingency table.

In order to determine the predictive value for a different target population, the prevalence can be adjusted accordingly. If we assume that the predictive values of the HbA1c test for screening type 2 diabetes are to be estimated in the entire German adult population, we would use the prevalence of type 2 diabetes in Germany, which was approximately 9.5% in 2015 (13):

PPV = (0.095 × 0.696) / (0.095 × 0.696 + (1 − 0.095) × (1 − 0.951)) ≈ 0.599

NPV = ((1 − 0.095) × 0.951) / ((1 − 0.095) × 0.951 + 0.095 × (1 − 0.696)) ≈ 0.967
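Both prevalence scenarios can be checked with a small helper function implementing Bayes' theorem as given above:

```python
def predictive_values(prevalence, se, sp):
    """PPV and NPV from prevalence, sensitivity, and specificity (Bayes' theorem)."""
    ppv = (prevalence * se) / (prevalence * se + (1 - prevalence) * (1 - sp))
    npv = ((1 - prevalence) * sp) / ((1 - prevalence) * sp + prevalence * (1 - se))
    return ppv, npv

# Study prevalence (18.6%) reproduces the 2x2-table values:
ppv, npv = predictive_values(0.186, 0.696, 0.951)    # approx. (0.765, 0.932)

# German adult population, prevalence about 9.5%:
ppv2, npv2 = predictive_values(0.095, 0.696, 0.951)  # approx. (0.599, 0.967)
```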

This means that any adult person in Germany with a positive HbA1c test would have a 59.9% probability of having type 2 diabetes and, if the test result was negative, a 96.7% probability of not having type 2 diabetes. The positive predictive value in particular needs to be viewed critically, since it implies that of 100 individuals that test positive, only around 60 actually have diabetes. As such, one would expect approximately 40 false-positive test results, which may lead to unnecessary further diagnostic tests or treatment. One should also critically question whether the extrapolation of sensitivity and specificity from the study by Papoz et al. (9) is plausible. The assumption here is that sensitivity and specificity are the same in all scenarios. However, it is conceivable that a test could, for example, differentiate individuals with and without severe disease better than those with suspected disease and mild disease. Although sensitivity and specificity do not depend on prevalence, they do depend on disease pattern. It should additionally be noted that prevalence is also determined on the basis of studies, and thus associated with uncertainty. This needs to be taken into consideration when interpreting predictive values, and underscores the importance of confidence intervals.

Discussion

Diagnostic studies are the basis for the evaluation of diagnostic tests. As such, they form the bedrock of the resulting treatment or preventive measures. The correct interpretation of results obtained in these types of studies is vital in order to be able to evaluate the benefit of a new diagnostic procedure.

We have presented the most important parameters for the interpretation of diagnostic studies. These include sensitivity and specificity, which are primarily of interest from a study perspective, since they describe the accuracy of the diagnostic test under evaluation if the “true” disease status is known and are independent of prevalence. Predictive values, on the other hand, are of particular importance from a practical and clinical perspective. These indicate the probability that a disease is present or absent in the case of a positive or negative test result. As such, they reflect the situation in everyday clinical practice, but are dependent on disease prevalence, which needs to be taken into account when interpreting the values. Even a positive result using a test with extremely high sensitivity and specificity is highly likely to be a false-positive result if prevalence is very low.

These parameters form the basis for the planning and analysis of more complex diagnostic studies (7, 14). An understanding of the measures used to evaluate a new diagnostic procedure and the critical interpretation of these measures are essential for the procedure’s practical evaluation and application.

The additional diagnostic parameters (diagnostic likelihood and odds ratios), as well as the further aspects of confirmatory diagnostic accuracy studies (for example, hypotheses and sample size determination), sources of bias, and study quality presented in the eMethods Section enable careful planning and a more differentiated evaluation of diagnostic studies.

Conflict of interests
The authors declare that no conflict of interests exists.

Manuscript submitted on 13 February 2021, revised version accepted on 26 April 2021

Translated from the original German by Christine Rye.

Corresponding author
Prof. Dr. Annika Hoyer
Institut für Statistik, Ludwig-Maximilians-Universität München
Ludwigstraße 33, 80539 München, Germany
annika.hoyer@stat.uni-muenchen.de

Cite this as
Hoyer A, Zapf A:
Studies for the evaluation of diagnostic tests—part 28 of a series
on evaluation of scientific publications. Dtsch Arztebl Int 2021; 118: 555–60.
DOI: 10.3238/arztebl.m2021.0224

Supplementary material

eReference, eMethods, eTable:
www.aerzteblatt-international.de/m2021.0224

eTable
Table listing and describing the various potential sources of bias, modified from (14)
1.
Schlenger RL: PCR-Tests auf SARS-CoV-2: Ergebnisse richtig interpretieren. Dtsch Arztebl 2020; 117: 1194.
2.
Lein I, Leuker C, Antao EM, et al.: SARS-CoV-2: Testergebnisse richtig einordnen. Dtsch Arztebl 2020; 117: 2304.
3.
Gigerenzer G, Hoffrage U, Ebert A: AIDS counselling for low-risk clients. AIDS Care 1998; 10: 197–211.
4.
Eddy DM: Probabilistic reasoning in clinical medicine: problems and opportunities. In: Kahneman D, Slovic P, Tversky A (eds.): Judgment under uncertainty: heuristics and biases. Cambridge: Cambridge University Press 1982; 249–67.
5.
Gigerenzer G, Wegwarth O: [Medical risk assessment—using the example of cancer screening]. Z Evid Fortbild Qual Gesundhwes 2008; 102: 513–9.
6.
Ellis KM, Brase GL: Communicating HIV results to low-risk individuals: still hazy after all these years. Curr HIV Res 2015; 13: 381–90.
7.
Pepe MS: The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press 2003.
8.
Agresti A: Categorical data analysis, 3rd edition. Hoboken, New Jersey: John Wiley & Sons 2013; 90–112.
9.
Papoz L, Favier F, Sanchez, et al.: Is HbA1c appropriate for the screening of diabetes in general practice? Diabetes Metab 2002; 28: 72–7.
10.
American Diabetes Association: Classification and diagnosis. Sec. 2. In: Standards of medical care in diabetes. Diabetes Care 2015; 38: 8–16.
11.
EMA 2010: Guideline on clinical evaluation of diagnostic agents. Doc. Ref. CPMP/EWP/1119/98/Rev.1. www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003580.pdf (last accessed on 3 November 2020).
12.
FDA 2004: Developing Medical Imaging Drug and Biological Products Part 3: Design, Analysis, and Interpretation of Clinical Studies. www.fda.gov/regulatory-information/search-fda-guidance-documents/developing-medical-imaging-drug-and-biological-products-part-3-design-analysis-and-interpretation (last accessed on 3 November 2020).
13.
Goffrier B, Schulz M, Bätzing-Feigenbaum J: Administrative Prävalenzen und Inzidenzen des Diabetes mellitus von 2009 bis 2015. Versorgungsatlas-Bericht Nr. 17/03. Berlin: Zentralinstitut für die kassenärztliche Versorgung in Deutschland (Zi) 2017.
14.
Zhou XH, McClish DK, Obuchowski NA: Statistical methods in diagnostic medicine (Vol. 569). New York: John Wiley & Sons 2011.
e1.
Korevaar DA, Gopalakrishna G, Cohen JF, Bossuyt PM: Targeted test evaluation: a framework for designing diagnostic accuracy studies with clear study hypotheses. Diagn Progn Res 2019; 3: 22.
e2.
Stark M, Zapf A: Sample size calculation and re-estimation based on the prevalence in a single-arm confirmatory diagnostic accuracy study. Stat Methods Med Res 2020; 29: 2958–71.
e3.
Newcombe RG: Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med 1998; 17: 857–72.
e4.
Newcombe RG: Improved confidence intervals for the difference between binomial proportions based on paired data. Stat Med 1998; 17: 2635–50.
e5.
STARD (2015): An updated list of essential items for reporting diagnostic accuracy studies. www.equator-network.org/reporting-guidelines/stard (last accessed on 1 July 2021).
e6.
Rabe-Hesketh S, Skrondal A: Multilevel and longitudinal modeling using Stata. Volume II: Categorical responses, counts, and survival. College Station: Stata Press 2008.
e7.
Oosterhuis WP, Venne WPV, Deursen CTV, Stoffers HE, Acker BAV, Bossuyt PM: Reflective testing – a randomized controlled trial in primary care patients. Ann Clin Biochem 2021; 58: 78–85.
e8.
van den Berk IAH, Kanglie MMNP, van Engelen TSR, et al.: OPTimal IMAging strategy in patients suspected of non-traumatic pulmonary disease at the emergency department: chest X-ray or ultra-low-dose CT (OPTIMACT) – a randomised controlled trial chest X-ray or ultra-low-dose CT at the ED: design and rationale. Diagn Progn Res 2018; 2: 20.
e9.
Aviv JE: Prospective, randomized outcome study of endoscopy versus modified barium swallow in patients with dysphagia. Laryngoscope 2000; 110: 563–74.
e10.
Fryback DG, Thornbury JR: The efficacy of diagnostic imaging. Med Decis Making 1991; 11: 88–94.
e11.
Koebberling J, Trampisch HJ, Windeler J: Memorandum for the evaluation of diagnostic measures. J Clin Chem Clin Biochem 1990; 28: 873–9.
e12.
Lu B, Gatsonis C: Efficiency of study designs in diagnostic randomized clinical trials. Stat Med 2013; 32: 1451–66.
e13.
Zapf A, Stark M, Gerke O, et al.: Adaptive trial designs in diagnostic accuracy research. Stat Med 2020; 39: 591–601.
e14.
Vach W, Bibiza E, Gerke O, Bossuyt PM, Friede T, Zapf A: A potential for seamless designs in diagnostic research could be identified. J Clin Epidemiol 2020; 129: 51–9.
e15.
Gerke O, Høilund-Carlsen PF, Poulsen MH, Vach W: Interim analyses in diagnostic versus treatment studies: differences and similarities. Am J Nucl Med Mol Imaging 2012; 2: 344–52.
e16.
Mazumdar M, Liu A: Group sequential design for comparative diagnostic accuracy studies. Stat Med 2003; 22: 727–39.
e17.
Chu H, Cole SR: Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach. J Clin Epidemiol 2006; 59: 1331–2.
e18.
Rutter CM, Gatsonis CA: A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001; 20: 2865–84.
e19.
Biondi-Zoccai (ed.): Diagnostic meta-analysis – a useful tool for clinical decision-making. Cham: Springer 2019.
Department of Statistics, Ludwig-Maximilians-University Munich: Prof. Dr. Annika Hoyer
Institute of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf: Prof. Dr. Antonia Zapf