Keywords: Bayes’ theorem, clinical, diagnosis, epidemiology, medical test, overall accuracy, screening

Published online 2018 October 4. doi: 10.5041/RMMJ.10351. Special Issue Celebrating the 80th Anniversary of Rambam Health Care Campus.
Detection and Diagnostic Overall Accuracy Measures of Medical Tests

¹Department of Radiology, University of British Columbia, Vancouver, British Columbia, Canada; ²School of Public Health, Faculty of Social Welfare and Health Sciences, University of Haifa, Haifa, Israel

Copyright: © Grunau and Linn. This is an open-access article.

Background Overall accuracy measures of medical tests are often used with unclear interpretations.
Objectives To develop methods of calculating the overall accuracy of medical tests in the patient population.
Methods Algebraic equations based on Bayes’ theorem.
Results A new approach is proposed for calculating overall accuracy in the patient population. Examples and applications using published data are presented.
Conclusions The overall accuracy is the proportion of correct test results. We introduce a clear distinction between the overall accuracy of tests aimed at detecting a disease when screening general populations for public health purposes and the overall accuracy of tests aimed at determining a diagnosis in individual patients in a clinical setting. We show that the overall detection accuracy is obtained in a specific study that explores test performance among persons with known diagnoses and may be useful for public health screening tests. It differs from the overall diagnostic accuracy, which can be calculated in the clinical setting to evaluate medical tests aimed at determining individual patients’ diagnoses. We show that the overall detection accuracy is constant and is not affected by the prevalence of the disease. In contrast, the overall diagnostic accuracy changes with, and depends on, the prevalence, and its behavior is governed by the ratio between the sensitivity and specificity. Thus, when the sensitivity is greater than the specificity, the overall diagnostic accuracy increases with increasing prevalence; conversely, when the sensitivity is lower than the specificity, the overall diagnostic accuracy decreases with increasing prevalence, so that another test might be more useful for diagnostic purposes. Our paper suggests a new and more appropriate methodology for estimating the overall diagnostic accuracy of any medical test. This may be important for helping clinicians avoid errors.

The accuracy of medical tests is important for minimizing errors and their possible sequelae. “Accuracy of a diagnostic test” is a term that is frequently used loosely to describe the evaluation of a medical test against a gold standard (Alberg et al.). We suggest that a clear distinction should be made between the overall accuracy measures of a test aimed at the detection of a disease in population screening and those of a test aimed at determining a diagnosis in an individual patient in a clinical setting.

DETECTION MEASURES IN A SELECTED STUDY POPULATION (TABLE 1)
The assessment of a diagnostic test is frequently based on a study in a selected population, sampled according to the disease status as determined by the gold standard. The study is used for calculating the sensitivity and specificity (see Table 1). The numbers of persons with (S_POS) and without (S_NEG) the disease are fixed by the study design, and thus only the totals in the columns are meaningful. The data in this table describe the test performance among already diagnosed persons (with or without a disease). These data are important for detecting a disease in a population and are useful in a public health setting and for decision making. For example, one may evaluate how many of the sick and healthy persons would be detected by a test for a disease among passengers in a transportation vehicle, and thus assess the resources needed in various public health and disease control settings. Such data are useful for choosing the appropriate (that is, the most efficient and least costly) test in a given population with a known and constant disease prevalence.
Measures in Table 1

Sensitivity is defined as a/(a+c), which is the probability (P) of the test correctly identifying as test-positive (T_POS) a patient with the sickness (S_POS). This is the proportion of correct positive diagnoses among all patients with the disease (Table 1). Specificity is defined as d/(b+d), which is the proportion of correct negative test-based diagnoses (T_NEG) among all healthy individuals without the sickness (S_NEG).
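As a minimal sketch of the Table 1 measures, the two proportions can be computed directly from the four cell counts; the counts below are hypothetical, chosen only for illustration.

```python
# Sensitivity and specificity from a Table 1-style study.
# Cell counts follow the table's notation:
#   a = sick, test-positive      c = sick, test-negative
#   b = healthy, test-positive   d = healthy, test-negative

def sensitivity(a, c):
    """Proportion of correct positive results among the sick: a/(a+c)."""
    return a / (a + c)

def specificity(b, d):
    """Proportion of correct negative results among the healthy: d/(b+d)."""
    return d / (b + d)

# Hypothetical counts: 90 of 100 sick test positive,
# 80 of 100 healthy test negative.
a, b, c, d = 90, 20, 10, 80
print(sensitivity(a, c))  # 0.9
print(specificity(b, d))  # 0.8
```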
Note that the prevalence of the disease in Table 1 is artificially determined by the researcher, according to the numbers of persons with the disease (a+c) and without the disease (b+d) who are recruited. The (artificial) study prevalence in Table 1 is thus:

(a + c) / (a + b + c + d)
Note that this is not the disease prevalence in the patient population of interest but rather that in the specific study population, determined solely by the researcher. These numbers are artificially set in specific studies in which persons with and without a known diagnosis of a disease are sampled, and they may be influenced by a myriad of considerations, including budget, availability of patients, convenience, and time limitations. Thus, the sensitivity and specificity, by themselves, are not informative for clinicians, as they are measured in artificial data and have no direct relevance to the diagnosis or treatment of individual patients.

“Overall Detection Accuracy” of a Diagnostic Test Calculable in a Specific Study Population (Table 1)

The overall accuracy of a diagnostic test is commonly calculated in a specific study as:

(a + d) / (a + b + c + d)
This overall accuracy measure indicates the overall detection of persons with or without a disease in a population. It indicates how many persons with and without a disease could be correctly identified, and it depends on the disease prevalence in the specific sample used, which is artificially determined and may differ from the true prevalence of the disease in the entire study population. Thus, it is not necessarily transferable to other populations with a different prevalence of the disease. This measure can be written in another way (for the derivation, see the Additional Material). Let us observe the (artificial) disease prevalence odds (x) in the study:

x = (a + c) / (b + d)
Thus,

overall detection accuracy = (sensitivity × x + specificity) / (x + 1)
It follows that the commonly used “accuracy” or “overall accuracy” measure is in fact a weighted average of the sensitivity and specificity, with weights that are the artificially determined numbers of persons with a disease (a+c) and without a disease (b+d). To demonstrate this, let us consider three situations:

- The first is a study with an equal number of persons with and without a disease, a+c = b+d, and thus x = 1 (e.g. 100 sick and 100 healthy persons are studied). The (artificial) prevalence in such a study is 50%, which is rarely the true prevalence of the disease in the population of interest. In such a study, the overall detection accuracy will in fact be the average of the sensitivity and specificity.
- If a disease is rare, that is, if a study is designed with more persons without than with a disease, and thus x < 1, the resulting overall accuracy measure is more heavily dependent on the specificity.
- Conversely, for a common disease, a study designed with more persons with than without a disease, and thus x > 1, will lead to an overall accuracy measure that is more heavily dependent on the sensitivity.
Thus, the size of the study groups leads to a biased and potentially misleading measure of the “overall accuracy” if calculated based on Table 1.
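The weighted-average behavior described above can be checked numerically. The sketch below assumes a hypothetical test with sensitivity 0.6 and specificity 0.9 and varies only the study group sizes.

```python
# Overall detection accuracy as a weighted average of sensitivity
# and specificity, weighted by the (artificial) study group sizes.
def overall_detection_accuracy(se, sp, n_sick, n_healthy):
    x = n_sick / n_healthy          # artificial disease odds of the study
    return (se * x + sp) / (x + 1)  # algebraically equal to (a+d)/(a+b+c+d)

se, sp = 0.6, 0.9  # hypothetical test

# x = 1: equal groups, the measure is the plain average of Se and Sp
print(overall_detection_accuracy(se, sp, 100, 100))  # 0.75

# x < 1: many more healthy than sick, pulled toward the specificity
print(overall_detection_accuracy(se, sp, 10, 100))   # ~0.873

# x > 1: many more sick than healthy, pulled toward the sensitivity
print(overall_detection_accuracy(se, sp, 100, 10))   # ~0.627
```

The same sensitivity and specificity thus yield very different “overall accuracy” values depending only on the researcher's choice of group sizes.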

The overall detection accuracy mentioned above depends on an artificial prevalence of the disease, as in Table 1, and thus is not applicable to an individual in a patient population. Instead, the ability of a test to diagnose a disease, or the absence of a disease, is evaluated in a different table that is relevant to the general patient population and the physician (Table 2). In this situation, the population is sampled according to the test results, whether positive or negative.
Measures in the Patient Population

The measure of interest for health providers, physicians, and patients alike is usually the positive predictive value (PPV) or negative predictive value (NPV) of the test. The vertical line (|) denotes “given,” and thus P(S_POS | T_POS) denotes the probability of being sick, S_POS, given that the test is positive, T_POS.
The PPV is defined as:

PPV = P(S_POS | T_POS)
Similarly, the NPV is defined as the success percentage when the clinical test is used to diagnose the absence of a disease:

NPV = P(S_NEG | T_NEG)
Frequently, we do not have the information needed to construct Table 2 or to calculate the PPV and the NPV directly, because it is often unfeasible or unethical to perform both the diagnostic test and an additional, more invasive definitive test to determine the true diagnosis according to the gold standard (e.g. the results of a stress test would not always justify cardiac catheterization). The translation of information on sensitivity and specificity to PPV or NPV, that is, the calculation of Table 2 from the data in Table 1, must be done using an equation based on Bayes’ theorem that uses the clinician’s prior knowledge of the probability of a disease (based on the prevalence) to calculate the probability that a test yields correct results. This equation is based on the true prevalence, P(S_POS), of the sickness in the population (Equation 7):

PPV = [sensitivity × P(S_POS)] / [sensitivity × P(S_POS) + (1 − specificity) × (1 − P(S_POS))]
Note also that,

P(T_POS) = sensitivity × P(S_POS) + (1 − specificity) × (1 − P(S_POS))
Similarly,

NPV = [specificity × (1 − P(S_POS))] / [specificity × (1 − P(S_POS)) + (1 − sensitivity) × P(S_POS)]
Let us note also that,

P(T_NEG) = specificity × (1 − P(S_POS)) + (1 − sensitivity) × P(S_POS)
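The Bayes’-theorem translation from sensitivity and specificity to the predictive values can be sketched as follows; the sensitivity, specificity, and prevalence values used here are hypothetical, chosen only for illustration.

```python
# PPV and NPV from sensitivity (se), specificity (sp), and the true
# prevalence (prev), via Bayes' theorem.
def ppv(se, sp, prev):
    """P(sick | test positive)."""
    return (se * prev) / (se * prev + (1 - sp) * (1 - prev))

def npv(se, sp, prev):
    """P(healthy | test negative)."""
    return (sp * (1 - prev)) / (sp * (1 - prev) + (1 - se) * prev)

se, sp = 0.6, 0.9  # hypothetical test
for prev in (0.1, 0.5, 0.9):
    print(prev, round(ppv(se, sp, prev), 3), round(npv(se, sp, prev), 3))
```

Note how the same test yields very different predictive values as the prevalence changes, which is exactly why the clinician's prior knowledge of the prevalence enters the calculation.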
A Clinical Measure of Overall Accuracy: the Overall Diagnostic Accuracy Measure Calculable in the Patient Population

To estimate the average success percentage of diagnosing a disease correctly in a person in the patient population, we should calculate the overall diagnostic accuracy, which describes the accuracy of our ability to diagnose correctly a disease, or the absence of the disease, in the patient population. This is calculable in Table 2 as the percentage of correct diagnoses yielded by the test:

overall diagnostic accuracy = PPV × P(T_POS) + NPV × P(T_NEG)

We now show that the diagnostic accuracy is based on the patient population disease prevalence together with the sensitivity and specificity. This leads to an equation that has already been developed by Alberg et al.

Application of Sensitivity in the Patient Population

The number of people with a disease who would be detected by a test in the patient population is obtained by multiplying the probability of detecting a person with a disease (the sensitivity) by the true disease prevalence in the patient population.
Thus,

P(T_POS and S_POS) = sensitivity × P(S_POS)
Application of Specificity in the Patient Population

The number of people without a disease who would be detected by a test in the patient population is obtained by multiplying the probability of detecting a person without a disease (the specificity) by the true prevalence of non-disease (which is 1 − prevalence) in the patient population.
Thus,

P(T_NEG and S_NEG) = specificity × (1 − P(S_POS))
Overall Diagnostic Accuracy Expressed by the Sensitivity, Specificity, and the Prevalence

Thus, we can derive the overall diagnostic accuracy of the test in the patient population as the sum of these two probabilities:

overall diagnostic accuracy = sensitivity × P(S_POS) + specificity × (1 − P(S_POS))
For illustration, according to this equation the overall diagnostic accuracy ranges between the specificity (when the prevalence is 0) and the sensitivity (when the prevalence is 1). When the prevalence is 50%, the overall diagnostic accuracy is the average of the sensitivity and specificity.

Inter-relationship of Prevalence, Sensitivity, and Specificity

From Equation 14, we obtain Equation 15:

overall diagnostic accuracy = specificity + prevalence × (sensitivity − specificity)
Thus, for a test with a given sensitivity and specificity, there are three possible situations, depending on the prevalence:

- When sensitivity > specificity, the overall diagnostic accuracy increases with increasing prevalence.
- When sensitivity < specificity, the overall diagnostic accuracy decreases with increasing prevalence.
- When sensitivity = specificity, the overall diagnostic accuracy is constant, and equals the specificity or the sensitivity, at any prevalence.
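The three situations can be verified numerically: the overall diagnostic accuracy of Equation 14 is linear in the prevalence, with slope equal to the difference between sensitivity and specificity. The sensitivity/specificity pairs below are hypothetical.

```python
# Overall diagnostic accuracy (Equation 14) as a function of prevalence:
#   se*P + sp*(1-P)  =  sp + P*(se - sp),  linear in P with slope se - sp.
def diagnostic_accuracy(se, sp, prev):
    return se * prev + sp * (1 - prev)

# Hypothetical pairs covering the three situations.
for se, sp in [(0.9, 0.6), (0.6, 0.9), (0.8, 0.8)]:
    accs = [round(diagnostic_accuracy(se, sp, p), 2) for p in (0.1, 0.5, 0.9)]
    print(f"se={se}, sp={sp}: accuracy at prevalence 0.1/0.5/0.9 -> {accs}")
```

With se > sp the printed accuracies rise with prevalence, with se < sp they fall, and with se = sp they stay constant.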
Demonstration that Equation 14 is Identical to Equation 11

From Equation 7, we obtain Equation 16:

PPV × P(T_POS) = sensitivity × P(S_POS)
From Equation 9, we obtain Equation 17:

NPV × P(T_NEG) = specificity × (1 − P(S_POS))
Thus, by combining Equation 16 and Equation 17, we obtain Alberg et al.’s equation (Eq. 18):

sensitivity × P(S_POS) + specificity × (1 − P(S_POS)) = PPV × P(T_POS) + NPV × P(T_NEG)
Thus, substituting P(T_NEG) = 1 − P(T_POS), we obtain Equation 19:

overall diagnostic accuracy = PPV × P(T_POS) + NPV × (1 − P(T_POS))
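The identity between the two forms of the overall diagnostic accuracy (the sensitivity/specificity form of Equation 14 and the PPV/NPV form of Equation 11) can be confirmed numerically; the values below are hypothetical.

```python
# Numerical check that se*P + sp*(1-P) equals PPV*P(Tpos) + NPV*P(Tneg),
# using hypothetical sensitivity, specificity, and prevalence.
se, sp, prev = 0.6, 0.9, 0.3

p_tpos = se * prev + (1 - sp) * (1 - prev)   # P(T_POS)
p_tneg = sp * (1 - prev) + (1 - se) * prev   # P(T_NEG)
ppv = se * prev / p_tpos                     # Bayes' theorem
npv = sp * (1 - prev) / p_tneg

eq14 = se * prev + sp * (1 - prev)           # sensitivity/specificity form
eq11 = ppv * p_tpos + npv * p_tneg           # PPV/NPV (Table 2) form
print(abs(eq14 - eq11) < 1e-9)               # True
```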
An explanation of how to estimate the difference between the two measures of overall accuracy is provided in the Additional Material.

As has been explained, the overall detection accuracy of a test, calculable using the data of a specific study (Table 1), is not applicable to the patient population, because the prevalence of the disease is artificial and depends on the number of persons with and without a disease who are recruited to a specific study, a choice made by the researcher according to cost, sample availability, and practical considerations. In contrast, the data in Table 2 are of interest to the patient (and the physician). These data serve to answer the following clinical questions. When the test is positive, what is the probability that the patient has the disease? (Answerable by the PPV, Equation 5.) When the test is negative, what is the probability that the patient does not have the disease? (Answerable by the NPV, Equation 6.) Regarding the test in the patient population, the clinical question is: what is the overall diagnostic accuracy? This question is answerable by our newly suggested measure in Equation 11. In contrast to the overall detection accuracy, which is based on an artificially determined prevalence in a specific study and thus may be meaningless, we suggest calculating the overall diagnostic accuracy using the true prevalence in the patient population; only when the study prevalence equals the true prevalence is the detection accuracy identical to the diagnostic accuracy (see the Additional Material).

Let us consider a well-known example given by Sackett et al. Originally, the example was designed to demonstrate the importance of prevalence for determining the predictive values of a test. Table 3 displays the data originally given by Sackett et al. (Table 10 in their book). The sensitivity is 60.35% and the specificity is 91.06%, and the calculated “overall detection accuracy” is 71.14%, regardless of the prevalence in the patient population. Note that the prevalence in this particular example is 227/350 = 64.9%. However, this is an arbitrary and artificial prevalence, determined by researchers in a specific study, and it does not reflect the real prevalence in potential patient populations I, II, or III. Had the researchers chosen a different prevalence for their study, the calculated accuracy would be different. Thus, the overall detection accuracy above is neither informative nor suitable for evaluating a test in a patient population having a different disease prevalence. Using the above data, we can calculate an appropriate Table 2 for each specific patient population using its true prevalence (see the Additional Material). Table 3 demonstrates that the overall diagnostic accuracy of the test (ECG) varies with the prevalence used. The diagnostic accuracy is appropriate for each of the potential patient populations having a different prevalence of the disease, and may be clinically useful for the physician and the patient.
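The example can be worked through in a short sketch. The cell counts below are reconstructed from the reported totals (227 diseased, 123 healthy) and the reported sensitivity (60.35%) and specificity (91.06%); the prevalences in the loop are illustrative, since the exact prevalences of populations I, II, and III are given only in Table 3.

```python
# Sackett et al.'s stress-ECG example (counts reconstructed from the
# reported sensitivity, specificity, and group totals).
a, c = 137, 90     # diseased persons: test-positive (a), test-negative (c)
b, d = 11, 112     # healthy persons: test-positive (b), test-negative (d)

se = a / (a + c)                       # sensitivity, ~0.6035
sp = d / (b + d)                       # specificity, ~0.9106
detection = (a + d) / (a + b + c + d)  # overall detection accuracy, ~0.7114
print(round(se, 4), round(sp, 4), round(detection, 4))

# The detection accuracy above is fixed by the study design. The overall
# diagnostic accuracy, in contrast, shifts with the true prevalence of
# the patient population (illustrative prevalences):
for prev in (0.1, 0.5, 0.9):
    print(prev, round(se * prev + sp * (1 - prev), 4))
```

Because sensitivity < specificity here, the printed diagnostic accuracy falls as the prevalence rises.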

Prostate cancer is common and a frequent cause of cancer death. In the United States, prostate cancer is the most commonly diagnosed visceral cancer; in 2017, there were expected to be approximately 161,000 new prostate cancer diagnoses and approximately 26,700 prostate cancer deaths. The traditional cutoff for an abnormal PSA level in major screening studies was 4.0 ng/mL. The American Cancer Society (ACS) systematically reviewed the studies in the literature that assessed the PSA test performance. We thus used the above estimates of the sensitivity and specificity and a prevalence estimate of 40% at age 50 or 80% at age 70 to calculate the overall diagnostic accuracy of the PSA test (at a cutoff level of 4 ng/mL). Table 4 demonstrates that the overall diagnostic accuracy of PSA declines dramatically from 63% at age 50 to 35% at age 70. It is thus a significantly less effective test for detecting prostate cancer in older patients. This decline in the overall diagnostic accuracy conforms with Equation 12, which predicts a decline in the overall diagnostic accuracy when the sensitivity (21% for PSA) is lower than the specificity (91% for PSA).
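The Table 4 figures follow directly from the sensitivity (21%) and specificity (91%) cited in the text and the two prevalence estimates:

```python
# Overall diagnostic accuracy of the PSA test (cutoff 4.0 ng/mL),
# using the sensitivity/specificity and prevalence estimates from the text.
se, sp = 0.21, 0.91

accuracies = {}
for age, prev in ((50, 0.40), (70, 0.80)):
    accuracies[age] = round(se * prev + sp * (1 - prev), 2)
    print(age, accuracies[age])
# age 50 -> 0.63; age 70 -> 0.35
```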

It is important to use accurate medical tests and thus avoid errors and unnecessary suffering and expenses. As already noted by Alberg et al., the “overall accuracy” measure is often used with unclear interpretation. Our manuscript addresses this problem and suggests a clear distinction between the overall detection accuracy, measurable in a specific study population, and the overall diagnostic accuracy, calculable in the patient population. Our approach adds to the current literature in that it may clarify the use and interpretation of test results and could avoid the confusion that may result from ignoring the disease prevalence when measuring a test’s overall accuracy. Correct evaluation of the accuracy of medical tests may be important for helping clinicians avoid errors.


2. Hirsch RP, Riegelman RK. Statistical Operations: Analysis of Health Research Data. Oxford: Blackwell Science; 1996.
6. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology. 2nd ed. Boston, MA: Little Brown; 1991.
7. Kraemer HC. Evaluation of Medical Tests: Objective and Quantitative Guidelines. London: Sage Publications; 1992.
8. Weiss NS. Clinical Epidemiology: The Study of the Outcome of Illness. Oxford: Oxford University Press; 1996.
9. Riegelman RK. Studying a Study and Testing a Test: How to Read the Medical Evidence. 4th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
10. Knottnerus JA, van Weel C. General introduction: evaluation of diagnostic procedures. In: Knottnerus JA, editor. The Evidence Base of Clinical Diagnosis. London: BMJ Books; 2002. pp. 1–18.
11. Sackett DL, Haynes RB. The architecture of diagnostic research. In: Knottnerus JA, editor. The Evidence Base of Clinical Diagnosis. London: BMJ Books; 2002. pp. 19–38.
12. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press; 2003. (Oxford Statistical Science Series 28).
13. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: Wiley-Interscience; 2002. https://doi.org/10.1002/9780470317082.
14. Rothman KJ, Lash TL, Greenland S. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.
15. Linn S. A new conceptual approach to teaching the interpretation of clinical tests. Journal of Statistics Education. 2004;12:3. https://doi.org/10.1080/10691898.2004.11910632.
16. Linn S, Grunau DP. New patient-oriented summary measure of net total gain in certainty for dichotomous diagnostic tests. Epidemiol Perspect Innov. 2006;3:11. https://doi.org/10.1186/1742-5573-3-11.
17. Alberg AJ, Park JW, Hager BW, Brock MV, Diener-West M. The use of “overall accuracy” to evaluate the validity of screening or diagnostic tests. J Gen Intern Med. 2004;19:460–5. https://doi.org/10.1111/j.1525-1497.2004.30091.x.
18. Fardy JM. Evaluation of a diagnostic test. In: Parfrey P, Barrett B, editors. Clinical Epidemiology: Practice and Methods. Springer Protocols. New York: Humana Press; 2010. pp. 137–54.
20. Eusebi P. Diagnostic accuracy measures. Cerebrovasc Dis. 2013;36:267–72. https://doi.org/10.1159/000353863.
21. Hoffman RM. Screening for prostate cancer. In: Elmore JG, O’Leary MP, editors. UpToDate. Jul 2018 [accessed July 2018]. Available at: http://bit.ly/2PdBJV9.
22. Wolf AM, Wender RC, Etzioni RB, et al.; American Cancer Society Prostate Cancer Advisory Committee. American Cancer Society guideline for the early detection of prostate cancer: update 2010. CA Cancer J Clin. 2010;60:70–98. https://doi.org/10.3322/caac.20066.