When diagnostic tests are evaluated, the critical question is whether the test can provide useful information to patients or clinicians. A highly accurate test can detect disease in an early, treatable stage; determine the correct therapeutic approach; or reassure someone that they do not have a disease. In contrast, a test that is inaccurate is not only uninformative but also potentially harmful. Therefore, the evaluation of diagnostic tests focuses on various measures of accuracy.
The accuracy of new or improved diagnostics must be assessed against a reference or gold standard. This standard should be the best currently available method for diagnosing patients and should be an established practice within the medical community. Measures of accuracy assess the agreement between the new test and the reference standard.
For a qualitative test, we can start to assess accuracy by constructing a simple 2 x 2 table (Table 1). The numbers of patients testing positive and negative with the reference standard are subdivided by the results of the new test. A true positive is someone with the condition (as determined by the reference standard) who tests positive with the new test. A true negative is someone who does not have the condition and tests negative with the new test. In other words, these are the people who were correctly categorized by the new test. When cases are misclassified by the new test, they are called false positives or false negatives.
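As a minimal illustration of how the four cells are tallied, the short Python sketch below counts true positives, false negatives, false positives, and true negatives from paired results for a handful of hypothetical subjects (the data and variable names are invented for illustration, not taken from Table 1):

    # Paired results: True means a positive result for that subject.
    reference = [True, True, True, False, False, False]   # reference (gold) standard
    new_test  = [True, True, False, True, False, False]   # new test under evaluation

    tp = sum(r and t for r, t in zip(reference, new_test))          # true positives
    fn = sum(r and not t for r, t in zip(reference, new_test))      # false negatives
    fp = sum(not r and t for r, t in zip(reference, new_test))      # false positives
    tn = sum(not r and not t for r, t in zip(reference, new_test))  # true negatives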
Sensitivity and Specificity
Sensitivity and specificity are the most commonly reported measures of diagnostic accuracy. Sensitivity is the ability of the test to correctly identify people who have the condition. Table 2 is a sample 2 x 2 table with mock data. Only the people who have the condition per the reference standard are included in the calculation of sensitivity (purple cells). Eighty-eight people have the condition in this example (60 + 28), and 60 were correctly identified by the new test. Therefore, the sensitivity of the new test is 60/88 = 0.68 (or 68%).
Specificity is the ability of the test to correctly identify those who do not have the disease or condition. Only the people who do not have the condition per the reference standard are included in the calculation (Table 2, green cells). Seven hundred and fifty people did not have the condition in this case (159 + 591), and 591 were correctly identified by the new test. Therefore, the specificity of the new test is 591/750 = 0.79 (or 79%).
Sensitivity and specificity should never be presented as a proportion or percent only: Confidence intervals must always be provided to help the reader interpret the results. In our example case, the sensitivity is 0.68 (95% CI: 0.57–0.77), and the specificity is 0.79 (95% CI: 0.76–0.82). Note that the interval is narrower for specificity because the number of patients without the condition is much higher. In addition, sensitivity and specificity should be reported both as percentages (e.g., 68%) and fractions (e.g., 60/88).
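For readers who want to reproduce calculations like these, a minimal Python sketch follows, using the mock counts from Table 2. The Wilson score interval is used here as one common choice of interval method; other methods give slightly different limits:

    from statsmodels.stats.proportion import proportion_confint

    tp, fn = 60, 28    # people with the condition (per the reference standard)
    tn, fp = 591, 159  # people without the condition

    sensitivity = tp / (tp + fn)   # 60/88 ≈ 0.68
    specificity = tn / (tn + fp)   # 591/750 ≈ 0.79

    # 95% Wilson score confidence intervals
    sens_ci = proportion_confint(tp, tp + fn, alpha=0.05, method="wilson")
    spec_ci = proportion_confint(tn, tn + fp, alpha=0.05, method="wilson")
    print(f"Sensitivity {sensitivity:.2f} ({sens_ci[0]:.2f}-{sens_ci[1]:.2f})")
    print(f"Specificity {specificity:.2f} ({spec_ci[0]:.2f}-{spec_ci[1]:.2f})")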
Positive and Negative Predictive Values
Calculating the positive predictive value or negative predictive value of a new test can help us understand how informative the test will be in a clinical situation. Positive predictive value (PPV) is the probability that a person testing positive for the condition with the new test actually has the condition. In other words, how often do women with suspicious findings on a mammogram actually have breast cancer? PPV is calculated using only the subset of people testing positive with the new test (Table 3, yellow cells). In our example, 219 people tested positive with the new test, and 60 of these actually have the condition. Therefore, the PPV is 60/219 = 0.27 (95% CI: 0.22–0.34). Again, this value must be presented with a confidence interval.
Negative predictive value (NPV) is the probability that a person testing negative for the condition with the new test really does not have the condition. In other words, how often do women with normal mammograms really not have breast cancer? NPV is calculated using only the subset of people testing negative with the new test (Table 3, pink cells). In our example, 619 people tested negative, and 591 of these did not have the condition. Therefore, the NPV is 591/619 = 0.95 (95% CI: 0.93–0.97).
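The predictive values come from the other margin of the same 2 x 2 table. A brief Python sketch with the same mock counts (again using Wilson intervals as one reasonable choice):

    from statsmodels.stats.proportion import proportion_confint

    tp, fp = 60, 159   # results among the 219 people testing positive with the new test
    tn, fn = 591, 28   # results among the 619 people testing negative

    ppv = tp / (tp + fp)   # 60/219 ≈ 0.27
    npv = tn / (tn + fn)   # 591/619 ≈ 0.95
    ppv_ci = proportion_confint(tp, tp + fp, method="wilson")   # roughly 0.22-0.34
    npv_ci = proportion_confint(tn, tn + fn, method="wilson")   # roughly 0.93-0.97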
In our sample test, the PPV (0.27) may seem low and the NPV (0.95) may seem high. But remember the application. An NPV of 0.95 means that 1 in 20 people who test negative actually has the condition, so if the test is applied to the general population, many people with the condition will be missed. Depending on how quickly the disease develops, what alternative tests are available, and how the disease is treated, a test with this performance may or may not be acceptable.
PPV and NPV are influenced by the prevalence of the disease in the tested population. If a study enrolls 100 people with a rare disease and 100 controls, a participant who tests positive for the rare disease is far more likely to have the disease than someone testing positive in a random sample of the general population. Therefore, the PPV and NPV observed in the study will not be the same as in the general population. In our example in Table 3, the NPV is 0.95. If the sensitivity and specificity are kept the same, but twice as many people with the disease are enrolled, the NPV drops to 0.91. In some studies, information on the prevalence of the condition in the general population is used to calculate an adjusted NPV and PPV.
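The adjustment follows from Bayes' theorem: the PPV and NPV expected at a given prevalence can be computed from sensitivity and specificity alone. A minimal Python sketch (the function name and the 1% prevalence are invented for illustration):

    def adjusted_predictive_values(sensitivity, specificity, prevalence):
        """Expected PPV and NPV at a given disease prevalence (Bayes' theorem)."""
        ppv = (sensitivity * prevalence) / (
            sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
        npv = (specificity * (1 - prevalence)) / (
            specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
        return ppv, npv

    # With the sensitivity and specificity from Table 2 and a hypothetical
    # general-population prevalence of 1%, the PPV falls and the NPV rises.
    print(adjusted_predictive_values(60 / 88, 591 / 750, 0.01))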
Likelihood Ratios
Like negative and positive predictive values, likelihood ratio pairs can be used to demonstrate the value of a new diagnostic test. The positive likelihood ratio is the probability that a person with the disease will test positive for the disease, divided by the probability that a person without the disease will test positive. In other words, it is the rate of true positives divided by the rate of false positives. The positive likelihood ratio will be greater than one for an effective test, with more effective tests having higher likelihood ratios.
In combination with the pretest probability of disease (primarily influenced by prevalence), the positive likelihood ratio can provide a clinician with the odds that the patient has the disease. In our example (Table 2), the positive likelihood ratio is 3.2 (95% CI: 2.6–3.9), so a positive result increases the pretest odds that an individual actually has the condition 3.2-fold. This would have a small effect on the posttest probability of disease (Table 4).
The negative likelihood ratio is the rate of false negatives divided by the rate of true negatives. An effective test will have a negative likelihood ratio less than one. In our example, the negative likelihood ratio is 0.40 (95% CI: 0.30–0.55) and would moderately decrease the posttest probability of the disease (Table 4). For comparison, a CT scan for diagnosing a condition such as appendicitis would have a negative likelihood ratio of about 0.05.
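To make the arithmetic concrete, the sketch below computes both likelihood ratios from the Table 2 counts and converts a pretest probability to a posttest probability by way of odds. The 30% pretest probability is an invented example, not a value from Table 4:

    tp, fn = 60, 28    # with the condition (per the reference standard)
    tn, fp = 591, 159  # without the condition

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    lr_pos = sensitivity / (1 - specificity)   # ≈ 3.2: true positive rate / false positive rate
    lr_neg = (1 - sensitivity) / specificity   # ≈ 0.40: false negative rate / true negative rate

    # Posttest probability after a positive result, assuming a 30% pretest probability
    pretest_p = 0.30
    pretest_odds = pretest_p / (1 - pretest_p)
    posttest_odds = pretest_odds * lr_pos
    posttest_p = posttest_odds / (1 + posttest_odds)   # ≈ 0.58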
Receiver Operating Characteristic Curves
So far we have assumed that the new test results are either positive or negative. But what if the new test is quantitative? If the test is to be used to indicate the presence of the disease or to select patients for follow-up tests, a cut-off value is often used to divide patients into positive and negative groups. The cut-off point must be optimized in one study, and then validated in one or more additional studies.
As the cut-off value for a new test is increased, fewer test results are called positive, so false positives decrease but false negatives increase. Lower cut-off values result in a higher number of false positives and fewer false negatives. A receiver operating characteristic (ROC) curve illustrates the trade-off between sensitivity and specificity. The true positive rate (or sensitivity) is plotted against the false positive rate (or 1 − specificity) for all possible cut-off values (Figure 1). A test that has no diagnostic value will have equivalent true and false positive rates at all cut-off values and will therefore have an ROC curve that falls along the green diagonal line. More effective tests will curve above the diagonal line (red and blue lines).
The area under the ROC curve (AUC) can be quantified to compare the performance of two different tests. An AUC of 0.5 indicates that the test has no diagnostic value. As test performance increases, the AUC will approach one. The test depicted with the red line in Figure 1 has an AUC of 0.87. The test depicted with the blue line has an AUC of 0.68.
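For a quantitative test, the curve and its area can be computed directly from the raw test values and the reference-standard labels. A minimal sketch with scikit-learn, using simulated values rather than the data behind Figure 1:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.default_rng(0)
    # Simulated quantitative test values, higher on average in those with the condition.
    y_true = np.concatenate([np.ones(88), np.zeros(750)])    # 1 = condition present
    y_score = np.concatenate([rng.normal(1.0, 1.0, 88),      # with the condition
                              rng.normal(0.0, 1.0, 750)])    # without the condition

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one point per candidate cut-off
    auc = roc_auc_score(y_true, y_score)                # 0.5 = no diagnostic value, 1.0 = perfect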
ROC curves can be a visual tool for finding the optimal cut-off point for a new test. Two tests with the same AUC are graphed in Figure 2. The test depicted with a purple line might work well at the cut-off point that results in 70% sensitivity and 95% specificity (red arrow). The test depicted with an orange line might work well at the cut-off point that results in 85% sensitivity and 80% specificity (blue arrow). However, when setting a cut-off point, the clinical situation in which the test will be used must always be kept in mind. A test with 80% specificity might be unacceptable for screening the general population, and test developers might choose to alter the cut-off point to increase the specificity and reduce the sensitivity.
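One way to explore candidate cut-off points numerically is to scan the thresholds returned by roc_curve and keep only those meeting a minimum specificity; the 90% floor below is an arbitrary illustration of such a clinical constraint:

    import numpy as np
    from sklearn.metrics import roc_curve

    # Any reference-standard labels and quantitative test values will do here.
    y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
    y_score = np.array([3.1, 2.4, 1.0, 0.2, 1.8, 0.9, 0.4, 0.3, -0.5, -1.2])

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    specificity = 1 - fpr

    # Keep cut-offs with at least 90% specificity, then take the most sensitive of them.
    ok = specificity >= 0.90
    best = np.argmax(tpr[ok])
    print(thresholds[ok][best], tpr[ok][best], specificity[ok][best])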
Reporting on Diagnostic Tests
When we present a new diagnostic test to regulatory agencies or journals, we must include appropriate measures of accuracy (sensitivity, specificity, NPV, PPV, likelihood ratios) and the ranges of test values. If the test is quantitative, we should consider presenting an ROC curve (with AUC) and a histogram of test results in patients with and without the condition. All subjects and test results should be accounted for, so that the reader knows how many subjects were enrolled, tested, and included in the analysis. Wherever possible, confidence intervals should be included. The study design should be presented clearly so that all sources of bias and efforts to reduce bias are explicitly stated.