Reply to Elson and Cummins: Uncertainty at the heart
By Baptiste Scancar, Jennifer A. Byrne, David Causeur, Adrian G. Barnett
February 9, 2026
Reply to: Positive predictive values are sensitive to specificity
We thank Elson and Cummins for their commentary1 on our study2. They raise important points regarding uncertainty in estimating the prevalence of paper mill papers and the positive predictive value. Elson and Cummins correctly object that the reported 9.87% rate is not an epidemiological prevalence and should not be interpreted as one without adequate correction. While the model allows large-scale estimation of potential paper mill prevalence, these estimates should be interpreted as approximate indicators rather than precise epidemiological measurements.
Elson and Cummins present Rogan–Gladen3 corrected prevalence estimates ranging from 6.7% to 10.4%, based on sensitivity and specificity estimates derived from the internal and external validation sets. While we acknowledge the mathematical interest of this approach, we note three limitations: (i) as Elson and Cummins themselves observe, this correction assumes knowledge of the true sensitivity and specificity; (ii) the reliability of the ground truth labels is uncertain, and paper mill papers could be present among the controls; and (iii) our training set of paper mill papers is likely not representative of all types of paper mill papers. Also, the estimate based on the internal validation set (Sp = 263/275 = 0.956) is more uncertain than that based on the external validation set (Sp = 3073/3100 = 0.991), which partly explains the spread in corrected prevalence estimates. Using the estimates obtained from the external validation set yields a corrected prevalence very close to – and slightly higher than – the observed prevalence. Given these limitations, we considered Rogan–Gladen and other corrected estimates to be exploratory and chose to focus primarily on the observed prevalence.
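To make the dependence of the correction on the validation-set specificities concrete, the Rogan–Gladen formula, corrected prevalence = (observed + Sp − 1) / (Se + Sp − 1), can be sketched as below. The sensitivity value used here is a hypothetical placeholder, not the estimate used by Elson and Cummins; only the two specificities are taken from the validation sets reported above.

```python
def rogan_gladen(p_obs, sensitivity, specificity):
    """Rogan-Gladen corrected prevalence: (p_obs + Sp - 1) / (Se + Sp - 1)."""
    return (p_obs + specificity - 1) / (sensitivity + specificity - 1)

p_obs = 0.0987   # observed flag rate from the study
se_assumed = 0.90  # hypothetical sensitivity, for illustration only

# Specificities from the internal and external validation sets
for label, sp in [("internal", 263 / 275), ("external", 3073 / 3100)]:
    corrected = rogan_gladen(p_obs, se_assumed, sp)
    print(f"{label}: Sp = {sp:.3f}, corrected prevalence = {corrected:.3f}")
```

Under this illustrative sensitivity, the external-set specificity yields a corrected prevalence slightly above the observed 9.87%, while the lower internal-set specificity pulls the corrected estimate well below it, which is the spread discussed above.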
Elson and Cummins examine the positive predictive value (PPV) and illustrate that it is sensitive to small variations in specificity, using example prevalences of 5% and 10%. We agree with this demonstration. We note, however, that it remains speculative, as it suffers from the same limitations discussed above: uncertainty in test performance metrics and prevalence estimates, and limited generalisability from a small test set. These observations from Elson and Cummins highlight the risk of false positives and reinforce the importance of external human verification of any papers flagged by the model.
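The sensitivity of the PPV to specificity follows directly from Bayes' rule, PPV = Se·p / (Se·p + (1 − Sp)·(1 − p)). A minimal sketch, again using a hypothetical sensitivity of 0.90 (an assumption, not a reported estimate), shows how a small change in specificity shifts the PPV at the two example prevalences:

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

se_assumed = 0.90  # hypothetical sensitivity, for illustration only
for p in (0.05, 0.10):
    for sp in (0.95, 0.99):
        print(f"prevalence = {p:.0%}, Sp = {sp}: PPV = {ppv(p, se_assumed, sp):.2f}")
```

Even with these illustrative numbers, moving specificity between 0.95 and 0.99 changes the PPV substantially at both prevalences, which is the pattern Elson and Cummins demonstrate.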
Elson and Cummins note that model performance varied substantially between validation datasets, specifically referencing the set of 873 papers with wrongly identified nucleotide sequences and/or human cell lines. As these papers had no assigned ground truth labels, this dataset cannot be used to derive performance metrics in the same way as a formal validation dataset. We note that model performance was broadly consistent across the two formal validation datasets.
Elson and Cummins argue that our uncertainty interval around the 9.87% figure primarily reflects sampling variability in a very large dataset and risks being overinterpreted as high certainty about prevalence. We agree: although the calculation method was described, there remains a risk of misinterpretation. The bootstrap confidence interval was intended only to illustrate the stability of the observed prevalence under resampling of the cancer dataset.
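A percentile bootstrap of this kind can be sketched as follows. The data here are synthetic (a hypothetical sample of 10,000 papers with a 9.87% flag rate), so the interval illustrates only why such intervals are narrow at large sample sizes, not the interval reported in the study.

```python
import random

def bootstrap_prevalence_ci(flags, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the observed flag rate.
    `flags` is a list of 0/1 indicators (1 = flagged by the model)."""
    rng = random.Random(seed)
    n = len(flags)
    stats = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic illustration: 987 flagged papers out of 10,000
flags = [1] * 987 + [0] * 9013
print(bootstrap_prevalence_ci(flags))
```

The resulting interval captures only resampling variability around the observed rate; it says nothing about misclassification, which is the distinction Elson and Cummins emphasise.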
In summary, estimating the true prevalence of paper mill papers is challenging, and our estimates do not reflect all sources of uncertainty. We note that our model may have flagged the less sophisticated paper mill templates, whilst more bespoke papers generated by mills could avoid detection. Hence the true prevalence could be higher than our estimate. Future paper mill research will be facilitated by the availability of additional publication datasets with reliable ground truth labels. Such datasets could be generated through improved rates of post-publication corrections, where post-publication notices transparently describe evidence for paper mill support.
References
1. Elson, M. & Cummins, J. Positive predictive values are sensitive to specificity. https://www.bmj.com/content/392/bmj-2025-087581/rr-3 (2026).
2. Scancar, B., Byrne, J. A., Causeur, D. & Barnett, A. G. Machine learning based screening of potential paper mill publications in cancer research: methodological and cross sectional study. BMJ 392, e087581 (2026).
3. Rogan, W. J. & Gladen, B. Estimating prevalence from the results of a screening test. Am. J. Epidemiol. 107, 71–76 (1978).