Reply to Elson and Cummins: Uncertainty at the heart
By Baptiste Scancar, Jennifer A. Byrne, David Causeur, Adrian G. Barnett
February 9, 2026
Reply to: Positive predictive values are sensitive to specificity
We thank Elson and Cummins for their commentary1 on our study2. They raise important points regarding uncertainty in estimating the prevalence of paper mill papers and the positive predictive value. Elson and Cummins correctly object that the reported 9.87% rate is not an epidemiological prevalence and should not be interpreted as one without adequate correction. While the model allows large-scale estimation of potential paper mill prevalence, these estimates should be interpreted as approximate indicators rather than precise epidemiological measurements.
Elson and Cummins present Rogan–Gladen3 corrected prevalence estimates ranging from 6.7% to 10.4%, based on sensitivity and specificity estimates derived from the internal and external validation sets. While we acknowledge the mathematical interest of this approach, we note three limitations: (i) as Elson and Cummins themselves observe, this correction assumes knowledge of the true sensitivity and specificity; (ii) the reliability of the ground truth labels is uncertain, and paper mill papers could be present among the controls; and (iii) our training set of paper mill papers is likely not representative of all types of paper mill papers. Also, the estimate based on the internal validation set (Sp = 263/275 = 0.956) is more uncertain than that based on the external validation set (Sp = 3073/3100 = 0.991), which partly explains the spread in corrected prevalence estimates. Using the estimates obtained from the external validation set yields a corrected prevalence very close to – and slightly higher than – the observed prevalence. Given these limitations, we considered Rogan–Gladen and other corrected estimates to be exploratory and chose to focus primarily on the observed prevalence.
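To make the dependence of the correction on the validation-set specificities concrete, the Rogan–Gladen formula, corrected prevalence = (observed + Sp − 1) / (Se + Sp − 1), can be sketched as below. The sensitivity value used here is a hypothetical placeholder, not the estimate used by Elson and Cummins; only the two specificities are taken from the validation sets reported above.

```python
def rogan_gladen(p_obs, sensitivity, specificity):
    """Rogan-Gladen corrected prevalence: (p_obs + Sp - 1) / (Se + Sp - 1)."""
    return (p_obs + specificity - 1) / (sensitivity + specificity - 1)

p_obs = 0.0987   # observed flag rate from the study
se_assumed = 0.90  # hypothetical sensitivity, for illustration only

# Specificities from the internal and external validation sets
for label, sp in [("internal", 263 / 275), ("external", 3073 / 3100)]:
    corrected = rogan_gladen(p_obs, se_assumed, sp)
    print(f"{label}: Sp = {sp:.3f}, corrected prevalence = {corrected:.3f}")
```

Under this illustrative sensitivity, the external-set specificity yields a corrected prevalence slightly above the observed 9.87%, while the lower internal-set specificity pulls the corrected estimate well below it, which is the spread discussed above.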
Elson and Cummins examine the positive predictive value (PPV) and illustrate that it is sensitive to small variations in specificity, using example prevalences of 5% and 10%. We agree with this demonstration. We note, however, that it remains speculative, as it suffers from the same limitations discussed above: uncertainty in test performance metrics and prevalence estimates, and limited generalisability from a small test set. These observations from Elson and Cummins highlight the risk of false positives and reinforce the importance of external human verification of any papers flagged by the model.
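The sensitivity of the PPV to specificity follows directly from Bayes' rule, PPV = Se·p / (Se·p + (1 − Sp)·(1 − p)). A minimal sketch, again using a hypothetical sensitivity of 0.90 (an assumption, not a reported estimate), shows how a small change in specificity shifts the PPV at the two example prevalences:

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

se_assumed = 0.90  # hypothetical sensitivity, for illustration only
for p in (0.05, 0.10):
    for sp in (0.95, 0.99):
        print(f"prevalence = {p:.0%}, Sp = {sp}: PPV = {ppv(p, se_assumed, sp):.2f}")
```

Even with these illustrative numbers, moving specificity between 0.95 and 0.99 changes the PPV substantially at both prevalences, which is the pattern Elson and Cummins demonstrate.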
Elson and Cummins note that model performance varied substantially between validation datasets, specifically referencing the set of 873 papers with wrongly identified nucleotide sequences and/or human cell lines. As these papers had no assigned ground truth labels, this dataset cannot be used to derive performance metrics in the same way as a formal validation dataset. We note that model performance was broadly consistent across the two formal validation datasets.
Elson and Cummins argue that our uncertainty interval around the 9.87% figure primarily reflects sampling variability in a very large dataset and risks being overinterpreted as high certainty about prevalence. We agree: although the calculation method was described, there remains a risk of misinterpretation. The bootstrap confidence interval was intended only to illustrate the stability of the observed prevalence under resampling of the cancer dataset.
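A percentile bootstrap of this kind can be sketched as follows. The data here are synthetic (a hypothetical sample of 10,000 papers with a 9.87% flag rate), so the interval illustrates only why such intervals are narrow at large sample sizes, not the interval reported in the study.

```python
import random

def bootstrap_prevalence_ci(flags, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the observed flag rate.
    `flags` is a list of 0/1 indicators (1 = flagged by the model)."""
    rng = random.Random(seed)
    n = len(flags)
    stats = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic illustration: 987 flagged papers out of 10,000
flags = [1] * 987 + [0] * 9013
print(bootstrap_prevalence_ci(flags))
```

The resulting interval captures only resampling variability around the observed rate; it says nothing about misclassification, which is the distinction Elson and Cummins emphasise.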
In summary, estimating the true prevalence of paper mill papers is challenging, and our estimates do not reflect all sources of uncertainty. We note that our model may have flagged the less sophisticated paper mill templates, whilst more bespoke papers generated by mills could avoid detection. Hence the true prevalence could be higher than our estimate. Future paper mill research will be facilitated by the availability of additional publication datasets with reliable ground truth labels. Such datasets could be generated through improved rates of post-publication corrections, where post-publication notices transparently describe evidence for paper mill support.
References
1. Elson, M. & Cummins, J. Positive predictive values are sensitive to specificity. https://www.bmj.com/content/392/bmj-2025-087581/rr-3 (2026).
2. Scancar, B., Byrne, J. A., Causeur, D. & Barnett, A. G. Machine learning based screening of potential paper mill publications in cancer research: methodological and cross sectional study. BMJ 392, e087581 (2026).
3. Rogan, W. J. & Gladen, B. Estimating prevalence from the results of a screening test. Am. J. Epidemiol. 107, 71–76 (1978).