Peer review leaves the central claims of most biomedical preprints intact
A claim-level comparison of 72,644 bioRxiv preprints with their peer-reviewed publications, labelled by a large language model.
Hao Yin & Ruslan Rust · bioRxiv 2018–2025 · Claude Sonnet 4.6
The headline · Figure 1a–b
Most primary claims survive peer review
For every pair, the model extracted the single primary claim of the abstract and judged whether it was unchanged, minorly revised, or majorly changed between the preprint and the published version. The large majority changed little.
Summary cards (values not rendered): Primary claim unchanged · Minor revision (wording only) · Major change in content · More cautious vs more confident wording
a · Content change of the primary claim
89.9% of primary claims were unchanged or only minorly revised; just 10.2% changed substantially.
b · Hedging shift of the primary claim
When wording shifted, it moved toward caution twice as often as toward confidence (8.4% vs 4.2%).
Caution scales with revision · Figure 1c
The bigger the revision, the more the wording softens
Among unchanged claims, the certainty of the wording almost never moved. Among majorly revised claims, the wording became more cautious in 38.5% of assessable cases and more confident in only 19.8%.
c · Hedging direction within each content-change stratum
Percent of claims with an assessable hedging label that became more cautious (blue) or more confident (orange).
2:1
Across the whole corpus, weakened claims outnumber strengthened claims roughly two to one, and the strengthened-to-weakened ratio stays below parity in every field.
P<0.001
Two-sided sign test on the 9,150 pairs with any hedging shift.
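A two-sided sign test of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the 6,100 / 3,050 split is an assumed round-number division of the 9,150 shifted pairs at exactly 2:1, not the paper's actual counts. Under the null hypothesis, a shift is equally likely to go in either direction, so the smaller count follows a Binomial(n, 0.5) distribution.

```python
import math

def sign_test_two_sided(n_cautious: int, n_confident: int) -> float:
    """Exact two-sided sign test p-value under H0: both directions equally likely."""
    n = n_cautious + n_confident
    k = min(n_cautious, n_confident)
    log_half_n = n * math.log(0.5)
    # Log of each binomial term C(n, i) * 0.5**n for i = 0..k, computed in
    # log space so that very small probabilities stay stable for large n.
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1) + log_half_n
        for i in range(k + 1)
    ]
    peak = max(log_terms)
    tail = math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)
    # Double the one-sided tail and cap at 1.
    return min(1.0, 2.0 * tail)

# Hypothetical split of the 9,150 shifted pairs at exactly 2:1 (NOT the paper's counts).
p_value = sign_test_two_sided(6100, 3050)
print(p_value < 0.001)
```

Pairs with no hedging shift carry no sign and are left out, which is why the test runs on 9,150 pairs rather than the full corpus.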
By field · Figure 1d–e
Revision rate varies by field, but the direction of caution does not
Major revision of the primary claim ranged from 7.2% in bioinformatics to 17.5% in microbiology across the 17 fields with at least 1,500 pairs. Yet the shift toward caution held everywhere: the strengthened-to-weakened ratio stayed below one in every field.
d · Content-change composition by field (ordered by major-revision rate)
e · Strengthened-to-weakened ratio of primary claims by field (dashed line = parity)
Claim types · Figure 1f
The kind of claim almost never changes
The primary claim type was preserved in 96.5% of pairs. Among the few that did change type, transitions ran mostly between adjacent categories (for example, mechanistic to descriptive or associative) rather than collapsing to a null result.
f · Primary claim-type transitions among pairs whose type changed
Band width is proportional to the number of pairs. Descriptive claims show the largest net gain; mechanistic claims are the most often reclassified.
Drivers of revision · Figure 2
Larger revisions track longer review and higher-impact journals
Revision moves the claims of an abstract together, declines year on year, and rises with both the length of peer review and the impact of the destination journal.
a · Secondary claim follows the primary
When the primary claim was majorly revised, the first secondary claim also changed in 90% of pairs, versus only 34% when the primary was unchanged.
b · Major-revision rate by primary claim type
Method claims are the most stable (5.4% major). Secondary claims (red) are revised more often than primary claims (blue) for every type.
c · Major-revision rate by year of preprint posting
Major revision fell from 17.0% in 2019 to 5.7% in 2024.
d · Content change by review-time tertile
Major revision rose from 7.0% in the fastest tertile (median 110 days) to 14.1% in the slowest (median 416 days).
e · Any revision vs journal impact (2-yr mean citedness, log scale)
Revision rose ~23 percentage points per tenfold increase in journal impact (weighted fit R²=0.77; 59,012 pairs, 908 journals).
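The slope quoted here is the coefficient of a straight-line fit against log10 of journal impact, so it reads directly as percentage points per tenfold increase. A minimal sketch of such a weighted fit follows. The weighting by number of pairs per journal is an assumption (the text says only "weighted fit"), the function name is hypothetical, and the four journals are toy data, not values from the study.

```python
import math

def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b, weighted R^2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum(wi * (yi - (a + b * xi)) ** 2 for wi, xi, yi in zip(w, x, y))
    ss_tot = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return a, b, 1.0 - ss_res / ss_tot

# Toy journals (hypothetical): 2-yr mean citedness, % of pairs with any
# revision, and number of pairs used as the weight.
citedness = [1.0, 3.0, 10.0, 30.0]
revision_pct = [20.0, 33.0, 41.0, 55.0]
n_pairs = [400, 900, 250, 60]

log_impact = [math.log10(c) for c in citedness]
intercept, slope, r2 = weighted_linear_fit(log_impact, revision_pct, n_pairs)
print(f"{slope:.1f} percentage points per tenfold increase, R^2 = {r2:.2f}")
```

Weighting by pair count keeps journals with only a handful of pairs from dominating the fit.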
f · Retraction rate: preprinted vs never-preprinted
Papers never posted as preprints were retracted more than twice as often (18.7 vs 8.1 per 10,000; rate ratio 2.31, 95% CI 1.20–4.45, P=0.003).
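The rate ratio is the quotient of the two retraction rates, and a confidence interval for it is commonly built on the log scale. The sketch below uses the standard log-scale (Katz) interval for a ratio of two proportions. Whether the authors used this method is not stated, the function name is hypothetical, and the event counts are toy values chosen only for illustration.

```python
import math

def rate_ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Ratio of two proportions with a log-scale (Katz) confidence interval."""
    rr = (events_a / total_a) / (events_b / total_b)
    se_log = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Consistency check using only the rates quoted in the text (per 10,000 papers).
print(round(18.7 / 8.1, 2))  # 2.31

# Toy counts (hypothetical, NOT from the study): 30 vs 15 retractions
# among 10,000 papers in each group.
rr, lower, upper = rate_ratio_ci(30, 10_000, 15, 10_000)
print(f"rate ratio {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The interval width is driven almost entirely by the number of retractions, which is why a ratio estimated from rare events carries a wide interval even when the groups are large.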
Validation · Suppl. Figure 2
The model agrees with experts as well as experts agree with each other
On a stratified subsample of 120 pairs, model–expert agreement (Cohen's κ 0.63–0.66) matched the agreement between the two domain experts (κ 0.60). Three replicate model runs agreed at κ=0.75.
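Cohen's κ corrects raw agreement for the agreement two raters would reach by chance given how often each uses every label. A minimal sketch of the unweighted statistic follows. Whether the study used unweighted or weighted κ is not stated, so the unweighted form is an assumption, and the six labels per rater are toy data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy labels (hypothetical) for six preprint-publication pairs.
model = ["unchanged", "unchanged", "minor", "major", "minor", "unchanged"]
expert = ["unchanged", "minor", "minor", "major", "minor", "unchanged"]
print(round(cohens_kappa(model, expert), 3))
```

The same function applies to each comparison reported here: model against each expert, expert against expert, and replicate model runs against one another.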