
We Benchmarked 80,000+ Human and AI Review Comments. Here's What We Found.

AI review tools are emerging as a potential solution to our increasingly strained peer review system, but to date, there's no standardized way to evaluate them.

We built ReviewBench, a venue-agnostic, extensible benchmark framework for evaluating human and AI peer reviews. We use it to compare human reviews with Reviewer3 (R3), a multi-agent system, and two frontier reasoning models, GPT-5.2 and Gemini 3 Pro, across a dataset of 1,000 ICLR 2025 papers.

The dataset consists of over 80,000 review comments: 35,149 human, 18,708 R3, 21,709 GPT-5.2, and 5,899 Gemini 3 Pro. GPT-5.2 and Gemini 3 Pro were given the same minimal, venue-agnostic prompt yet produced vastly different comment volumes.


Figure 1. Total number of review comments and comments per paper by source on a dataset of 1,000 ICLR 2025 papers.

AI Reviews Are More Structured

A peer review comment can include several structural elements: a reference to a specific claim in the manuscript, an issue with that claim, a rationale for why it is an issue, a suggestion for how to address it, and a location anchor. We extracted these structural dimensions from every human and AI review comment. Perhaps unsurprisingly, AI reviews are more structured than human reviews, and R3 leads on every dimension: 99.8% of its comments cite a specific issue with a claim, 96.0% provide a rationale for why it is an issue, 93.9% are actionable, and 89.9% are anchored to a location in the paper. Human reviews, by comparison, are specific 71.5% of the time and actionable just 24.0% of the time.
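To make the extraction concrete, here is a minimal sketch of the kind of per-comment schema this step produces and how the per-dimension rates fall out of it. The field and function names are illustrative, not the exact ReviewBench implementation:

```python
# Illustrative per-comment schema; names are assumptions, not the real pipeline.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParsedComment:
    source: str                # "human", "r3", "gpt", "gemini"
    claim: Optional[str]       # manuscript claim the comment references
    issue: Optional[str]       # specific problem raised with that claim
    rationale: Optional[str]   # why the issue matters
    suggestion: Optional[str]  # concrete fix, if any (actionability)
    anchor: Optional[str]      # location in the paper, e.g. "Table 2"

def rate(comments, field):
    """Fraction of comments where a structural field is present."""
    hits = sum(1 for c in comments if getattr(c, field) is not None)
    return hits / len(comments) if comments else 0.0

# Two toy comments, one sparse (human-like) and one fully structured (R3-like).
comments = [
    ParsedComment("human", "claims SOTA on CIFAR-10", "no error bars",
                  None, None, "Table 2"),
    ParsedComment("r3", "Theorem 1 assumes convexity", "assumption is unstated",
                  "proof breaks without it", "state the assumption", "Sec. 3.1"),
]
print(f"rationale rate:  {rate(comments, 'rationale'):.1%}")
print(f"actionable rate: {rate(comments, 'suggestion'):.1%}")
```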


Figure 2. Rationale and anchored rate per paper, by source.

Engagement with Major Results

We adapted prior work to extract 3-7 major results from each paper, map each comment to this predefined set of results, and label it with a stance (supportive, critical, or neutral). R3 maps 87.9% of its comments to a paper's major results, compared to 60.3% for GPT-5.2, 67.8% for Gemini 3 Pro, and just 48.9% for human reviewers, suggesting greater engagement with a paper's main conclusions. We also find that R3 is the most critical reviewer: 96.7% of its comments are critical, compared to 80.8% for GPT-5.2, 77.2% for Gemini 3 Pro, and 64.4% for humans. Humans are the most balanced across supportive (23.6%), critical (64.4%), and neutral (12.0%) stances, reflecting the typical ICLR review structure, with designated sections for Strengths, Weaknesses, and Questions.
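As a rough illustration (not the actual pipeline), the mapping step boils down to attaching a result index and a stance to each comment, after which the mapped rate and stance shares are simple aggregates:

```python
# Toy data: each comment carries a major-result index (or None) and a stance.
from collections import Counter

mapped = [
    {"source": "human", "result": 0,    "stance": "critical"},
    {"source": "human", "result": None, "stance": "neutral"},   # unmapped
    {"source": "r3",    "result": 1,    "stance": "critical"},
    {"source": "r3",    "result": 0,    "stance": "supportive"},
]

def mapped_rate(comments):
    return sum(c["result"] is not None for c in comments) / len(comments)

def stance_share(comments, stance):
    return Counter(c["stance"] for c in comments)[stance] / len(comments)

print(f"mapped rate:    {mapped_rate(mapped):.1%}")
print(f"critical share: {stance_share(mapped, 'critical'):.1%}")
```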


Figure 3. Percent of comments that map to one of the 3-7 major results within a paper, by source.

Comments that Undermine Result Validity

Despite mapping to the same result and having the same stance, comments can vary in impact. We therefore defined a consequential label: a boolean marking comments that, if true, would undermine the validity of a major result. R3 achieves a 92.5% consequential rate, meaning the vast majority of its comments could undermine a major result. Humans sit at 61.8%.

When we restrict to critical comments, the consequential gap narrows considerably: R3 achieves 95.3%, Gemini 3 Pro 91.9%, GPT-5.2 90.6%, and humans 89.3%. However, when we rank sources per paper on consequential-critical rate, human reviewers rank 1st on 501 of 1,000 papers, more than any other source, while R3 ranks 1st on just 301. Although R3 has the higher average, human critical comments are more often the most consequential on a given paper, reflecting high variance but sharper criticisms at the top end.
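The distinction between a higher average and more 1st-place finishes is easy to see in miniature. Here is a toy sketch with made-up rates (not our data) showing how one source can win the mean while another wins more papers:

```python
# Made-up consequential-critical rates per source, for three toy papers.
from collections import Counter

papers = [
    {"human": 1.00, "r3": 0.95, "gpt": 0.90, "gemini": 0.88},
    {"human": 0.60, "r3": 0.96, "gpt": 0.92, "gemini": 0.90},
    {"human": 1.00, "r3": 0.94, "gpt": 0.93, "gemini": 0.91},
]

# Count which source ranks 1st on each paper, then compare to the means.
first_place = Counter(max(rates, key=rates.get) for rates in papers)
averages = {s: sum(p[s] for p in papers) / len(papers) for s in papers[0]}

print("1st-place counts:", dict(first_place))  # human wins 2 of 3 papers here
print("averages:", averages)                   # yet r3 has the higher mean
```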


Figure 4. Percent of mapped comments that would undermine one of the 3-7 major results within a paper (left), and per-paper ranking on consequential-critical rate (right).

Agreement with Humans

We define two new metrics that measure how well each AI system identifies the same critical results flagged by human reviewers. Recall is the percentage of human-flagged critical results that each AI source also catches; precision is the percentage of each AI source's critical results that humans also flag.
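Since both metrics operate over sets of (paper, result) pairs flagged as critical, they reduce to a set intersection. A small sketch with illustrative data:

```python
# Sets of (paper_id, result_idx) pairs flagged critically; data is illustrative.
human_criticals = {(0, 1), (0, 2), (1, 0)}
ai_criticals = {(0, 1), (0, 2), (1, 0), (1, 2)}

overlap = human_criticals & ai_criticals
recall = len(overlap) / len(human_criticals)  # human criticals the AI catches
precision = len(overlap) / len(ai_criticals)  # AI criticals humans also flag

print(f"recall: {recall:.1%}, precision: {precision:.1%}")
```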

R3 and GPT-5.2 achieve 87.5% and 86.0% recall respectively, critiquing the vast majority of results that human reviewers critique.

We introduce a third metric, novel corroborated critiques, which measures how often each AI system identifies issues not flagged by human reviewers but corroborated by at least one other AI system. R3 and GPT-5.2 lead, finding 1.12 and 1.10 novel corroborated critiques per paper respectively, compared to 0.57 for Gemini 3 Pro. Notably, 92.5% of novel corroborated critiques are shared between R3 and GPT-5.2, suggesting that AI sources converge on the same issues when identifying problems that human reviewers missed.
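In sketch form (again with illustrative data, not our pipeline), a critique is novel if no human flagged that result and corroborated if at least one other AI source flagged it too:

```python
# Result indices flagged critically on one toy paper; data is illustrative.
human = {0}
ai = {
    "r3":     {0, 1, 2},
    "gpt":    {0, 2},
    "gemini": {1},
}

def novel_corroborated(source):
    """Results this source flags that no human flagged, but another AI did."""
    others = set().union(*(v for k, v in ai.items() if k != source))
    return {r for r in ai[source] if r not in human and r in others}

for src in ai:
    print(src, sorted(novel_corroborated(src)))
# r3 -> [1, 2], gpt -> [2], gemini -> [1]
```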


Figure 5. Total critical results, critical results that agree with human, and critical results that are novel and corroborated by at least one other AI source, per paper.

Future Directions

ReviewBench evaluates review quality by parsing each comment into its structural components (issue, rationale, actionability, and anchoring) and mapping comments to a predefined set of major results per paper for direct comparison. Notably, while AI reviews are more consistent on average, led by R3, human reviewers rank 1st in consequential-critical rate on more individual papers than any AI system, underscoring that both have distinct strengths. The framework is venue-agnostic and extensible to any AI review system, allowing new tools to be evaluated repeatedly and at scale. We look forward to sharing the preprint detailing the full methodology and results.

Ready to see what AI peer review looks like? Try R3 on your own paper.

See the Evidence in Your Own Work

Upload a paper or grant and receive one free review—no credit card required.

Get a Free Review