Benchmarks

Evaluating the Reference Checker

We tested our reference checker on a held-out set of 476 citations containing both real and AI-fabricated references. The test set is drawn from research on LLM hallucinations published in Scientific Reports (Nature Portfolio).

Accuracy: 97.5% (464/476 correct)

Precision: 98.6% (low false positives)

Recall: 97.3% (real references found)

F1 Score: 98.0% (balanced measure)

Dataset Composition

The benchmark includes both legitimate citations from published research and AI-fabricated references generated by GPT-3.5 and GPT-4.

Real citations: 299 (62.8%)
Fabricated citations: 177 (37.2%)
Total: 476

Fabricated citations include fully invented papers, fake authors, and non-existent journals, all of which are typical LLM hallucination patterns.

Methodology

Our reference checker combines multiple verification approaches:

  • Academic databases: OpenAlex, CrossRef, and other scholarly sources
  • AI-powered verification: Gemini for intelligent matching and validation
  • Multi-source cross-referencing: Citations verified against multiple independent databases

A citation is marked as "not found" only after exhausting all available sources.
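
The "not found only after exhausting all sources" rule can be sketched as a simple fallback loop. This is a minimal illustration, not the checker's actual implementation: the `verify` function, the result fields, and the two stub lookups are hypothetical stand-ins for real database clients.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A lookup takes a citation string and returns a matching record, or None.
Lookup = Callable[[str], Optional[Dict[str, str]]]


def verify(citation: str, sources: List[Tuple[str, Lookup]]) -> dict:
    """Try each source in order; report "not found" only if every source misses."""
    tried = []
    for name, lookup in sources:
        tried.append(name)
        record = lookup(citation)
        if record is not None:
            # Stop at the first source that returns a match.
            return {"status": "found", "source": name, "record": record, "tried": tried}
    return {"status": "not found", "source": None, "record": None, "tried": tried}


# Hypothetical in-memory stubs; real clients would query the databases.
def stub_openalex(citation: str) -> Optional[Dict[str, str]]:
    return None  # pretend this source has no match for anything


def stub_crossref(citation: str) -> Optional[Dict[str, str]]:
    if "Viable offspring" in citation:
        return {"title": "Viable offspring derived from fetal and adult mammalian cells"}
    return None


sources = [("openalex", stub_openalex), ("crossref", stub_crossref)]
hit = verify("I. Wilmut. Viable offspring derived from fetal and adult mammalian cells. Nature.", sources)
miss = verify("J. Smith. The impact of migration on the health of older adults.", sources)
print(hit["status"], hit["source"])
print(miss["status"], miss["tried"])
```

Recording which sources were tried makes a "not found" verdict auditable: the result shows that every configured source was consulted before the citation was flagged.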

Confusion Matrix

Classification performance across all 476 benchmark citations

                Predicted Real         Predicted Fake
Actually Real   291 (True Positive)    8 (False Negative)
Actually Fake   4 (False Positive)     173 (True Negative)
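
The headline metrics follow directly from these four counts, with "real" treated as the positive class. The short sketch below recomputes them using the standard definitions; the variable names are ours.

```python
# Confusion matrix counts from the benchmark ("real" is the positive class).
tp, fn = 291, 8    # real citations: found / not found
fp, tn = 4, 173    # fabricated citations: found / not found

total = tp + fn + fp + tn
accuracy = (tp + tn) / total        # share of all citations classified correctly
precision = tp / (tp + fp)          # share of "found" citations that are real
recall = tp / (tp + fn)             # share of real citations that were found
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)        # share of fabricated citations caught

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
    ("specificity", specificity),
]:
    print(f"{name}: {value:.1%}")
```

Specificity is the figure reported as the share of fabricated citations caught (173 of 177).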

Key Insights

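98.6% precision, few fabricated citations accepted

Of the 295 citations the checker reported as found, 291 were real; only 4 of 177 fabricated citations evaded detection. In the other direction, 173 of the 181 citations it flagged as not found (95.6%) were fabricated, so researchers can trust that a flagged reference warrants investigation.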

97.7% of fabricated citations caught

173 of 177 LLM-hallucinated references, including those with fake DOIs and plausible metadata, were correctly identified.

97.3% of real citations verified

291 of 299 legitimate references were successfully matched and linked to their original sources, even with formatting variations.
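
Matching despite formatting variations usually involves normalizing both the cited title and the database title before comparing them. The sketch below shows one common approach, assuming English-language titles; `normalize_title` is a hypothetical helper and not necessarily how the checker handles this.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Reduce a title to a lowercase, punctuation-free, single-spaced key."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then replace each run of non-alphanumerics with one space.
    # ASCII-only: titles in other scripts would need a broader pattern.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


cited = "Does professor quality matter? Evidence from random assignment of students to professors."
variant = "DOES PROFESSOR QUALITY MATTER?  Evidence from Random Assignment of Students to Professors"
print(normalize_title(cited) == normalize_title(variant))
```

Exact comparison of normalized keys is deliberately strict. A production matcher would likely add fuzzy scoring on top, at the cost of more false matches.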

Example Citations
See how R3 classifies real papers vs AI-fabricated citations.
Real Paper
R3: Found
True Positive
I. Wilmut, ..., K. H. Campbell. Viable offspring derived from fetal and adult mammalian cells. Nature, 385(6619), 810-813.

The famous Dolly the sheep cloning paper, correctly verified.

AI Fabricated
R3: Not Found
True Negative
J. Smith. The impact of migration on the health of older adults. Journal of Gerontology: Social Sciences, 2015, 70(4), 497-505.

Generic author name and plausible-sounding title, but completely fabricated by GPT.

AI Fabricated
R3: Not Found
True Negative
Y. Kang, J. Kim. A Comparison of the Environmental Impact of Molten Salt Reactors and Conventional Nuclear Fission Reactors. Journal of Cleaner Production, 2021, 288, 124959.

Includes a fake DOI that looks legitimate, hallucinated by the LLM.

Real Paper
R3: Found
True Positive
S. E. Carrell, J. E. West. Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 2010, 118(3), 409-432.

Influential education economics paper, verified through academic databases.

Real Paper
R3: Not Found
False Negative
World Health Organization (WHO). Global oral health data bank. Geneva: World Health Organization, 2013.

Institutional reports often lack DOIs and standard metadata, making them harder to verify.

AI Fabricated
R3: Found
False Positive
Y. J. Lee. Enhancing students' communication skills in the science classroom through socio-scientific issues-based instruction. International Journal of Science Education, 2017, 39(4), 414-434.

A paper with the exact title and journal exists but with different authors.
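
This failure mode suggests one possible guard: after a title and journal match, also require that at least one cited author surname appears in the matched record. The sketch below is an illustration under that assumption, not a description of how R3 works. The helper names are hypothetical and the record author lists are illustrative placeholders, not data from the benchmark.

```python
from typing import List


def surname(name: str) -> str:
    """Take the last whitespace-separated token as the surname, lowercased.

    Assumes "Initials Surname" or "First Surname" order and a non-empty name.
    """
    return name.replace(".", " ").split()[-1].lower()


def authors_overlap(cited: List[str], record: List[str]) -> bool:
    """True if at least one cited surname appears among the record's authors."""
    return bool({surname(a) for a in cited} & {surname(a) for a in record})


# Same paper, different name formatting: surnames still overlap.
print(authors_overlap(["S. E. Carrell", "J. E. West"], ["Scott E. Carrell", "James E. West"]))

# Title matched, but the record lists other (placeholder) authors: no overlap.
print(authors_overlap(["Y. J. Lee"], ["A. Placeholder", "B. Example"]))
```

A surname check like this would not help with institutional authors such as the World Health Organization, so it would need to be skipped or adapted for reports.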
