Artificial intelligence tools are now appearing inside top-tier research papers – and in some cases, they are introducing references to studies that do not exist.
The problem has surfaced in accepted conference papers, raising concerns about how easily reference errors can slip into peer-reviewed work.
A recent scan of 4,841 accepted papers identified 100 fabricated citations across 51 submissions. The review came from GPTZero, which examined reference lists after finding that citation mistakes often survive peer review.
The results matter because they involve conferences such as NeurIPS, one of the most selective venues in artificial intelligence research.
While invented citations can trigger rejection or revocation of acceptance, the findings highlight a broader risk: as AI writing tools spread, even small reference failures make verification harder and weaken trust in scientific publishing.
The damage of false AI references
A fake citation is more than a typo, because it breaks the trail that lets readers track evidence.
Some authors used a large language model (LLM) – a text-prediction system trained on vast collections of text – and such systems can invent sources that look genuine.
NeurIPS pointed out that even if 1.1 percent of accepted papers contain one or more incorrect references introduced by large language models, the papers’ core findings are not necessarily invalidated.
That stance protects valid results, yet it still leaves readers with extra work when they need to verify claims.
Why fake references appear
Prediction-driven writing rewards plausibility, so an LLM can sound confident while guessing details it never truly checked.
Because the model completes patterns learned from its training text, it may blend real author names with the wrong journals and dates.
Standard citation styles add believable structure, making these mistakes harder to spot during a fast final edit.
Simple database searches can stop this, but only if someone runs them before trusting the generated reference list.
Citations as career currency
In research hiring, citation metrics – measures of how often a paper is cited – often sit alongside letters of recommendation and awards.
Those numbers matter because they signal attention, which can translate into funding, jobs, and invitations to collaborate.
The San Francisco Declaration on Research Assessment (DORA) urges institutions to judge the work itself, not journal-based scores used as shortcuts.
Made-up citations blur those signals, and they can reward sloppy behavior by padding influence that was never earned.
Too many papers to check
Official statistics put the NeurIPS main track at 21,575 submissions and 5,290 acceptances, a 24.52 percent acceptance rate.
That volume forced organizers to rely on a huge volunteer network, where reviewers juggle research, teaching, and deadlines.
Program chairs wrote that limited time kept them from manually revisiting every outlier decision flagged by scores.
When attention runs thin, reference lists become easy to skim, so small errors can slide into the final record.
How references get verified
Citation checkers start by splitting each reference into its parts – authors, title, venue, and year – and then standardizing spelling and punctuation across entries.
Next, they query bibliographic databases – online indexes that store paper titles and authors – and flag entries that return no match.
Most systems also score near-matches, since small typos can hide in initials, page numbers, or conference names.
A flagged entry still needs judgment, because older books and early drafts posted online sometimes sit outside major databases.
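As a rough illustration, the core of such a check can be sketched in a few lines of Python against Crossref’s public REST API. The similarity threshold and the choice of Crossref here are illustrative assumptions, not a description of any particular tool, and a low score only means a title was not found in one database.

```python
import requests
from difflib import SequenceMatcher

CROSSREF_API = "https://api.crossref.org/works"

def check_reference(cited_title, min_similarity=0.85):
    """Flag a cited title that has no close match in Crossref.

    The 0.85 similarity threshold is an illustrative choice, not a standard.
    """
    resp = requests.get(
        CROSSREF_API,
        params={"query.bibliographic": cited_title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]

    # Compare the cited title against the closest database matches.
    best = 0.0
    for item in items:
        for title in item.get("title", []):
            ratio = SequenceMatcher(None, cited_title.lower(), title.lower()).ratio()
            best = max(best, ratio)

    # A low score means "not found here", not "definitely fabricated" –
    # flagged entries still need a human look, as noted above.
    return {"title": cited_title,
            "best_match": round(best, 2),
            "flagged": best < min_similarity}

if __name__ == "__main__":
    print(check_reference("Attention Is All You Need"))
```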
Rethinking peer review incentives
Calls for reform are growing, since conferences depend on goodwill while the number of submissions keeps climbing each year.
A position paper proposed letting authors rate review quality and giving reviewers formal credit for effort.
Those feedback loops could discourage rushed, template-like reviews, because poor work would become visible to the same community that submits papers.
Even with better incentives, automated citation checks may still be needed so reviewers can spend their time on results.
Preventing citation mistakes
Careful authorship treats citations as evidence, so the reference list deserves the same attention as figures and tables.
Reference managers can pull details from databases, which reduces hand typing and keeps titles, years, and author order consistent.
When AI systems help draft text, verifying each referenced title in a search engine can catch fabricated sources before submission.
That habit adds minutes, yet it spares readers from chasing dead ends when they try to follow the supporting literature.
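For authors who want to automate part of that step, a short Python sketch shows the kind of database lookup a reference manager performs: it pulls a paper’s canonical details from Crossref by DOI, so titles, years, and author order come from the record rather than from memory. The field names follow Crossref’s public JSON format, and the DOI in the example is only a stand-in for one taken from a real reference list.

```python
import requests

def fetch_metadata(doi):
    """Pull a work's canonical details from Crossref by DOI.

    A simplified sketch of what reference managers do automatically;
    field names follow Crossref's public JSON schema.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    work = resp.json()["message"]
    return {
        "title": work["title"][0],
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in work.get("author", [])],
        "year": work["issued"]["date-parts"][0][0],
        "venue": (work.get("container-title") or [""])[0],
    }

if __name__ == "__main__":
    # Replace the DOI below with one taken from your own reference list.
    print(fetch_metadata("10.1038/nature14539"))
```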
The ripple effect of mistakes
Evidence from earlier tests shows that chatbots can produce polished reference lists even when the sources do not exist.
One peer-reviewed study found that 55 percent of AI-generated references from an earlier ChatGPT model were fabricated, while a newer version reduced that rate to 18 percent.
Many of the remaining errors blended real and fake details, making quick human checks less reliable.
Such small reference failures can easily spread beyond a single paper. Under deadline pressure, conference submissions and other scholarly work may inherit unverified citations pasted into otherwise careful prose, and those errors then ripple from review panels to everyday readers.
Clearer policies, better tools, and routine citation checks can help protect trust in scholarship as AI writing tools become more common.