Journal editors and peer reviewers face an overwhelming influx of AI-generated papers that are increasingly difficult to detect.
Last summer, Peter Degen’s postdoctoral supervisor raised an unusual concern: one of his papers was getting a suspiciously high number of citations. Published in 2017, the paper evaluated a statistical-analysis technique on epidemiological data. It had garnered a few dozen citations over the years, but recently it had been cited hundreds of times, placing it among his most-cited works. Intrigued, the supervisor asked Degen to investigate.
Degen, a postdoctoral researcher at the University of Zurich’s Center for Reproducible Science and Research Synthesis, discovered that the citing papers shared a pattern. All of them analyzed the Global Burden of Disease study, a public dataset from the Institute for Health Metrics and Evaluation at the University of Washington. But they used the dataset to churn out endless predictions: the likelihood of stroke in adults over 20, of testicular cancer in young adults, of falls among the elderly in China, of colorectal cancer in people who eat few whole grains, and so on.
While searching on GitHub for relevant code, Degen was led to the Chinese social media site Bilibili, where he found a Guangzhou-based company offering tutorials on producing publishable research in under two hours using software tools and AI writing assistance. Despite their errors and misrepresentations, these studies weren’t as blatantly incorrect as past AI-generated papers, making them harder to filter out.
“It’s a huge burden on the peer-review system, already at its limit,” Degen said. With paper submissions rising and too few peer reviewers to handle them, technology that enables the mass production of papers could push the system to a breaking point.
Proponents of generative AI are optimistic that it will drive scientific breakthroughs, accelerating discovery and perhaps even helping to eliminate some types of cancer. For now, though, the technology is undermining a fundamental pillar of scientific research, inundating editors and reviewers with an endless stream of submissions. Paradoxically, the better it gets at producing competent papers, the worse the crisis becomes.
For years, academic publishing has battled “paper mills,” black-market companies that mass-produce papers and sell authorship slots to academics seeking the edge that published research provides. It has been a game of cat and mouse: publishers, often prodded by science sleuths (researchers who specialize in uncovering fraudulent research), close one vulnerability only for the mills to exploit another. Generative AI has been a boon for the mills, helping them slip past plagiarism detectors by generating fresh text and images. In theory, AI’s telltale hallucinations make such papers easy to screen out; in practice, some still get through and must be retracted once someone spots a hilarious blunder, such as a rat diagram with oversized genitals labeled “testtomcels” or a stray “as an AI assistant” note left unedited.
Now AI can produce convincing papers almost wholesale, letting academics desperate for publications churn them out on their own. The result is a deluge of scientific sloppiness that threatens to overwhelm publishing, peer review, grant-making, and the broader research ecosystem.
Matt Spick, a lecturer in health and biomedical data analytics at the University of Surrey and an associate editor at Scientific Reports, first noticed the trend when he received three strikingly similar papers analyzing the US National Health and Nutrition Examination Survey (NHANES), another public dataset. A check of Google Scholar showed it was no coincidence: there had been a sudden explosion in papers citing NHANES that followed a similar formula, each claiming to find an association between, say, eating walnuts and cognitive function, or drinking skim milk and depression.
“If you’ve got enough computing power, you go through, measure every single pairwise association, and eventually find some that haven’t been written on before and publish: there’s a correlation between this and that,” Spick said. These correlations often oversimplify complex phenomena or represent random statistical flukes. “One was about how many years you spend in education supposedly causing postoperative hernia complications. That’s just a random correlation. What am I supposed to do with that? Leave school early to avoid future hernia complications?”
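To see why such findings are so easy to generate, consider a minimal sketch of the dredging Spick describes. Everything in it is hypothetical: the “survey” is pure random noise rather than any real dataset, yet a naive screen at p < 0.05 still “discovers” dozens of associations, because roughly 5 percent of all pairs clear that bar by chance alone.

```python
# Hypothetical illustration of data dredging: test every pairwise
# association in a dataset and report whatever clears p < 0.05.
# The "survey" here is pure random noise, so every hit is a fluke.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_subjects, n_vars = 500, 40              # 40 unrelated, made-up variables
data = rng.normal(size=(n_subjects, n_vars))

hits = []
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        if p < 0.05:                      # no multiple-testing correction
            hits.append((i, j, r, p))

n_pairs = n_vars * (n_vars - 1) // 2      # 780 pairs in total
print(f"{len(hits)} 'significant' associations out of {n_pairs} tested")
# Expect ~39 hits (5% of 780) even though nothing is actually correlated.
```

Run the same loop over real survey variables instead of noise and it yields walnuts-and-cognition-style “findings” just as readily; without corrections for multiple testing, the flukes are indistinguishable from genuine results.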
Over the years, sleuths have developed ways to detect inauthentic papers. Some search for “tortured phrases,” the nonsense substitutions left behind when an existing paper is run through a synonym generator to evade plagiarism detectors, turning “reinforcement learning,” say, into “reinforcement getting to know.” Others track duplicated images, analyze author networks, or check citations for hallucinated publications, a classic sign of LLM use. Spick looks for masses of papers that follow the same template while analyzing public datasets.
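The simplest of these screens can be automated in a few lines. Below is a toy, hypothetical version of a tortured-phrase check; the three entries are illustrative, and real screening projects maintain curated dictionaries running to thousands of such fingerprints.

```python
# Toy tortured-phrase screen: flag synonym-mangled terms that suggest a
# paper was run through a paraphraser. The phrase list is illustrative;
# real screeners rely on much larger curated dictionaries.
TORTURED_PHRASES = {
    "reinforcement getting to know": "reinforcement learning",
    "profound learning": "deep learning",
    "counterfeit consciousness": "artificial intelligence",
}

def flag_tortured_phrases(text: str) -> list[tuple[str, str]]:
    """Return (mangled phrase, likely original) pairs found in the text."""
    lowered = text.lower()
    return [(bad, good) for bad, good in TORTURED_PHRASES.items()
            if bad in lowered]

sample = "We apply reinforcement getting to know to the survey data."
for bad, good in flag_tortured_phrases(sample):
    print(f"suspect phrase: {bad!r} (likely mangled {good!r})")
```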
These papers might not be wrong, although they often mislead. Nor are they strictly fraudulent; they are simply useless and suddenly easy to produce. Last year, several journals began restricting submissions of papers analyzing public datasets, citing a flood of redundant research.
Spick fears these measures may be fighting a battle that is already out of date.
