Science journalists find ChatGPT is bad at summarizing scientific papers (but are they, really?)

September 24, 2025 | Written / Created / Syndicated by Taofeek Folami

This was reported by Ars Technica, with many more details in the white paper (PDF) written by the Science Press Package team, SciPak.

I have no reason to doubt the findings, but do note the caveats that appear in the paper itself:

This does not mean that the LLM has no potential value as a tool for other science communication
outlets. The findings of this project are specific to ChatGPT Plus’ adherence to SciPak style and
standards. Moreover, this assessment could not account for human biases…

Regarding that last point, Ars Technica points out:

…which we’d argue might be significant among journalists evaluating a tool that was threatening to take over one of their core job functions.

The actual prompts used by the evaluators are listed in the appendix of the paper (pg. 9), and are a nice illustration of how one should write a prompt if one is looking for a specific type of response. Sadly, the paper doesn’t indicate whether the results of the most specific prompt were generally better than those of the less specific ones:

In early April 2024, the team revised the writer survey to include more specificity. Before then, each
writer who nominated a paper reviewed the overall ability of ChatGPT Plus, assessing its collective
performance across the three generated summaries. After the revision, writers evaluated the LLM’s
performance for each individual summary instead. This led to a more detailed interpretation of the
LLM’s skills. Because this data is qualitative and anecdotal, it does not lend itself to graphs.

I think it’s important to do your own testing, because one of the ways we’re seeing students, especially, use LLMs is for exactly this purpose: summarizing longer and more difficult papers. If the summaries are wrong, that’s obviously concerning. But if the summaries are right and merely don’t conform to a particular style, that’s much less concerning, IMHO, and could possibly be corrected through better prompting.

Source of Article

TemiLib
