As modern astronomy matures, individual researchers are becoming increasingly specialized, often at the expense of detailed knowledge in other fields; meanwhile, the dissemination of the vast datasets collected by today's telescopes is limited by the working hours available to the research community. The recent and rapid development of large language models (LLMs) may offer a solution to both problems, but only if their grasp and manipulation of the astronomy literature can be trusted. Yuan-Sen Ting and colleagues investigate this question through a comprehensive comparison of existing proprietary and open-weights (that is, modifiable and freely accessible) LLMs, and touch on the broader question of how artificial intelligence could transform the practice of science.
The authors compile a benchmarking set of ~4,500 multiple-choice questions, covering a range of astronomy topics drawn from over 85 published review articles, and feed these into a selection of ~20 LLMs to evaluate the accuracy, cost-efficiency, rate of improvement and confidence level of each. They find that although proprietary models such as Anthropic's Claude series generally outperform open-weights ones (reaching a maximum accuracy of 85%, albeit with significant scatter), their current costs may be prohibitive; open-weights models tend to be less accurate but are more affordable to run, and are improving more rapidly. Interestingly, the authors also identify performance differences between models across human languages and astronomy subfields, which they attribute to the varying amounts of training data available in each case.
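At its core, a multiple-choice benchmark of this kind reduces to posing each question to a model and scoring its letter answers against a key. The sketch below illustrates that loop in minimal form; it is not the authors' actual harness, and the question format, the ask_model callable and all names are assumptions introduced here for illustration.

```python
import random
from typing import Callable

# Hypothetical question record: {"stem": str, "options": {"A": str, ...}, "answer": str}
Question = dict

def format_prompt(q: Question) -> str:
    """Render one multiple-choice question as a plain-text prompt."""
    opts = "\n".join(f"{label}. {text}" for label, text in q["options"].items())
    return f"{q['stem']}\n{opts}\nAnswer with a single letter (A-D)."

def score_model(ask_model: Callable[[str], str], questions: list[Question]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for q in questions:
        reply = ask_model(format_prompt(q)).strip().upper()
        # Take the first A-D character in the reply as the model's choice.
        choice = next((c for c in reply if c in "ABCD"), None)
        correct += choice == q["answer"]
    return correct / len(questions)

if __name__ == "__main__":
    # A random "model" should score near the 25% four-option chance level.
    demo = [{"stem": "Which object powers a quasar?",
             "options": {"A": "A white dwarf", "B": "A supermassive black hole",
                         "C": "A neutron star", "D": "An open cluster"},
             "answer": "B"}] * 100
    guesser = lambda prompt: random.choice("ABCD")
    print(f"Random-baseline accuracy: {score_model(guesser, demo):.2f}")
```

In practice, ask_model would wrap an API call to each proprietary or open-weights model, with per-call token costs logged alongside correctness to produce the cost-efficiency comparisons the authors report.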