10 November 2025, 11:25
Study: AI Capabilities May Be Overstated Due to Flawed Tests

Khaberni - Researchers behind a new study say that the methods used to assess the capabilities of artificial intelligence systems often overstate their performance and lack scientific rigor.

The study, led by researchers at the Oxford Internet Institute in partnership with more than 30 researchers from other institutions, examined 445 prominent AI tests, known as benchmarks, which are widely used to gauge the performance of AI models across various domains.

AI developers and researchers use these benchmarks to assess the capabilities of their models and to promote the technical progress they are making, relying on them to support claims about everything from software engineering performance to abstract reasoning ability.

However, the study, which was released last week, argues that these foundational tests may not be reliable, casting doubt on the validity of many benchmark results, according to a report by "NBC News" reviewed by "Arabia Business".

According to the study, a large share of prominent benchmarks fail to precisely define what they aim to test, reuse data and testing methods from existing benchmarks to a concerning degree, and rarely use sound statistical methods to compare results between models.

Adam Mahdi, a senior researcher at the Oxford Internet Institute and one of the study's lead authors, argued that these benchmarks can be alarmingly misleading.

Mahdi told "NBC News": "When we ask AI models to perform certain tasks, we are often actually measuring completely different concepts or constructs from the ones we intend to measure."

Andrew Bean, a researcher at the Oxford Internet Institute and another of the study's lead authors, agreed that even reliable benchmarks are often trusted blindly and deserve closer scrutiny.

Some of the benchmarks examined in the study measure specific skills, such as proficiency in Russian or Arabic, while others measure more general capabilities, such as spatial reasoning and continual learning.

One of the researchers' core questions was whether a benchmark is actually a good test of the real-world phenomenon it aims to measure. Rather than testing a model on an endless series of questions to assess its ability to speak Russian, for example, one of the benchmarks reviewed in the study gauges the model's performance on nine different tasks, such as answering yes-or-no questions using information drawn from Russian Wikipedia.

Yet about half of the benchmarks examined in the study fail to clearly define the concepts they claim to measure, raising doubts about their ability to provide useful information about the AI models being tested.

In the new paper, the authors recommend eight improvements and present a checklist intended to standardize benchmarks and strengthen transparency and trust in them. The proposed improvements include defining the scope of what is being assessed, constructing task sets that better represent the broader capability being measured, and comparing model performance using statistical analysis.

This study builds on earlier research that identified flaws in several AI benchmarks.

Last year, researchers at the artificial intelligence company "Anthropic" called for greater use of statistical testing to determine whether a model's performance on a given benchmark genuinely reflects a difference in capability, or is merely a lucky outcome given the particular tasks and questions included in that benchmark.
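The kind of check described here can be sketched in a few lines of Python. The snippet below is a minimal illustration only: the two models, their per-question scores, and the paired_bootstrap_gap helper are hypothetical and are not taken from the Oxford study or from Anthropic's work. It resamples the benchmark's questions with replacement and asks how often one model's lead over the other survives the resampling.

```python
import random

def paired_bootstrap_gap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often an accuracy gap between two models survives
    resampling of the question set (illustrative sketch only)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed_gap = (sum(scores_a) - sum(scores_b)) / n
    times_a_ahead = 0
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing
        # between the two models' answers to the same question.
        idx = [rng.randrange(n) for _ in range(n)]
        gap = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if gap > 0:
            times_a_ahead += 1
    return observed_gap, times_a_ahead / n_resamples

# Hypothetical per-question results (1 = correct, 0 = incorrect) for two
# models on the same 200-question benchmark; values are randomly generated.
demo_rng = random.Random(42)
model_a = [1 if demo_rng.random() < 0.82 else 0 for _ in range(200)]
model_b = [1 if demo_rng.random() < 0.78 else 0 for _ in range(200)]

gap, frac_a_ahead = paired_bootstrap_gap(model_a, model_b)
print(f"observed accuracy gap: {gap:+.3f}")
print(f"fraction of resamples where model A stays ahead: {frac_a_ahead:.2%}")
```

If the lead reverses or vanishes in a large share of resamples, the measured difference is more plausibly an artifact of the particular questions sampled than evidence of a genuine capability gap, which is the concern the Anthropic researchers raised.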
