20 December 2025 • 17:18
Google confirms AI chatbot accuracy does not exceed 70%

Khaberni - Google has found that the accuracy of chatbots based on AI language models does not exceed 70% in many cases, following an intensive study it conducted on the precision of these models, according to a report by "Digital Trends".

Google published the results of the study in an 18-page report that discusses the testing methodology and the reasons behind the low scores.

The report indicates that AI language models get roughly one out of every three questions wrong, even when their answers appear plausible.

The model "GeminAI 3 Pro," recently introduced by Google, achieved the highest score in this test, reaching just over 69%, followed by "GeminAI 2.5 Pro" with 62% and "GPT 5" with 61.8%. The "Cloud Ops 4.5" scored 51% and "Grok" 53%.

 

Intensive Testing Mechanism

The company's "DeepMind" laboratory, which carried out the study, relied on four different evaluation criteria:

•           A parametric criterion: measures the model's ability to retrieve accurate answers from its internal knowledge base when asked real-world questions.

•           A search criterion: tests the model's ability to search the internet and use general search tools to retrieve and accurately assemble information.

•           A multimodal criterion: measures the model's ability to answer questions about supplied images correctly and accurately.

•           A "Grounding 2" criterion: a broader test of the model's ability to base its answers on a supplied context and remain aligned with it.

The study was conducted in partnership with "Kaggle," one of the largest communities devoted to data science, which provides leading resources and tools for studying and analyzing data.

Each criterion comprises more than 3,500 test items that were shared openly with the scientific community, while the company kept a set of tests private; the score for each criterion is calculated as the average of the public and private test results.
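To make the scoring concrete, here is a minimal sketch of the scheme as described above: each criterion's score averages a public and a private test-set accuracy. The final step, averaging the four criterion scores into one overall figure, is an assumption for illustration, since the article only states how per-criterion scores are combined. All names and numbers below are illustrative placeholders, not figures from the study.

```python
# Minimal sketch of the scoring scheme described above. Each criterion's
# score is the average of a public and a private test-set accuracy; the
# overall figure averages the four criterion scores (an assumption: the
# article only states how per-criterion scores are combined). All names
# and numbers are illustrative placeholders, not figures from the study.

def criterion_score(public_acc: float, private_acc: float) -> float:
    """Average the public and private test-set accuracies for one criterion."""
    return (public_acc + private_acc) / 2

def overall_score(per_criterion: dict[str, float]) -> float:
    """Average across all criteria to get a model's overall result."""
    return sum(per_criterion.values()) / len(per_criterion)

# Hypothetical (public, private) accuracies for one model on each criterion.
scores = {
    "parametric": criterion_score(0.66, 0.62),  # -> 0.64
    "search":     criterion_score(0.74, 0.70),  # -> 0.72
    "multimodal": criterion_score(0.58, 0.54),  # -> 0.56
    "grounding":  criterion_score(0.72, 0.68),  # -> 0.70
}

print(f"overall: {overall_score(scores):.1%}")  # prints "overall: 65.5%"
```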

The study also examined the performance of AI models across a range of specific domains, such as music, technology, history, science, sports, and even politics and entertainment television programs.

 

Wide Variability in Results

The results achieved by each model varied with the type of questions and the criterion applied: while "Gemini 3 Pro" led overall, results on individual criteria differed significantly.

The "Digital Trends" report points to the superiority of "Chat GPT 5" in the baseline and research criteria, with the multimedia criterion being the weakest point in all models.

Notably, "Grok 4 Fast" was the weakest model across all tests, with an average result of 36%, which dropped to 17% on the multimodal criterion and 15% on the parametric criterion.

According to the report, the study confirms the shortcomings of AI tools on specialized and detailed tests, adding that even a small percentage of incorrect answers can cause significant harm in sectors such as healthcare or finance.

 
