• mabeledo@lemmy.world
    16 hours ago

    Even the number is a bit misleading. First of all, anyone who has ever done LLM benchmarking knows that it isn’t an exact science at all. A model can score 99% on one benchmark and fail every single task on another.

    But even this particular claim is nuanced. From the original article:

    But with Gemini 3, Google’s A.I.-generated answers were more likely to be ungrounded than when the system was based on Gemini 2, meaning the websites they linked to did not completely support the information they provided. In October, correct answers were ungrounded 37 percent of the time. In February, with Gemini 3, that figure rose to 56 percent.

    See https://www.nytimes.com/2026/04/07/technology/google-ai-overviews-accuracy.html

    Meaning that for 56% of the correct answers, users cannot even verify the information given by the LLM against the sources the LLM claims it’s using.