Hmm… even just framing this properly already looks pretty hard.
My short answer: yes, hallucinations, user ratings, and “AI surfing” are interesting, but the deeper issue is that consumer AI ratings are not simply model ratings.
A normal user is usually not evaluating “GPT,” “Claude,” or “Gemini” as raw models. They are evaluating ChatGPT, Claude, Gemini, or another AI…