ChatGPT is NOT an oracle and should NOT be treated as one

Published in the special issue of Bocconi’s Magazine.

The other day, I heard someone highly ranked in their profession say: “I didn’t know how to answer that question because I didn’t know what they meant by that term in that context, so I asked ChatGPT to solve my doubt.” A couple of weeks ago, a friend of mine went to the doctor. At some point, in the middle of the discussion, the doctor pulled out her cell phone, googled the question, showed my friend Google’s AI Overview, and told him she was right because the AI Overview confirmed her belief. On a beautiful Sunday, my friend told me he wanted to change his dog’s diet because the dog is getting old. He was planning to ask ChatGPT to calculate the proportions for the diet, and he asked me what I thought about it. I didn’t even have time to reply before he looked at my worried expression and continued: “Yes, then I can ask Claude or DeepSeek to compare answers and estimate the accuracy.” At that point, my worried face grew even more expressive, turning almost to despair.

These are only a few examples of what I’ve heard or experienced in the last few weeks: two experts relying on generated answers as a source of truth, and a friend treating agreement between different AI systems as an estimate of the accuracy of the generated information. Instinctively, my reaction is “absolutely not, please stop doing that now.” Then my researcher side takes over and asks: where are the papers showing that this is problematic? And indeed, there are plenty of them. Extensive research documents the problematic behavior of large language models (LLMs), the same language models that power chatbot interfaces such as ChatGPT, Claude, and DeepSeek. For example, article citations generated by LLMs are often hallucinated or misplaced; summaries produced through retrieval-augmented generation (RAG) web searches may not accurately reflect the content of the original webpages; factual information can be fabricated; low-frequency knowledge in the training data is difficult for LLMs to learn; and LLMs exhibit sycophancy, agreeing with users even when the user is incorrect. This list of limitations is not exhaustive and continues to grow.

LLMs have mostly been trained on documents (webpages) crawled from the internet. The information they learn is the information present in these webpages. The great majority of these pages are blog posts and forum threads that have never been checked for accuracy or truthfulness. Only a VERY small portion of the training data is considered “high-quality” because it comes, for example, from accredited news outlets, Wikipedia pages, books, and peer-reviewed articles. Moreover, even though LLMs are trained by different organizations, these models are so greedy for data that their creators all use as much of the data available on the internet as possible (filtered for harmful content). Consequently, LLMs created by different organizations end up having similar biases and respond to user queries in similar ways, as shown by previous studies. Going back to my friend and his dog, this suggests that if one model is wrong, the other will most likely be wrong as well, unless it has been post-aligned with the correct information. We have no way of knowing whether that is the case, since most LLM chat interfaces available online, such as ChatGPT and Claude, are proprietary, and their training is therefore unknown to us.
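
For readers who like to see the arithmetic, here is a minimal sketch of why agreement between two models is a weak estimate of accuracy when their errors come from the same data. Everything in it is an assumption made for illustration: the function name `accuracy_given_agreement`, the 30% error rate, and the 10% chance that two independently wrong models happen to give the same wrong answer are made-up numbers, not measurements of any real system.

```python
def accuracy_given_agreement(p_wrong, shared_fraction, same_wrong_if_independent):
    """Toy model: probability that two agreeing models are both correct.

    p_wrong: chance that a single model answers incorrectly (assumed).
    shared_fraction: share of those errors caused by common training data,
        in which case both models give the same wrong answer (assumed).
    same_wrong_if_independent: chance that two independently wrong models
        happen to give the same wrong answer (assumed).
    """
    # Probability that both models are wrong together because of shared data.
    p_shared = p_wrong * shared_fraction
    if p_shared >= 1.0:
        return 0.0
    # Remaining independent error rate, chosen so each model is still
    # wrong with total probability p_wrong.
    r = (p_wrong - p_shared) / (1.0 - p_shared)
    both_correct = (1.0 - p_shared) * (1.0 - r) ** 2
    agree_but_wrong = p_shared + (1.0 - p_shared) * r ** 2 * same_wrong_if_independent
    total_agree = both_correct + agree_but_wrong
    if total_agree == 0.0:
        return 0.0
    return both_correct / total_agree


# Compare fully independent, half-shared, and fully shared errors.
for shared in (0.0, 0.5, 1.0):
    print(shared, round(accuracy_given_agreement(0.3, shared, 0.1), 3))
```

Under these made-up numbers, agreement between two models with fully independent errors would imply an accuracy of roughly 98%. When all errors are shared, agreement implies 70%, which is exactly the accuracy of a single model: the second opinion adds nothing.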

My point is not that we should stop using ChatGPT and other chat interfaces: these language technologies are here to stay, and they can help us be more creative, productive, and efficient. Chat interfaces powered by LLMs serve many great functions, and one of them is to provide us with information, much as books, Wikipedia pages, and news articles do. However, there are two main differences between the more traditional sources and LLM-powered chatbots. First, these tools digest and summarize information for us to a much greater degree than books do, which makes them appealing because of their convenience and efficiency. After all, we humans like to take shortcuts to lower our cognitive load. Second, every interaction with an LLM is new: each answer is generated on the spot, and the same question can be answered in different ways even when asked with exactly the same prompt (try it yourself). This means that the answers provided by these technologies have not been peer-reviewed or checked, nor have they undergone any editorial process or decision.
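
One common reason the same prompt can produce different answers is that chat systems typically pick each next word by sampling from a probability distribution rather than always choosing the most likely one. The sketch below is only a toy illustration of that idea: the three candidate words and their probabilities are invented, and `sample_answer` is a hypothetical helper, not how any particular chatbot is implemented.

```python
import random

# Hypothetical next-word probabilities for one fixed prompt (made-up numbers).
NEXT_WORD_PROBS = {"rice": 0.5, "chicken": 0.3, "pumpkin": 0.2}


def sample_answer(rng):
    """Draw one word at random, weighted by its probability."""
    words = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=1)[0]


# "Ask" the same question 20 times; the seed only makes the run repeatable.
rng = random.Random(0)
answers = [sample_answer(rng) for _ in range(20)]
print(answers)
```

The prompt never changes, yet the list contains different words, because each answer is a fresh random draw and no editor sits between the draw and the reader.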

That said, my main point is that we must be even more critical with these tools than when consuming information from traditional media. In practice, this means verifying generated content against authoritative sources rather than taking it at face value. How much verification is needed depends on the task. Tasks such as paraphrasing or translating text are generally low-risk because the tool just transforms information we’ve already written, though the output should still be reviewed for potential misunderstandings. On the other hand, asking LLMs to brainstorm ideas, explain concepts, or retrieve facts is higher-risk, because the resulting information may contain errors and misinformation, either due to hallucinations or simply because the LLM has learned incorrect information. The truthfulness of the generated answers in high-risk tasks should be evaluated carefully and confirmed through other authoritative sources. I am aware that the process takes longer… but is reaching a post-truth reality a good trade-off for being more “efficient”? I personally believe it is not, and I believe that we should care about the information we transmit and the information we rely on to make personal and professional decisions.