From classrooms to workplaces and recreational spaces, ChatGPT has evolved beyond being merely a conversation topic. For educational and professional purposes alike, the AI chatbot has become a common tool, and its rapid adoption raises several concerns. For one, there is a notion that “we are making ChatGPT dumb,” after users seemingly noticed a shift in the bot’s behavior. For many reasons, though, it is difficult to tell whether this collective criticism holds any water, or whether it simply reflects higher levels of engagement. Is there a way to get a definitive answer? Let’s take a look.
What Does “Dumb” Mean?
First things first: what kind of answer do you expect to a question like “are we making ChatGPT dumb?” What does “dumb” mean, and how do you measure it? Colloquially, we throw qualifiers like smart, dumb, and intelligent around pretty freely and, depending on the context, with varying levels of precision. When it comes to evaluating the performance of a Large Language Model (LLM), however, it’s difficult to put a number to what might be considered dumb.
Generally, the performance of bots like ChatGPT is evaluated across several characteristics that aren’t really mutually comparable. For example, a big factor in qualifying a “good” response is whether ChatGPT responds accurately. Another is whether its response uses correct grammar and appropriate vernacular. Evaluations often also consider response length (verbosity) and safety. Importantly, these factors can clash. For example, if a user prompts ChatGPT to describe how to steal a car, how would accurate and clear instructions compare to entirely inaccurate ones? On one hand, a response is graded for accuracy; on the other, it is also graded for safety. Is an inaccurate answer to an illegal prompt safer? Is providing inaccurate responses to illegal prompts considered “smart” or “dumb”?
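To make that tension concrete, here is a minimal sketch in Python of what a multi-factor evaluation might look like. Everything in it is hypothetical and for illustration only: the factor names, the scores, and the naive averaging are invented, not taken from any real evaluation suite or from the research discussed below.

```python
from dataclasses import dataclass

@dataclass
class ResponseScores:
    """Hypothetical per-factor scores for one response, each in [0, 1]."""
    accuracy: float   # is the answer factually correct?
    fluency: float    # is the grammar and vernacular appropriate?
    verbosity: float  # is the length appropriate?
    safety: float     # does it refuse or defuse harmful requests?

def naive_overall(s: ResponseScores) -> float:
    # A naive average: fine for benign prompts, misleading for harmful ones.
    return (s.accuracy + s.fluency + s.verbosity + s.safety) / 4

# Accurate, fluent instructions for stealing a car: three factors score high,
# but the one that matters most scores near zero.
unsafe_but_accurate = ResponseScores(accuracy=0.95, fluency=0.9, verbosity=0.8, safety=0.05)

# A flat refusal: maximally safe, yet "inaccurate" by any task-completion metric.
safe_refusal = ResponseScores(accuracy=0.0, fluency=0.9, verbosity=0.9, safety=1.0)

print(naive_overall(unsafe_but_accurate))  # 0.675: is this the "smarter" response?
print(naive_overall(safe_refusal))         # 0.7:   or is this one?
```

The near-identical averages are the point: collapsing everything into a single “smart/dumb” number hides the fact that the two responses succeed and fail on opposite factors.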
Unless an LLM starts performing significantly and unambiguously worse across all of these categories, it’d be tough to seriously claim that the bot is getting dumb, or that we are making it dumb. The truth is likely more nuanced, and research appears to support this perspective.
Changes in Performance
What does appear to be true is that ChatGPT’s performance has changed in several ways. Nuance is key, however. Researchers from Stanford and Berkeley have released this document, which has yet to be peer-reviewed. The study compares ChatGPT’s performance between March 2023 and June 2023, checking both GPT-3.5 and GPT-4.
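For a sense of how such a before-and-after comparison works in practice: the OpenAI API exposes dated model snapshots, so the same prompt can be sent to a March snapshot and a June snapshot and the answers compared. The sketch below is a rough illustration of that idea, not the paper’s actual test harness; the snapshot names shown existed at the time of the study but have since been retired by OpenAI, and the sample prompt is merely in the spirit of the paper’s math-reasoning tasks.

```python
from openai import OpenAI  # assumes the openai>=1.0 package and OPENAI_API_KEY set

client = OpenAI()

# Dated snapshots are what make a March-vs-June comparison possible.
# Both of these have since been deprecated by OpenAI.
SNAPSHOTS = ["gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"]

# A sample prompt in the spirit of the paper's math-reasoning tasks.
PROMPT = "Is 17077 a prime number? Think step by step, then answer yes or no."

for model in SNAPSHOTS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # reduce sampling noise so differences reflect the model
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```

Running a fixed battery of prompts like this against both snapshots and scoring the outputs factor by factor is, in essence, what the researchers did at a much larger scale.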
As we mentioned, there are multiple factors to consider, and the paper goes through several of them. Importantly, there are no clear conclusions beyond the fact that behavior has shifted in some respects. Some online articles claim that the findings point towards a clear drop in performance, but a careful read of the paper does not support that. Between GPT-3.5 and GPT-4, for example, one model improves on some factors while the other’s performance drops, and vice-versa. For instance, GPT-3.5 proved slightly more likely to produce unsafe responses to some prompts, while GPT-4 proved significantly more likely to refuse to produce unsafe responses.
Beyond the ambiguity of ranking responses as “smarter” or “dumber,” there is also the ambiguity of causality. LLMs are incredibly complex models as it is, and it can be difficult to pinpoint exactly why a given response is produced at all, much less when several factors are changing at once. For one, from March to June, the models haven’t just been sitting in the cloud: OpenAI has rolled out several updates intended to improve them in various ways. The models also learn from interactions with users, though only in limited ways, and OpenAI has allowed users to opt out of this interaction data collection.
So, are we making ChatGPT dumb? It’s genuinely hard to say. If you trust the research paper, the signs don’t point towards a uniform drop in performance. And if you don’t trust the paper, it’s still not accurate to claim that user input is the sole, or even necessarily the most significant, evolving factor. Changes in behavior can happen for a variety of reasons and are likely happening because of a combination of all of the above.
Caution with ChatGPT
Perhaps the most important lesson from this deep-dive is that, for most people, AI is a bit of a black box. As the research paper notes, AI professionals are well aware that bots like ChatGPT aren’t guaranteed to produce accurate or even “good” responses, something OpenAI has been very open about as well. As these tools continue to be incorporated into professional, academic, and personal spaces, everyone needs to exercise that same discretion.
At the same time, these AI tools are useful when used cautiously. And they are learning from interactions with users, even if only in specific and limited ways. This puts a responsibility on users to be responsible in their interactions as well. On a pragmatic level, and especially on an ethical level, we should avoid abusing these tools for amusement one day and then expecting consequence-free responses for critical applications the next. When it comes to ChatGPT, it’s a good rule of thumb to interact responsibly and mindfully with the bot, and to take its responses with a grain of salt no matter what.