Limitations of LLMs

Inherent Characteristics – Probabilistic, not Deterministic

Many of the limitations of LLMs stem from characteristics inherent in their design. Unlike much of the software we are used to working with, whose deterministic nature offers predictable outcomes given a specific input, LLMs operate on a probabilistic framework. This means that when ChatGPT is formulating an answer to a prompt, it doesn’t really “comprehend” anything it is writing; rather, it is probabilistically assembling the closest thing it can to what it considers a “good answer,” drawing on its training data as source material. What it considers a “good answer” comes largely from the fine-tuning we discussed in the last section, and importantly, how “good” the answer is has much more to do with its resemblance to the form of exemplar answers than to the exact content of the answer.
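
To make the mechanism concrete, here is a minimal sketch of next-token sampling, assuming a toy, hard-coded probability table in place of the neural network a real LLM uses; the context string, the probabilities, and the sample_next_token function are all invented for illustration.

    import random

    # Toy stand-in for a real model: an LLM estimates a probability for every
    # token in its vocabulary given the text so far. Here the distribution is
    # hypothetical and hard-coded.
    next_token_probs = {
        "The capital of France is": {"Paris": 0.92, "Lyon": 0.05, "beautiful": 0.03},
    }

    def sample_next_token(context: str) -> str:
        """Pick the next token at random, weighted by its estimated probability."""
        probs = next_token_probs[context]
        return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

    context = "The capital of France is"
    print(context, sample_next_token(context))  # usually "Paris", occasionally not

Nothing in that loop checks whether "Paris" is true; it is simply the most statistically likely continuation, which is all the procedure is designed to produce.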

Accuracy

While this probabilistic design enables ChatGPT to excel at generating text that is syntactically correct and contextually plausible, it is also a vulnerability: the model inherently prioritizes textual coherence and fluency over factual accuracy or logical consistency. ChatGPT’s “accuracy” in generating information can therefore vary significantly, as its outputs are based on patterns in the data it was trained on, and the quality of that data varies. So while ChatGPT can produce responses that seem accurate, and can do so with boundless confidence, its reliance on its training data means it may inadvertently propagate inaccuracies present within that data.

But it is not at all clear that the issue is limited to the training data. As Yann LeCun, Chief AI Scientist at Meta, puts it:

“Large language models have no idea of the underlying reality that language describes… Those systems generate text that sounds fine, grammatically, semantically, but they don’t really have some sort of objective other than just satisfying statistical consistency with the prompt.” (Smith, 2023)

Errors of generation in the output of LLMs are often colloquially referred to as hallucinations. We consider this an overly broad definition, and would refer readers to a more thorough discussion of the types of errors ChatGPT specifically is prone to in Ali Borji’s “A Categorical Archive of ChatGPT Failures.” Some of these errors have already been corrected, and more will eventually be addressed; the models are continually being improved, and individual problem cases will be targeted as they are identified. But so long as the underlying architecture remains probabilistic, it is likely that unexpected errors will continue to arise in the output.

Because of this, output from LLMs must be vetted by someone with sufficient subject matter expertise to spot errors that would otherwise look plausible to a non-expert. This is especially important for output that can be “formed correctly” (e.g., an ISBN, an APA-formatted reference, a barcode) while containing incorrect information. Quite often, only an expert in the domain ChatGPT is writing about could spot such errors, which makes them all the more dangerous for students or novices.
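
As a concrete illustration, the sketch below validates the check digit of an ISBN-13 (the function name and sample number are ours, not drawn from any catalog). A fabricated ISBN can pass this kind of structural check while still pointing to the wrong book, or to no book at all, which is exactly the sort of error an automated format check cannot catch and an expert or an authoritative lookup can.

    def isbn13_checksum_ok(isbn: str) -> bool:
        """Return True if a 13-digit ISBN has a valid check digit (structure only)."""
        digits = [int(c) for c in isbn.replace("-", "") if c.isdigit()]
        if len(digits) != 13:
            return False
        # ISBN-13 rule: weight the digits alternately 1 and 3; the sum must be divisible by 10.
        return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

    # A checksum-valid ISBN can still be attached to the wrong title, the wrong
    # author, or a book that does not exist at all.
    print(isbn13_checksum_ok("978-0-306-40615-7"))  # True: well formed, but says nothing about content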

Precision

The same probabilistic design allows for flexibility and adaptability, enabling ChatGPT to produce varied and contextually appropriate responses. However, it also introduces a degree of unpredictability: given the same prompt, ChatGPT may generate different responses at different times, reflecting the range of possibilities it learned during training. While ChatGPT exhibits a kind of conceptual consistency in following the patterns it has learned, precision in the scientific sense refers to the reproducibility of results under the same conditions. By that measure its outputs are inherently variable: identical prompts yield diverse responses, reflecting a range of potential answers rather than a single, repeatable result.
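
The sketch below illustrates why identical prompts need not produce identical outputs. The candidate tokens, their scores (logits), and the temperature values are all invented for illustration: the distribution is sharpened or flattened by a temperature parameter and then sampled, so repeated runs on the same input can land on different answers.

    import math
    import random

    # Hypothetical scores a model might assign to candidate next tokens for one
    # fixed prompt; a real model scores its entire vocabulary.
    logits = {"Paris": 4.0, "Lyon": 1.5, "Marseille": 1.0}

    def sample(logits: dict, temperature: float = 1.0) -> str:
        """Softmax over the logits at the given temperature, then sample one token."""
        scaled = {t: v / temperature for t, v in logits.items()}
        peak = max(scaled.values())
        weights = {t: math.exp(v - peak) for t, v in scaled.items()}
        return random.choices(list(weights), weights=list(weights.values()), k=1)[0]

    # Same prompt, same code, different runs: the answers can differ.
    print([sample(logits, temperature=1.0) for _ in range(5)])
    print([sample(logits, temperature=0.1) for _ in range(5)])  # nearly always "Paris"

Commercial chatbots typically sample at a nonzero temperature, which is why two people asking the identical question can receive noticeably different answers.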

Black-box Problem

Related to the accuracy and precision issues discussed above, ChatGPT and other LLMs are often said to have a “black-box” problem, referring to the opaqueness of their inner workings. Even skilled developers may struggle to understand or trace how these models arrive at a particular output from a given input. This lack of transparency makes it difficult to diagnose errors, understand model biases, and ensure the reliability of the model’s outputs. It presents two major problems. First, it hinders identifying and resolving errors: if a model produces an inappropriate or unsafe response, understanding the internal decision-making process behind it is crucial for correction. Second, the lack of transparency erodes trust, particularly in high-stakes applications (e.g., chatbots that might influence medical, legal, or ethical decisions).

For users of commercial LLM services like ChatGPT, Gemini, and Bing Copilot, the black-box problem becomes a multiplier on any accuracy and precision issues. The models behind these live services are constantly evolving, and at best users receive broad-strokes announcements of new functionality when major releases occur. OpenAI, Google, and Microsoft do not publish detailed changelogs itemizing bug fixes for each iterative “x.1” update, the way they might for other software. On discussion forums and listservs dedicated to ChatGPT, users often report that performance on specific types of queries or tasks (e.g., arithmetic) changes over time, generally for the better but sometimes for the worse.

Thus an end user knows that the models are at times inaccurate, in ways that are often subtle and counterintuitive to human reasoning. They know that chatbots are by design somewhat imprecise and will not reliably respond to the same input in the exact same way. And because of the black-box problem, they know that the degree and type of inaccuracy and imprecision can vary over time as the models are updated. The user has visibility neither into the updating process nor into any sort of progress or error log that would allow them to troubleshoot the steps by which the LLM got from their input to its output. This is not meant as an indictment of ChatGPT and similar tools; rather, we wish to point out that by two measures commonly used in STEM, accuracy and precision, LLM chatbots do not fare particularly well. This is a useful mental shorthand to keep in mind when evaluating their suitability for a given task or use case. Many tools that we use every day to great effect are neither precise nor accurate; snow shovels, blenders, funnels, and many other tools serve their purpose perfectly without a high degree of either. But we should not reach for them when accuracy or precision is important.

Future Improvements

As LLMs are rapidly improved, some of these technical limitations will likely be addressed. We briefly discuss some of the promising lines of research and development being pursued.

Brute Force

One of the advantages of the transformer architecture is that performance scales predictably with the computing power applied, the size of the training dataset, and the number of parameters in the model (Kaplan et al., 2020). Thus the low-hanging (though expensive) fruit of applying “more of everything” to the problem will likely remain the first choice of most LLM operators until resource constraints make it too costly or impractical.
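
Kaplan et al. express these relationships as power laws, for example test loss as a function of parameter count N, L(N) = (N_c / N)^alpha. The sketch below uses that functional form with rounded constants in the spirit of the paper’s reported fits (treat the specific numbers as assumptions, not quoted results); the point is that each additional order of magnitude of “more” buys a smaller but predictable improvement.

    # Illustrative only: the power-law form follows Kaplan et al. (2020), but the
    # constants below are rounded/assumed for demonstration.
    N_C = 8.8e13      # assumed "critical" parameter count
    ALPHA_N = 0.076   # assumed exponent for scaling with model size

    def predicted_loss(n_params: float) -> float:
        """Predicted test loss as a function of parameter count: L(N) = (N_c / N) ** alpha."""
        return (N_C / n_params) ** ALPHA_N

    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"{n:.0e} parameters -> predicted loss {predicted_loss(n):.2f}")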

Expanding Extensibility

Because current LLMs are well suited to language tasks and ill suited to many others, a logical approach is to use them where they are strong and give them access to other tools where they are not. Indeed, this is already possible through ChatGPT’s plugin architecture and powerful tools like Wolfram Alpha. It should be noted, however, that as reliance on external tools grows, the LLM’s ability to fully understand what is being asked of it in the prompt, the capabilities of its “tools,” and how to correctly format inputs for them becomes increasingly important.
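
A minimal sketch of that tool-use pattern, with hypothetical tool names and an invented JSON call format (no claim is made that this mirrors ChatGPT’s actual plugin protocol): the model emits a structured request, and the application checks that the requested tool exists and that the arguments are usable before running it.

    import json

    # Hypothetical registry of tools the application exposes to the model.
    TOOLS = {
        "add": lambda args: str(args["a"] + args["b"]),
        "km_to_miles": lambda args: f'{args["km"] * 0.621371:.2f} miles',
    }

    def dispatch(model_output: str) -> str:
        """Parse a (hypothetical) JSON tool call emitted by the model and run it."""
        try:
            call = json.loads(model_output)
            tool = TOOLS[call["tool"]]        # does the requested tool exist?
            return tool(call["arguments"])    # were the arguments formatted correctly?
        except (json.JSONDecodeError, KeyError, TypeError) as err:
            # If the model misunderstands a tool or mis-formats the call,
            # the pipeline fails here, which is the caveat raised above.
            return f"tool call failed: {err!r}"

    print(dispatch('{"tool": "add", "arguments": {"a": 17, "b": 24}}'))      # 41
    print(dispatch('{"tool": "weather", "arguments": {"city": "Halifax"}}')) # fails safely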

Appealing to Authority

One method developers use to reduce the tendency of LLMs to hallucinate is incorporating Retrieval Augmented Generation (RAG):

RAG involves an initial retrieval step where the LLMs query an external data source to obtain relevant information before proceeding to answer questions or generate text. This process not only informs the subsequent generation phase but also ensures that the responses are grounded in retrieved evidence, thereby significantly enhancing the accuracy and relevance of the output. (Gao et al., 2024, p. 1)

The quality of the authoritative data source being queried obviously has a huge impact on how well RAG works. Even the ability of some LLMs, like Bing Chat/Copilot, to incorporate web search results can be seen as an imprecise form of RAG, and early user feedback indicated that the search results incorporated into outputs (helpfully, Bing Chat/Copilot cites them) were of mixed quality. Nevertheless, where suitable data exist, the technique can work very well, and it remains an active area of development.
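
Here is a minimal sketch of the retrieve-then-generate shape of RAG, using a toy keyword-overlap retriever and an invented three-document store in place of the embedding search and authoritative source a production system would use; only the assembled prompt is shown, with the final generation step left to the LLM.

    import re

    # Toy document store standing in for an authoritative source (a catalog,
    # policy manual, institutional repository, etc.).
    DOCUMENTS = [
        "Library hours: Monday to Friday, 8am to 10pm; weekends, 10am to 6pm.",
        "Interlibrary loan requests are typically filled within 5 business days.",
        "Course reserves circulate for 3-hour loans and must stay in the building.",
    ]

    def tokens(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def retrieve(question: str, k: int = 2) -> list:
        """Rank documents by crude keyword overlap with the question (toy retriever)."""
        q = tokens(question)
        return sorted(DOCUMENTS, key=lambda d: -len(q & tokens(d)))[:k]

    def build_prompt(question: str) -> str:
        """Ground the generation step by prepending the retrieved evidence."""
        evidence = "\n".join(retrieve(question))
        return ("Answer using ONLY the sources below; say so if they are insufficient.\n"
                f"Sources:\n{evidence}\n\nQuestion: {question}")

    print(build_prompt("What are the library hours on weekends?"))

The critical design choice is the instruction to answer only from the retrieved sources, which is what grounds the output in evidence rather than in whatever the model’s training data happens to suggest.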

Improving Interpretability

Because of the aforementioned “black-box” nature of LLMs, when the output is deficient in some way it can be difficult to know why, even when we can see how. Interpretability research is broadly focused on methods that provide insight into the processes LLMs use to arrive at their conclusions.

“Technical” Limitations Related to Business Models

Privacy and Security Issues

The privacy and security concerns related to the use of LLMs in academic settings fall into three broad categories:

  1. Data Storage and Retention: There are concerns about how student and faculty data input into LLMs are stored, for how long, and under what conditions. The lack of clarity about data retention policies can raise questions about the potential for misuse of sensitive information.
  2. Security of User Data: The risk of data breaches is significant, as such incidents could expose confidential academic work, personal information of students and faculty, and proprietary research data.
  3. Vendors’ Business Models: The business models of LLM providers might not always align with the best interests of educational institutions regarding data privacy and security. There is concern that student data could be used for purposes beyond the educational scope, such as training the models without explicit consent or for commercial gains.

Addressing these concerns requires transparent policies from LLM providers on data handling, robust security measures to protect user data, and clear contractual agreements between vendors and institutions that prioritize the educational institution’s privacy and security requirements. Institutions should treat LLMs no differently from other software they license for student use, and should demand the same contractual guarantees regarding data security and privacy that they expect for other enterprise software.

