"

1.2 What do we know about how large language models work?


Training

The AI model is pre-trained on a large dataset, typically of general texts or images. Specialized AI models may instead be trained on subject- or domain-specific datasets. The AI analyses the data, looking for patterns, themes, relationships, and other characteristics that can be used to generate new content.

For example, early versions of GPT (the LLM family used by both ChatGPT and Copilot) were trained on hundreds of gigabytes of text data, including books, articles, websites, publicly available texts, licensed data, and human-generated data.

Human intervention can occur at all stages of training.

Humans may:

  • Create and modify the initial dataset, removing messy or problematic data
  • Assess the quality of output from the AI model

The model is then released for use and can be accessed by users.

Generation

LLMs generate text in response to a user-provided prompt.

A diagram showing the cycle of interacting with a Large Language Model in 4 steps: the user inputs a prompt; the LLM tokenises the prompt, predicts a response, and shares the output.
Prompting 

A user provides a prompt, asking the AI model to perform a specific task, generate text, produce an image, or create other types of content.

Prompt: Tell me a joke about higher education.
Tokenization 

The AI breaks the prompt into tokens (words, parts of words, or other meaningful chunks) and analyses these tokens to understand the meaning and context of what is being asked.

Token Breakdown: ["Tell", "me", "a", "joke", "about", "higher", 
"education", "."]

NOTE: Words might also be broken into subword tokens like “high”, “er”, “edu”, “cation”.

These tokens are then converted into vectors (numerical values) that represent each token’s relationship to other tokens, capturing how likely they are to occur together in sequence.
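To make tokenization concrete, the short sketch below shows a real tokeniser splitting our example prompt. It assumes OpenAI’s open-source tiktoken library (installed with pip install tiktoken); other models use their own tokenisers and will split text differently.

# A minimal tokenisation sketch using OpenAI's open-source tiktoken
# library. Other models use different tokenisers and token IDs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an encoding used by GPT-4-era models

prompt = "Tell me a joke about higher education."
token_ids = enc.encode(prompt)                 # one integer ID per token
tokens = [enc.decode([t]) for t in token_ids]  # the text each ID stands for

print(token_ids)  # a list of integers, one per token
print(tokens)     # text fragments roughly matching the breakdown above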

Prediction 

The LLM analyses the vectors and, based on the patterns and other information learned in training, predicts a response to the prompt one token at a time, based on the probability of each token appearing next in the sequence.

For example, in responding to our request to generate a joke, the most likely starting token may be “Why”, “What”, “Here’s” etc.

 

A bar chart titled “Probability Distribution for Next Token”. The next-token candidates, in order of probability, are “Why”, “What”, “Here’s”, “To”, and “I”. (Chart generated by DALL-E via ChatGPT 4o in March 2025 for demonstrative purposes; not an accurate representation of probabilities.)
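The toy sketch below simulates choosing a first token from a distribution like the one in the chart. The candidate tokens and their scores are invented for illustration; a real model scores every token in its vocabulary, often 50,000 or more.

# A toy illustration of next-token prediction with made-up scores.
import math
import random

logits = {"Why": 4.0, "What": 3.1, "Here's": 2.5, "To": 1.2, "I": 0.8}

# Softmax turns raw scores into probabilities that sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok!r}: {p:.2f}")

# The model samples from this distribution (or takes the most likely
# token), which is why the same prompt can yield different outputs.
next_token = random.choices(list(probs), weights=list(probs.values()))[0]
print("Chosen first token:", next_token)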

Although Large Language Models generate output based on probability, they do so using millions or billions of parameters, and the process of generation is often so complex that neither creators nor users really know how a given output was produced. This opaqueness is why AI systems are often referred to as a black box, and it leaves generative AI systems vulnerable to unseen biases and other problems.

Output 

After predicting a sequence of tokens, the LLM decodes the tokens back into natural language (words and sentences readable by a human). The complete response is shared with the user.

✅ “Why did the student bring a ladder to class? To reach higher education!” 
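Decoding simply reverses the tokenisation step shown earlier. Continuing the tiktoken sketch (again, a stand-in for a model’s own tokeniser), the encoded joke decodes back into the response above:

# Decoding reverses tokenisation: a sequence of token IDs becomes text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

predicted_ids = enc.encode(
    "Why did the student bring a ladder to class? To reach higher education!"
)  # stand-in for the token IDs a model would predict one at a time

print(enc.decode(predicted_ids))  # the human-readable response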

The user can then submit a follow-up request referencing the original request or output. This is called iteration.
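Because the model itself retains no memory between requests, chat tools support iteration by resending the whole conversation with each follow-up. The sketch below assumes the official openai Python package and an OpenAI-style chat API with an API key in the OPENAI_API_KEY environment variable; model names and setup vary by provider.

# A minimal sketch of iteration: the running conversation is stored as a
# list of messages and resent in full with each follow-up request.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable
history = [{"role": "user", "content": "Tell me a joke about higher education."}]

reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
history.append({"role": "assistant", "content": reply.choices[0].message.content})

# The follow-up references "that joke" -- the model only understands it
# because the earlier messages are included in the request.
history.append({"role": "user", "content": "Now make that joke shorter."})
reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
print(reply.choices[0].message.content)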

For a more detailed introduction to generative AI, see this video from Google:

Introduction to Generative AI

For a more in-depth look at how LLMs function, see this article from the Financial Times:

Generative AI exists because of the transformer

What are the Limitations of Large Language Models?

Generative AI is evolving quickly but still has certain limitations. Large Language Models (LLMs) are constrained by the data and methods used to train them. It’s important to be aware of the limitations of the tools that you’re using, especially if currency or accuracy matters for the tasks that you’re using generative AI to complete.

  • We do not fully understand how LLMs work, which presents issues for safety, reliability and accuracy.
  • LLMs are susceptible to hallucinations, or the creation of nonsensical words, phrases, or ideas. This can also result in the generation of non-existent references.
  • Many LLMs are pre-trained and have knowledge cut-off dates, meaning that their data may be out of date or inaccurate. However, generative AI tools are increasingly able to retrieve and process information in real time, a technique called Retrieval-Augmented Generation (RAG); see the sketch after this list.
  • There is a trade-off between processing speed and accuracy with LLMs. Many basic models do not fact-check, meaning that the information that they share is not guaranteed to be accurate or logical. These models produce much faster results at the risk of lower accuracy. Reasoning models have increased accuracy because they break tasks down into micro-steps, apply logic, and evaluate possible results. However, they have longer processing times and require significantly more resources. They are also not immune to making mistakes.
  • Standard LLMs produce output based on the probabilities of patterns in their training data, so they are susceptible to reproducing biases found in those data sets, including human biases embedded in historical records, cultures, patterns of research, societal norms, and any other elements reflected in the text data used for their training. This will be discussed more in the Ethics section.
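As promised above, here is a deliberately simplified sketch of the RAG idea. Real systems retrieve from large document stores using vector embeddings; this toy version, with an invented set of documents, just scores keyword overlap.

# A highly simplified sketch of Retrieval-Augmented Generation (RAG).
documents = [
    "The 2025 tuition deadline is September 2.",
    "The library is open 24 hours during exams.",
    "Campus parking permits are sold online.",
]

def retrieve(question: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "When is the tuition deadline?"
context = retrieve(question, documents)

# The retrieved text is pasted into the prompt, so the model can answer
# from current information instead of relying on stale training data.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(prompt)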

License


Domains of AI-Awareness for Education Copyright © 2025 by Dani Dilkes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
