"

1.2 How do large language models work?

Training

The AI model is pre-trained on a large dataset, typically of general texts or images. Specialized AI models may instead be trained on subject- or domain-specific data. The AI analyses the data, looking for patterns, themes, relationships, and other characteristics that can be used to generate new content.

For example, early models of GPT (the LLM used by both ChatGPT and Copilot) were trained on hundreds of gigabytes of text data, including books, articles, websites, publicly available texts, licensed data, and human-generated data.

Human intervention can occur at all stages of training.

Humans may:

  • Create and modify the initial dataset, removing messy or problematic data
  • Assess the quality of output from the AI model

The model is then released for use and can be accessed by users.

Generation

LLMs generate text in response to a user-provided prompt.

A diagram showing the cycle of interacting with a Large Language Model in four steps: the user inputs a prompt; the LLM tokenises the prompt; the LLM predicts a response; the LLM shares the output.
Prompting 

A user provides a prompt, asking the AI model to perform a specific task, generate text, produce an image, or create other types of content.

Prompt: Tell me a joke about higher education.
Tokenization 

The AI breaks the prompt into tokens (words, parts of words, or other meaningful chunks) and analyses these tokens to understand the meaning and context of what is being asked.

Token Breakdown: ["Tell", "me", "a", "joke", "about", "higher", 
"education", "."]

NOTE: Words might also be broken into subword tokens like “high”, “er”, “edu”, “cation”.
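
As a concrete illustration, the sketch below uses the open-source tiktoken library (one of the tokenizers used by GPT models) to split the prompt into token IDs and show the text fragment each ID represents. This is a minimal sketch assuming tiktoken is installed; exact splits and IDs vary from model to model.

    # A minimal sketch of tokenization, assuming the tiktoken library is installed
    # (pip install tiktoken). Exact token splits and IDs vary by model.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    prompt = "Tell me a joke about higher education."
    token_ids = enc.encode(prompt)  # prompt -> list of integer token IDs

    for tid in token_ids:
        piece = enc.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
        print(tid, repr(piece))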

These tokens are then converted into vectors (numerical values) that represent the position of each token in relation to other tokens, capturing how likely they are to occur in sequence.
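
The toy sketch below illustrates the idea: each token ID indexes a row in an embedding table, turning the token into a vector of numbers the model can compare with other tokens. The table size, dimensions, and token IDs here are invented for illustration; real models learn these values during training.

    # A toy illustration of token-to-vector conversion; the values are random,
    # not learned, and the token IDs below are hypothetical.
    import numpy as np

    vocab_size, embedding_dim = 50_000, 8      # real models use far larger values
    rng = np.random.default_rng(0)
    embedding_table = rng.normal(size=(vocab_size, embedding_dim))

    token_ids = [12648, 757, 264, 22380]       # hypothetical IDs for "Tell me a joke"
    vectors = embedding_table[token_ids]       # one row (vector) per token
    print(vectors.shape)                       # (4, 8): 4 tokens, 8 numbers each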

Prediction 

The LLM analyses the vectors and, based on the patterns and other information learned in training, begins to predict a response to the prompt based on the probability of response tokens appearing in sequence.

For example, in responding to our request to generate a joke, the most likely starting tokens may be “Why”, “What”, “Here’s”, and so on.

 

A bar chart titled “Probability Distribution for Next Token”. The next-token candidates, in order of probability, are “Why”, “What”, “Here’s”, “To”, and “I”.
Probability chart generated by DALL-E via ChatGPT 4o in March 2025 for demonstrative purposes; not necessarily an accurate representation of probabilities.
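
The short sketch below shows the general idea behind such a distribution: the model assigns a score to each candidate token, and a softmax converts those scores into probabilities. The candidates come from the example above; the scores are invented for illustration.

    # A toy sketch of next-token prediction: invented scores (logits) for the
    # candidate tokens are converted into probabilities with a softmax.
    import numpy as np

    candidates = ["Why", "What", "Here's", "To", "I"]
    scores = np.array([4.2, 3.1, 2.5, 1.0, 0.4])      # hypothetical model scores

    probabilities = np.exp(scores) / np.exp(scores).sum()
    for token, p in zip(candidates, probabilities):
        print(f"{token!r}: {p:.2f}")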
Output 

After predicting a sequence of tokens, the LLM decodes the tokens back into natural language (words and sentences readable by a human). The complete response is shared with the user.

✅ “Why did the student bring a ladder to class? To reach higher education!” 
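
The toy loop below mirrors this process: at each step a stand-in “model” predicts the next token given everything generated so far, the token is appended, and the loop stops at an end token before the sequence is joined back into readable text. A real LLM replaces the canned list with learned probabilities.

    # A toy autoregressive loop; predict_next_token is a stand-in, not a real model.
    def predict_next_token(tokens_so_far):
        canned = ["Why", "did", "the", "student", "bring", "a", "ladder",
                  "to", "class", "?", "<end>"]
        return canned[len(tokens_so_far)]      # a real model predicts from probabilities

    generated = []
    while True:
        next_token = predict_next_token(generated)
        if next_token == "<end>":
            break
        generated.append(next_token)

    print(" ".join(generated))                 # tokens joined back into a sentence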

The user can then submit a follow-up request referencing the original request or output. This is called iteration.

For a more detailed introduction to Generative AI, see this video from Google:

Introduction to Generative AI

For a more in-depth look at how LLMs function, see this article from the Financial Times:

Generative AI exists because of the transformer

What are the Limitations of Large Language Models?

Generative AI is evolving quickly but still has certain limitations. Large Language Models (LLMs) are constrained by the data and methods used to train them. It’s important to be aware of the limitations of the tools you’re using, especially when currency or accuracy matters for the tasks you’re completing with Generative AI.

  • Currently, LLMs function as pattern replicators, which means they generate output based on averages or probabilities of patterns.
  • LLMs are susceptible to hallucinations, or the creation of nonsensical words, phrases, or ideas. This can also result in the generation of non-existent references.
  • LLMs do not fact-check, meaning that the information that they share is not guaranteed to be accurate or logical.
  • Many LLMs are pre-trained and have knowledge cut-off dates, meaning that their data may be out of date or inaccurate. However, ongoing advancements have allowed some models to overcome this limitation by accessing and processing information in real time, for example through web search.
  • LLMs are susceptible to reproducing biases found in their datasets, including but not limited to human biases embedded in historical records, cultures, patterns of research, societal norms, and any other elements reflected in the text data used for their training. This will be discussed further in the Ethics section.


License


AI Literacy for Higher Education Copyright © by ddilkes2 is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.