Chapter 1: Human Language and Language Science
1.8 How do Chatbots Synthesize Sentences?
In the previous section we learned that Large Language Models (LLMs) produce text that looks like human language by combining word forms in statistically plausible ways. In this section we dig into that word-combining process in more detail: how do LLMs synthesize realistic sentences?
To understand the logic that underlies a Large Language Model, we have to go back to Claude Shannon’s (1948) work, which was in turn inspired by Andrey Markov’s (1913) statistical analysis of the text of Eugene Onegin. Those papers are the origin of the modern notion of an n-gram (Schwartz, 2019).
An n-gram is some sequence of items that occur in a given order in a given corpus. Let’s start with an example of a bigram: that’s two items that occur together. For this example, let’s consider pairs of letters that occur together in a corpus of English text.
If we search through the entire text and find all the instances of the letter Q, we can then calculate the probability of each letter that follows Q. If Q appears in the text 100 times, it’s fairly likely that 99 of those times the next letter will be U, because of how English spelling works. So our probability estimates would look like this: Q is followed by U 99% of the time, and Q is followed by any other letter less than 1% of the time.
In other words, the bigram QU has a high probability compared to the bigram QA or QB or QC.
On the other hand, if we count all the instances of the letter E in our corpus and then calculate the probability of each following letter, we’d get quite different results. E is much more frequent than Q in English texts, but unlike Q, E can be followed by any letter of the alphabet. If the letter E appears 10,000 times in our corpus, then the probabilities for each following letter might look like this: E followed by A, maybe 3%; E followed by B, 1%; E followed by C, 1%; E followed by D, 5%; and so on.
The higher frequency of E relative to Q is not relevant here: what these bigram probabilities tell us is how likely Thing 2 is given that Thing 1 has just happened. When Thing 1 is a Q, Thing 2 is pretty likely to be a U. When Thing 1 is an E, Thing 2 could be any letter.
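To make the letter-bigram idea concrete, here is a small Python sketch that estimates these conditional probabilities by counting. The one-sentence corpus is invented for illustration; real estimates would come from millions of words of text.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Estimate P(next letter | current letter) from raw bigram counts."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = defaultdict(Counter)
    for first, second in zip(letters, letters[1:]):
        counts[first][second] += 1
    # Turn each row of raw counts into conditional probabilities.
    return {first: {second: n / sum(row.values()) for second, n in row.items()}
            for first, row in counts.items()}

probs = bigram_probabilities("the queen asked a quick question")
print(probs["q"])  # every Q in this tiny corpus is followed by U: {'u': 1.0}
print(probs["e"])  # E is followed by several different letters
</antml_code_interop>```

In this toy corpus the Q row contains a single entry with probability 1.0, while the E row spreads its probability across several letters, which is exactly the Q-versus-E contrast described above.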
Within a Large Language Model (LLM), Thing 1 and Thing 2 are called tokens. A token could be a letter, but more often a token is a word, a piece of a word such as a morpheme, or even a punctuation mark. Let’s consider an example where the tokens are words and we’re calculating the probabilities for trigrams. So now we’re interested in the probability of Token 3, given that Token 1 and Token 2 have just occurred in that order.
If Tokens 1 and 2 are proud of, then it’s pretty likely that Token 3 will be you. In our imaginary corpus let’s give this trigram a probability of 40%. The trigrams proud of him and proud of them are also pretty likely. On the other hand, once you introduce a proper name, the probability of that specific set of three words gets lower.
An n-gram extends the same logic from 2 or 3 items to some number N of items. The probabilities for each n-gram are calculated from how frequently that sequence of tokens occurs in the corpus.
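The same counting logic extends directly from letter bigrams to word trigrams, or to any N. The sketch below estimates the probability of Token 3 given Tokens 1 and 2; the four-sentence corpus is invented for illustration, so the resulting percentages are toy numbers, not real corpus statistics.

```python
from collections import Counter, defaultdict

def ngram_probabilities(tokens, n=3):
    """Estimate P(token N | the preceding N-1 tokens) from a token list."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])   # the first N-1 tokens
        counts[context][tokens[i + n - 1]] += 1
    return {context: {tok: c / sum(row.values()) for tok, c in row.items()}
            for context, row in counts.items()}

corpus = ("i am proud of you . she is proud of him . "
          "we are proud of you . they are proud of them .").split()
probs = ngram_probabilities(corpus, n=3)
print(probs[("proud", "of")])  # {'you': 0.5, 'him': 0.25, 'them': 0.25}
```

Changing the `n` argument gives bigrams, 4-grams, and so on; the only thing that grows is the length of the context being counted.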
An LLM is trained on a vast corpus: billions of texts collected from millions of sources. You can imagine that calculating the exact probabilities of each sequence of tokens very quickly becomes unwieldy. An n-gram is a useful, simple illustration for understanding the idea of co-occurrence probabilities, but LLMs don’t actually use n-grams to generate text. Instead, they use an algorithm called a “neural network” (McCulloch & Pitts, 1943). Here’s how Alex Hanna and Emily Bender explain it:
“Neural nets are composites of mathematical functions called “perceptrons” that each take in multiple inputs and then run a calculation to determine what value to output, based on those inputs. The perceptrons are connected in a network, such that the output of each can serve as the input to many others and each of those connections is associated with a “weight”, which can be interpreted as the strength of the influence of one on the next perceptron in the network.” (Bender & Hanna, 2025).
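As a minimal illustration of that description, here is a tiny network of three perceptron-style units in Python. The weights and biases are hand-picked for the example; in a real neural net there are billions of connections, and the weights are learned from training data rather than chosen by a person.

```python
def perceptron(inputs, weights, bias):
    """One unit: take weighted inputs, output 1 if their sum clears a threshold."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

# Three units wired into a small network: the outputs of the first two
# perceptrons serve as inputs to the third, and every connection
# carries a weight that sets how strongly one unit influences the next.
def tiny_network(x1, x2):
    h1 = perceptron([x1, x2], [1.0, 1.0], -1.5)    # fires only if both inputs fire
    h2 = perceptron([x1, x2], [1.0, 1.0], -0.5)    # fires if either input fires
    return perceptron([h1, h2], [-2.0, 1.0], -0.5) # fires if exactly one input fired

print(tiny_network(1, 0))  # 1
print(tiny_network(1, 1))  # 0
```

Even this three-unit toy shows the key idea in the quotation: each unit’s output becomes another unit’s input, and the behavior of the whole network is determined entirely by the weights on those connections.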
The weights in a neural net are what allow an LLM to predict whether a given token is likely or unlikely to follow another token. In a context like “This cake is ___”, the model will calculate high probabilities for words like delicious, moist, or chocolatey, and low probabilities for words like crunchy, fast, or hairy.
Once a model is trained, it’s ready to generate text. When a user types in a prompt like “Help me describe this cake,” the model synthesizes a sentence that might begin with the words “This cake is…”. Then it draws on its training data to choose a probable token to place next in the sentence. Then it chooses another probable word given its previous choice. And another. And so on.
Most LLMs don’t always choose the next token with the absolute highest probability, because doing so ends up producing a lot of text that sounds exactly the same. Instead, the creators have programmed these models to sample from among many high-probability words, so that the models synthesize sentences that have some variety in them. For a given prompt, the same model will output slightly different sentences each time.
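That sampling step can be sketched in a few lines of Python. The probability table below is invented for the “This cake is ___” example, and a real model samples over a vocabulary of tens of thousands of tokens, but the weighted random choice works the same way.

```python
import random

# Invented next-token probabilities for the context "This cake is ___".
next_token_probs = {
    "delicious": 0.45, "moist": 0.25, "chocolatey": 0.15,
    "crunchy": 0.08, "fast": 0.05, "hairy": 0.02,
}

def sample_next_token(probs, rng=random):
    """Pick the next token at random, weighted by its probability,
    rather than always returning the single most likely token."""
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

random.seed(0)
print([sample_next_token(next_token_probs) for _ in range(5)])
```

Run the sampler several times and delicious will come up most often, but not every time, which is why the same prompt yields slightly different sentences on different runs.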
Given the immense quantity of training data, you can imagine there are a lot of opportunities for things to go wrong. Before a model is released to the public, creators adjust the weights to create safety filters or guardrails. Depending on what the creators’ goals are, they might want to prevent their model from producing sentences that contain swear words, or racial slurs, or other offensive content. Or they might want the model to produce those things! Either way, they want some factor other than just the probabilities to play a role in synthesizing these sentences. So they refine the model with “reinforcement learning from human feedback” (RLHF).
What that means is that human workers, often severely underpaid, rate and tag the LLM’s outputs, and then their ratings are fed back into the model. To understand how this reinforcement learning works, let’s consider another hypothetical example.
Suppose a user prompts:
“I had a bad experience with the receptionist at my doctor’s office. I like the doctor but the receptionist cancelled my appointment without telling me. Write a review that I can post online.”
The chatbot then responds:
“Dr. Doctor is professional and provides good medical care but his receptionist is incompetent. She cancelled my appointment but she did not inform me. I wasted my time going to an appointment that had been cancelled!”
Notice that the model has used a masculine pronoun for the doctor and feminine pronouns for the receptionist even though the prompt did not specify either person’s gender. If the human annotator wanted to change that gendered assumption, they could rate this as a low-quality response, or if they wanted to reinforce the gendered assumption, they could rate it as high quality. Their ratings then affect the outputs of the next version of the model.
To do their RLHF work effectively and prevent models from producing dangerous or hateful content, the human workers have to be exposed to dangerous and hateful content so they can tag it accurately. Many of these people, who are often working for low wages in developing nations, have described the trauma they experience from having to view horrific, violent, gruesome content day after day. Ephantus Kanyugi, the vice-president of the Data Labelers Association in Nairobi, reported, “You have to spend your whole day looking at dead bodies and crime scenes. Mental health support was not provided” (Agence France Presse, 2025).
Once the owners of an LLM chatbot decide it’s had enough training, they release it to human customers, who interact with it much the way they message with other humans. If you text your mom, “how long do I bake nana’s mac & cheese for”, she might reply, “I do it for 25 minutes at 350”.
That interaction feels very similar to what happens if you prompt a chatbot with the same question. When I entered the prompt, “how long do I bake the mac & cheese for,” Copilot suggested a time and temperature, and also synthesized this sentence: “Do you want me to give you a full step-by-step baked mac & cheese recipe?”
Notice that it used the personal pronouns I and you, not because Copilot is a person, but because its creators programmed it to make its outputs feel like a human conversation. The next time you give a prompt to an LLM chatbot, remember that it is not having a conversation with you!
A conversation involves humans communicating meaningful ideas with each other, and ideally understanding each other’s intentions. In contrast, a chatbot synthesizes text based on the probabilities in its training data, reinforced by feedback from underpaid humans with terrible working conditions. The chatbot’s creators have programmed it to make its outputs seem realistic, but it has no communicative intentions and no understanding – all it has is training data and algorithms. In later chapters we’ll spend more time exploring how humans generate and understand meaningful sentences. That will help you understand more deeply how what humans do is different from what chatbots do.
Check your Understanding
References
Agence France Presse. (2025). The gruelling, low-paid human work behind generative AI curtain. Canadian Affairs.
Bender, E. M., & Hanna, A. (2025). The AI Con: How to fight big tech’s hype and create the future we want. HarperCollins.
Markov, A. A. (1913). An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains. Science in Context, 19(4), 591–600.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
Schwartz, O. (2019). Andrey Markov & Claude Shannon Counted Letters to Build the First Language-Generation Models. IEEE Spectrum.
Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379–423, 623–656.