Basic Text Processing – N-Grams
One of the basic operations in text processing is to identify and analyze combinations or sequences of words in order to assess how the individual words are related. One tool for this task is the n-gram, also known as a Q-gram. The concept of n-grams originated in computational linguistics and probability theory. An n-gram is defined as a contiguous sequence of n words, phonemes, syllables, or other textual items (the n in n-gram) derived from a text; the specific type of text or text feature analyzed is application dependent. N-gram analysis is usually performed on a large collection of texts, or corpus. A 1-gram is defined as a string of characters uninterrupted by a space or other punctuation; 1-grams include single words (including misspelled ones), abbreviations, and numbers. A sequence of 1-grams (or unigrams) comprises an n-gram, where n denotes the number of 1-grams. Thus, for instance, “barometric pressure” is a 2-gram (n = 2), known as a bigram, and the phrase “sleight of hand” is a 3-gram (n = 3), also called a trigram. For larger n, the n-gram is usually denoted by n followed by the suffix “gram”, such as four-gram or ten-gram. Because of their usefulness, n-grams have wide applicability in the digital humanities. Beyond this domain, n-grams are used in auto-completion, spell-checking, voice-based assistant bots, and other applications in natural language processing (NLP) (Srinidhi).
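To make the definition concrete, the following short sketch generates unigrams, bigrams, and trigrams from a sentence. It assumes a simple regular-expression tokenizer; the function name ngrams and the sample sentence are illustrative, not drawn from any particular corpus.

    import re

    def ngrams(text, n):
        """Split text into lowercase word tokens and return the list of
        n-grams, i.e., tuples of n consecutive words."""
        tokens = re.findall(r"[a-z0-9']+", text.lower())  # simple word tokenizer (assumption)
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    sentence = "the barometric pressure dropped before the storm"
    print(ngrams(sentence, 1))  # unigrams: [('the',), ('barometric',), ('pressure',), ...]
    print(ngrams(sentence, 2))  # bigrams:  [('the', 'barometric'), ('barometric', 'pressure'), ...]
    print(ngrams(sentence, 3))  # trigrams: [('the', 'barometric', 'pressure'), ...]

Each n-gram is simply a window of n consecutive tokens slid across the text, which is why a text of length L yields L - n + 1 n-grams.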
To see how n-grams can be used in auto-completion applications, such as many current email systems, consider a system that employs the bigram (2-gram) model along with basic probability theory. Given a large corpus, or collection of texts, unigram and bigram counts are computed for every word and every two-word sequence. Suppose that a user types a word, say w1, and the goal is for the system to suggest the next word. Probabilities are then calculated for each candidate next word w2, and the suggestion is taken from the (w1, w2) bigram with the highest probability given w1. For instance, if w1 has a unigram count of 70 (i.e., w1 occurs 70 times in the training corpus) and the bigram (w1, w2) has a count of 14 (i.e., (w1, w2) occurs 14 times in the training corpus), then the estimated probability that w2 follows w1 is count(w1, w2) / count(w1) = 14 / 70 = 0.2; equivalently, P(w1, w2) / P(w1), since the corpus size cancels. The probabilities for all bigrams beginning with w1 can be determined in the same way, and the second word of the bigram with the highest probability is selected as the suggested word following w1. As with any machine learning method, the algorithm must be trained on a large amount of data to ensure accurate estimates of the probability that a word w2 occurs after a specified first word w1. Large corpora (collections) of text are often good sources of training material for these algorithms.
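A minimal sketch of this bigram-based suggestion scheme follows, assuming whitespace-tokenized training text. The function names train and suggest_next and the toy token list are hypothetical, chosen only to illustrate the count(w1, w2) / count(w1) calculation described above.

    from collections import Counter

    def train(corpus_tokens):
        """Count unigrams and bigrams over a list of tokens."""
        unigrams = Counter(corpus_tokens)
        bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
        return unigrams, bigrams

    def suggest_next(w1, unigrams, bigrams):
        """Return the word w2 that maximizes count(w1, w2) / count(w1),
        together with that estimated probability."""
        candidates = {w2: c / unigrams[w1]
                      for (a, w2), c in bigrams.items() if a == w1}
        if not candidates:
            return None, 0.0
        best = max(candidates, key=candidates.get)
        return best, candidates[best]

    # Toy corpus (assumption, for illustration only)
    tokens = "the barometric pressure fell and the barometric pressure rose".split()
    unigrams, bigrams = train(tokens)
    print(suggest_next("barometric", unigrams, bigrams))  # ('pressure', 1.0)

In a real system the counts would come from a large training corpus rather than a single sentence, and additional steps such as smoothing unseen bigrams would be needed, but the core calculation is the ratio of bigram count to unigram count shown above.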