Consider the following text:
A (real) vector is just a collection of real numbers, referred to as the components (or, elements) of the vector; [latex]\mathbb{R}^n[/latex] denotes the set of vectors with [latex]n[/latex] elements. If [latex]x \in \mathbb{R}^n[/latex] denotes a vector, we use subscripts to denote elements, so that [latex]x_i[/latex] is the [latex]i[/latex]-th component of [latex]x[/latex]. Vectors are arranged in a column, or a row. If [latex]x[/latex] is a column vector, [latex]x^T[/latex] denotes the corresponding row vector, and vice-versa.
The row vector [latex]x = [5,3,4][/latex] contains the number of times each word in the list {vector, of, the} appears in the above paragraph. Vectors can thus be used to represent text documents. This representation, often referred to as the bag-of-words representation, is not faithful, as it ignores the order in which the words appear. In addition, stop words (such as "the" or "of") are often ignored as well.
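As a minimal sketch of how such a count vector might be computed, the Python snippet below tallies the occurrences of each word in a fixed list. The function name `bag_of_words`, the sample sentences, and the tokenization rule (lowercase the text, then keep runs of letters) are assumptions made for illustration. Exact counts depend on such choices, for instance on whether a plural like "vectors" is counted as "vector".

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the list of counts of each vocabulary word in `text`.

    Tokenization is an assumption of this sketch: the text is lowercased
    and split into maximal runs of the letters a-z, so punctuation and
    digits act as separators and only exact word matches are counted.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]


# Hypothetical sample sentence, chosen only for illustration.
sample = "The cat sat on the mat, and the dog sat too."
x = bag_of_words(sample, ["the", "sat", "dog"])
print(x)  # [3, 2, 1]

# Word order is lost: two different sentences give the same vector.
vocab = ["dog", "bites", "man"]
print(bag_of_words("dog bites man", vocab) == bag_of_words("man bites dog", vocab))  # True
```

The last comparison illustrates why the representation is not faithful: two sentences with different meanings map to the same vector. Ignoring stop words amounts to leaving them out of the word list passed as `vocabulary`.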
See also: Bag-of-words representation of text: measure of document similarity.