Bag-of-words representation of text
Consider the following text:
A (real) vector is just a collection of real numbers, referred to as the components (or, elements) of the vector; $\mathbf{R}^n$ denotes the set of vectors with $n$ elements. If $x \in \mathbf{R}^n$ denotes a vector, we use subscripts to denote elements, so that $x_i$ is the $i$-th component of $x$. Vectors are arranged in a column, or a row. If $x$ is a column vector, $x^T$ denotes the corresponding row vector, and vice-versa.
A row vector can record the number of times each word in the list {vector, of, the} appears in the above paragraph, with one component per word. Vectors can thus be used to represent text documents. This representation, often referred to as the bag-of-words representation, is not faithful, as it ignores the order in which the words appear. In addition, stop words (such as "the" or "of") are often ignored as well.
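As a rough illustration of the counting described above, here is a minimal Python sketch. The function name bag_of_words, the sample sentence, and the tokenization rule (lowercase, alphabetic tokens only, exact word matches so that "vectors" does not count as "vector") are assumptions made for this example, not part of the original text.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary, stop_words=()):
    """Return one count per vocabulary word, in vocabulary order.

    Hypothetical helper: tokenization is an assumption (lowercased,
    alphabetic tokens only, exact matches, no stemming).
    """
    # Lowercase the text and keep runs of letters; punctuation is dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count every token that is not a stop word.
    counts = Counter(t for t in tokens if t not in stop_words)
    # A Counter returns 0 for words it has never seen.
    return [counts[word] for word in vocabulary]


# Made-up sample text, used only to exercise the function.
sample = "The cat saw the dog. The dog saw a vector of cats."
vocabulary = ["vector", "of", "the"]

print(bag_of_words(sample, vocabulary))                             # [1, 1, 3]
print(bag_of_words(sample, vocabulary, stop_words={"the", "of"}))   # [1, 0, 0]

# Word order is lost: two different sentences give the same vector.
words = ["dog", "man", "bites"]
print(bag_of_words("dog bites man", words) == bag_of_words("man bites dog", words))  # True
```

The last line shows why the representation is not faithful: sentences with different meanings can map to the same count vector. Applying the function to the paragraph quoted above gives counts that depend on the tokenization choices, for example on whether plural forms are merged with singular ones.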