Bag-of-words representation of text

Consider the following text:

A (real) vector is just a collection of real numbers, referred to as the components (or, elements) of the vector; \mathbb{R}^n denotes the set of vectors with n elements. If x \in \mathbb{R}^n denotes a vector, we use subscripts to denote elements, so that x_i is the i-th component of x. Vectors are arranged in a column, or a row. If x is a column vector, x^T denotes the corresponding row vector, and vice-versa.

The row vector x = [5,4,5] contains the number of times each word in the list {vector, of, the} appears in the above paragraph, counting exact matches only (so the plural "vectors" is not counted as "vector"). Vectors can thus be used to represent text documents. This representation, often referred to as the bag-of-words representation, is not faithful, as it ignores the order in which the words appear. In addition, stop words (such as "the" or "of") are often ignored as well.
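
The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name bag_of_words, the short sample document, and the tokenization rule (lowercase the text, then keep runs of letters) are all assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in text, as a list (a row vector)."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur, so no special case is needed.
    return [counts[word] for word in vocabulary]


vocabulary = ["vector", "of", "the"]

# Hypothetical sample document, chosen only to keep the counts easy to check by hand.
doc = "The vector of the row is the transpose of the column vector"
x = bag_of_words(doc, vocabulary)
print(x)  # [2, 2, 4]

# Word order is ignored: reversing the order of the words gives the same vector.
reversed_doc = " ".join(reversed(doc.split()))
same_vector = bag_of_words(reversed_doc, vocabulary) == x
print(same_vector)  # True

# Ignoring stop words amounts to removing them from the vocabulary.
stop_words = {"of", "the"}
content_vocabulary = [w for w in vocabulary if w not in stop_words]
x_content = bag_of_words(doc, content_vocabulary)
print(x_content)  # [2]
```

The second check makes the loss of information concrete: two documents containing the same words in a different order map to the same vector, so the original text cannot be recovered from x alone.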

License

Hyper-Textbook: Optimization Models and Applications Copyright © by L. El Ghaoui. All Rights Reserved.