Feature Extraction of Text Data using Bag of Words | NLP

Bag of Words (BoW) is a simple and popular technique used in natural language processing (NLP) and information retrieval to represent text data numerically. It converts text documents into numerical feature vectors, which can then be used for machine learning tasks such as text classification, sentiment analysis, or clustering. This tutorial implements it in Python with scikit-learn.

The basic idea behind the Bag of Words approach is to treat each document as an unordered collection or "bag" of words, disregarding the grammar and word order, and only considering the frequency of each word occurrence.
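The idea can be sketched with the standard library alone. The helper name bag_of_words and the plain whitespace tokenization are assumptions made for illustration; they are not how CountVectorizer tokenizes internally.

```python
from collections import Counter

def bag_of_words(document):
    """Count lowercase, whitespace-separated tokens, ignoring their order."""
    return Counter(document.lower().split())

# Two documents with the same words in a different order
bag_a = bag_of_words("good tutorial with good topic")
bag_b = bag_of_words("topic good with tutorial good")

print(bag_a["good"])   # frequency of one word in the first document
print(bag_a == bag_b)  # True: word order does not affect the bag
```

Because only the counts are kept, both documents collapse to the same bag.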


Let us see how this technique works with the help of an example.

Input data

First, we will create a list of sentences.

text_data = ['I am interested in NLP', 'This is a good tutorial with good topic', 'Feature extraction is very important topic']

Here, we have created a list of three short sentences. Each sentence is treated as one document.

Import Module

from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(stop_words='english')

The CountVectorizer class allows you to create BoW vectors directly from text data without manually going through the tokenization and vectorization steps.
By setting the stop_words parameter to 'english', you instruct the CountVectorizer to remove common English stop words such as "the", "and", and "is", using scikit-learn's built-in list. These words are often filtered out in NLP tasks because they occur in almost every document and usually carry little meaning on their own.
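To see what the stop word filter does before fitting anything, the following sketch checks a few words against scikit-learn's built-in English list and runs the vectorizer's analyzer on one of the sentences. The variable names analyzer and tokens are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# ENGLISH_STOP_WORDS is the built-in list used when stop_words='english'
for word in ["the", "and", "is", "nlp"]:
    print(word, word in ENGLISH_STOP_WORDS)

# build_analyzer() returns the callable that lowercases, tokenizes,
# and filters stop words for a single document
analyzer = CountVectorizer(stop_words='english').build_analyzer()
tokens = analyzer("This is a good tutorial with good topic")
print(tokens)
```

Only the tokens that survive this step can end up in the vocabulary.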

Fit the data

Next, we fit the vectorizer on the data that we created earlier.

# fit the data
bow.fit(text_data)

CountVectorizer(stop_words='english')

By calling bow.fit(text_data), the CountVectorizer scans the input text data, tokenizes each document, removes stop words (since stop_words='english'), and builds the vocabulary.
After this step, the CountVectorizer has learned the mapping of words to their corresponding positions in the vector space.
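The learned mapping can be inspected through the fitted vectorizer's vocabulary_ attribute, a dictionary from each word to its column index. A minimal sketch that repeats the setup so it runs on its own:

```python
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['I am interested in NLP',
             'This is a good tutorial with good topic',
             'Feature extraction is very important topic']

bow = CountVectorizer(stop_words='english')
bow.fit(text_data)

# vocabulary_ maps each retained word to its position in the feature vector
for word, index in sorted(bow.vocabulary_.items(), key=lambda item: item[1]):
    print(index, word)
```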

Display the vocabulary list

Next we will get the vocabulary list.

# get the vocabulary list
# (get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2)
list(bow.get_feature_names_out())

['extraction', 'feature', 'good', 'important', 'interested', 'nlp', 'topic', 'tutorial']

After fitting the CountVectorizer with the text data and learning the vocabulary, you can obtain the feature names (words) in the vocabulary using the get_feature_names_out() method. It returns a NumPy array of strings, which is converted to a list here for display; each element is one word from the vocabulary.
The vocabulary list contains the words that remained in the text data after tokenization and stop word removal. These words are the features of the BoW vectors, and each position in a vector corresponds to one of them, in alphabetical order.
We can see that the vocabulary list contains only the content-bearing words.

Transform the data

Next we will transform the data.

bow_features = bow.transform(text_data)
bow_features

<3x8 sparse matrix of type '<class 'numpy.int64'>' with 9 stored elements in Compressed Sparse Row format>

After fitting the CountVectorizer and learning the vocabulary from the text_data, you can use the transform method to create the Bag of Words (BoW) feature vectors for the given text_data.
The resulting bow_features will be a sparse matrix representing the BoW representation of the text data.
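A sparse matrix stores only the non-zero counts. The sketch below, which repeats the setup so it runs on its own, reads the shape and the number of stored elements and then lists each stored count with its row (document) and column (vocabulary index). The name stored is chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['I am interested in NLP',
             'This is a good tutorial with good topic',
             'Feature extraction is very important topic']

bow = CountVectorizer(stop_words='english')
bow.fit(text_data)
bow_features = bow.transform(text_data)

print(bow_features.shape)  # (number of documents, vocabulary size)
print(bow_features.nnz)    # number of explicitly stored counts

# Convert to coordinate format to walk over the stored entries
coo = bow_features.tocoo()
stored = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
for row, col, count in stored:
    print(f"document {row}, column {col}: count {count}")
```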

Visualize the BoW features

To better visualize the BoW feature vectors, you can convert the sparse matrix into a dense matrix using the toarray() method:

bow_feature_array = bow_features.toarray()
bow_feature_array

array([[0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 2, 0, 0, 0, 1, 1],
       [1, 1, 0, 1, 0, 0, 1, 0]], dtype=int64)

The dense NumPy array is easier to read than the sparse matrix, but it stores every zero explicitly, so it is best reserved for small examples like this one.
In bow_feature_array, each row corresponds to a document (sentence) from text_data and each column to a word from the learned vocabulary. Each element is the number of times that word occurs in that document.

Next let us print the values more clearly.

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(list(bow.get_feature_names_out()))
for sentence, feature in zip(text_data, bow_feature_array):
    print(sentence)
    print(feature)

['extraction', 'feature', 'good', 'important', 'interested', 'nlp', 'topic', 'tutorial']
I am interested in NLP
[0 0 0 0 1 1 0 0]
This is a good tutorial with good topic
[0 0 2 0 0 0 1 1]
Feature extraction is very important topic
[1 1 0 1 0 0 1 0]

The code above first prints the vocabulary list and then displays each sentence along with its corresponding Bag of Words (BoW) feature vector from the dense NumPy array bow_feature_array.
In the output, the vocabulary list contains the words that remain in the text data after tokenization and stop word removal.
Each element in a feature vector is the frequency, within that sentence, of the vocabulary word at the same position.
For example, "good" is the third vocabulary word and occurs twice in the second sentence, so the third element of that sentence's vector is 2.
Words not present in a particular sentence have a frequency of 0 in its feature vector.

Final Thoughts

BoW is easy to understand and implement, making it an excellent starting point for text-based machine learning tasks, especially for beginners.
BoW vectors are mostly zeros, so they can be stored as sparse matrices, which is memory-efficient, especially when dealing with large text corpora.
BoW can be combined with various machine learning algorithms, such as Naive Bayes, Logistic Regression, and Support Vector Machines, making it versatile for different tasks like text classification, sentiment analysis, and topic modeling.
BoW can be applied to different languages, as it relies on the frequency of words rather than linguistic rules, although it does assume the text can be split into word tokens.
BoW disregards the word order, leading to the loss of important syntactic and semantic information. For instance, "good movie" and "movie good" will have the same BoW representation.
The BoW representation can result in a large feature space, especially for extensive vocabularies, which may lead to the "curse of dimensionality."
BoW does not consider the context in which words appear in a sentence, which may lead to ambiguity in meaning.

Overall, the Bag of Words approach serves as a foundation for understanding text data and has paved the way for more sophisticated NLP techniques that have emerged in recent years. When used judiciously and in combination with other techniques, BoW can be a valuable tool in the NLP practitioner's toolkit.

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm
