
Feature Extraction using Word2Vec | NLP | Python

Word2Vec is a popular technique in natural language processing (NLP) for learning word embeddings: dense numerical representations of words in a continuous vector space. In this tutorial we will use it in Python with the Gensim library. These embeddings capture semantic relationships between words, making them useful for various NLP tasks like machine translation, text classification, sentiment analysis, and more.

Word2Vec embeddings have proven to be very useful due to their ability to capture semantic similarities and relationships between words. Similar words often end up with similar vector representations, and algebraic operations on these vectors can produce meaningful results.



You can watch the video-based tutorial with a step-by-step explanation down below.


Import Modules

from gensim.test.utils import common_texts
from gensim.models import Word2Vec
  • gensim.test.utils - This imports common_texts, a small set of example texts included with Gensim. These texts are often used for testing and demonstration purposes.

  • gensim.models - This imports the Word2Vec model class from the Gensim library, allowing you to create and train Word2Vec models.


Display the Text Data


Let us see what we have in the common_texts variable that we have imported.

# text data
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

  • The common_texts variable contains a list of example sentences provided by the Gensim library for demonstration and testing purposes. These sentences are typically used to showcase the functionality of various text processing and natural language processing methods.

  • Each sublist within common_texts represents a sentence, and the words within each sublist represent the tokens in that sentence.
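
If you want to train on your own text instead of common_texts, the data just needs to be in the same list-of-token-lists format. The snippet below is a minimal sketch (not part of the original tutorial) using Gensim's simple_preprocess for basic tokenization; raw_docs and my_texts are illustrative names.

# a minimal sketch: preparing your own corpus in the same format as common_texts
# (raw_docs and my_texts are illustrative names, not from the tutorial)
from gensim.utils import simple_preprocess

raw_docs = [
    "Human computer interface for lab applications",
    "A survey of user opinion of computer system response time",
]

# each document becomes a list of lowercase tokens
my_texts = [simple_preprocess(doc) for doc in raw_docs]
print(my_texts)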


Fit the Model


Next let us initialize and fit the model.

# initialize and train the model on the text data
# (note: in Gensim versions before 4.0, vector_size was named size)
model = Word2Vec(common_texts, vector_size=100, min_count=1)
  • Here we're initializing and fitting a Word2Vec model using the common_texts data with the Gensim library.

  • common_texts is the list of sentences (text data) that you want to use for training the model.

  • vector_size=100 specifies the dimensionality of the word embeddings. In this case, each word will be represented as a vector of length 100. (Older Gensim versions, before 4.0, called this parameter size.)

  • min_count=1 indicates the minimum number of occurrences a word must have in the training data to be included in the vocabulary. A value of 1 means that all words in the training data will be included.

  • After this code snippet, the model variable will contain the trained Word2Vec model, and you can use it to access word embeddings and perform various word-related operations.
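
Word2Vec exposes several other training parameters beyond vector_size and min_count. The snippet below is a minimal sketch (not part of the original tutorial) showing a few commonly used ones; the values and the model_sg name are illustrative choices.

# a sketch with a few other commonly used Word2Vec parameters
# (values and the model_sg name are illustrative, not from the tutorial)
model_sg = Word2Vec(
    common_texts,
    vector_size=100,  # dimensionality of the embeddings
    window=5,         # maximum distance between the current and predicted word
    min_count=1,      # keep every word, even those seen only once
    sg=1,             # 1 = Skip-gram, 0 = CBOW (the default)
    workers=4,        # number of training threads
    epochs=20,        # passes over the corpus
)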


Display the Feature Vector


Let us retrieve the word embedding for the word "graph" from the trained Word2Vec model.

model.wv['graph']

array([-0.00042112,  0.00126945, -0.00348724,  0.00373327,  0.00387501,
       -0.00306736, -0.00138952, -0.00139083,  0.00334137,  0.00413064,
        0.00045129, -0.00390373, -0.00159695, -0.00369461, -0.00036086,
        0.00444261, -0.00391653,  0.00447466, -0.00032617,  0.00056412,
       -0.00017338, -0.00464378,  0.00039338, -0.00353649,  0.0040346 ,
        0.00179682, -0.00186994, -0.00121431, -0.00370716,  0.00039535,
       -0.00117291,  0.00498948, -0.00243317,  0.00480749, -0.00128626,
       -0.0018426 , -0.00086148, -0.00347201, -0.0025697 , -0.00409948,
        0.00433477, -0.00424404,  0.00389087,  0.0024296 ,  0.0009781 ,
       -0.00267652, -0.00039598,  0.00188174, -0.00141169,  0.00143257,
        0.00363962, -0.00445332,  0.00499313, -0.00013036,  0.00411159,
        0.00307077, -0.00048517,  0.00491026, -0.00315512, -0.00091287,
        0.00465486,  0.00034458,  0.00097905,  0.00187424, -0.00452135,
       -0.00365111,  0.00260027,  0.00464861, -0.00243504, -0.00425601,
       -0.00265299, -0.00108813,  0.00284521, -0.00437486, -0.0015496 ,
       -0.00054869,  0.00228153,  0.00360572,  0.00255484, -0.00357945,
       -0.00235164,  0.00220505, -0.0016885 ,  0.00294839, -0.00337972,
        0.00291201,  0.00250298,  0.00447992, -0.00129002,  0.0025    ,
       -0.00430755, -0.00419162, -0.00029911,  0.00166961,  0.00417119,
       -0.00209666,  0.00452041,  0.00010931, -0.00115822, -0.00154263],
      dtype=float32)

  • Here, model refers to the trained Word2Vec model that you initialized and fitted using the common_texts data.

  • The .wv attribute gives you access to the word vectors, and ['graph'] is used to retrieve the word embedding for the word "graph".

  • The resulting embedding will be a numerical vector of the specified dimensionality (100 in this case) that represents the semantic information of the word "graph" in the continuous vector space learned by the Word2Vec model.
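
Since the goal is feature extraction, a common next step (not covered in the tutorial) is to build a single fixed-length feature vector for an entire tokenized sentence by averaging the vectors of its words. The sentence_vector helper below is a hypothetical sketch that assumes the model trained above.

import numpy as np

# hypothetical helper: average the Word2Vec vectors of the in-vocabulary tokens
def sentence_vector(tokens, model):
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        # no known words: fall back to a zero vector of the right size
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

features = sentence_vector(['graph', 'minors', 'survey'], model)
print(features.shape)  # (100,)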


Next let us find words that are most similar in meaning to the word "graph".

model.wv.most_similar('graph')

[('interface', 0.1710839718580246),
 ('user', 0.08987751603126526),
 ('trees', 0.07364125549793243),
 ('minors', 0.045832667499780655),
 ('computer', 0.025292515754699707),
 ('system', 0.012846874073147774),
 ('human', -0.03873271495103836),
 ('survey', -0.06853737682104111),
 ('time', -0.07515352964401245),
 ('eps', -0.07798048853874207)]

  • model.wv.most_similar('graph') finds the words most similar in meaning to the word "graph" based on the learned word embeddings from the trained Word2Vec model. This method returns a list of word and similarity score pairs.

  • The result is a list of tuples, where each tuple contains a word and its cosine similarity score. The words are ranked by their similarity to the word "graph", with the most similar word listed first.

  • The specific similarity scores and words will depend on the quality of the trained model and the data it was trained on.
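
Gensim's word vectors support a few other handy queries besides most_similar. The calls below are a small sketch; keep in mind that on a toy corpus like common_texts the numbers carry little meaning.

# cosine similarity between two specific words
print(model.wv.similarity('graph', 'trees'))

# pick the word that fits least with the others
print(model.wv.doesnt_match(['graph', 'trees', 'minors', 'human']))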


Final Thoughts

  • Word2Vec allows us to represent words as dense vectors in a continuous space, capturing the semantic relationships and contextual meaning between words. This enables better semantic understanding compared to traditional sparse representations like one-hot encoding.

  • Word embeddings produced by Word2Vec often exhibit interesting and useful properties. Words with similar meanings tend to have similar vector representations, and algebraic operations on these vectors can yield meaningful results (e.g., "king" - "man" + "woman" ≈ "queen"); see the sketch after this list.

  • Word2Vec uses unsupervised learning to learn these embeddings from large amounts of text data. It learns to encode semantic relationships between words without requiring explicit linguistic features or labeled data.

  • Word2Vec embeddings are widely used as input features for a variety of NLP tasks, including machine translation, text classification, sentiment analysis, named entity recognition, and more. Pre-trained Word2Vec embeddings are often used to improve performance on these tasks, especially when the available training data is limited.

  • While the original CBOW and Skip-gram models are foundational, there have been advancements and variations in the Word2Vec approach. Techniques like negative sampling and subword embeddings (fastText) have expanded the capabilities of word embeddings.

  • Word2Vec has some limitations. It might struggle with rare words or words that didn't appear in the training data. Additionally, it doesn't capture polysemy (multiple meanings of a word) very well.

  • The quality and size of the training data play a significant role in the quality of the learned embeddings. Larger, diverse, and domain-specific datasets tend to yield better results.

  • Training Word2Vec models requires computational resources, especially for large datasets and higher-dimensional embeddings. Pre-trained embeddings are available to mitigate this for many use cases.

  • While Word2Vec was a groundbreaking technique, subsequent methods like GloVe (Global Vectors for Word Representation) and Transformer-based models (BERT, GPT) have achieved even higher levels of performance and contextual understanding.
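
As noted in the list above, the "king" - "man" + "woman" ≈ "queen" analogy only emerges from vectors trained on a large corpus; the tiny common_texts model will not reproduce it. The sketch below uses Gensim's downloader API with a pre-trained vector set (a large download on first use):

import gensim.downloader as api

# load pre-trained Google News Word2Vec vectors (downloads on first use)
pretrained_wv = api.load('word2vec-google-news-300')

# vector arithmetic: king - man + woman
print(pretrained_wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# the top result is expected to be 'queen'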

In conclusion, Word2Vec has been a transformative approach that paved the way for many advancements in NLP and computational linguistics. Its ability to capture and quantify semantic relationships between words has opened up numerous possibilities for improved language understanding and downstream applications.



Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
