Google Colab Notebook: https://colab.research.google.com/drive/1-0yZZmDe6RDmeyrWNXOg-ruT4ZPjK6li#scrollTo=6r0uEFlXj4zZ
Yes, we covered tokenization in the previous tutorial. However, NLP models don't like words much; they are kind of numberverts, they like to work with numbers. For a neural network or a GPT to understand language, we need to supply it with words as numbers. So, in this tutorial, we are going to understand vectorization.
To make a word meaningful to an NLP model, we follow this process:
1. Tokenize the text.
2. Assign each token a unique ID so it can be represented as a number.
3. Vectorization: assign a randomly initialized n-dimensional vector to each of these IDs.
4. Embedding: to give the vectors meaning, a neural network or model is trained on them. The word vectors are embedded into an embedding space, and as a result, similar words end up with similar vectors after training, which is what allows the model to understand that they are related.
Let's jump into the practical side of it, which should give us much better insight. First, let me demonstrate some tools that can help us tokenize text and convert the words into a numerical representation.
from transformers import AutoTokenizer
# load the WordPiece tokenizer that ships with BERT
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
# step 1: split the text into tokens
tokens = tokenizer.tokenize("Happy Bday to me!")
print(tokens)
# step 2: map each token to its unique ID in the vocabulary
token_to_numbers = tokenizer.convert_tokens_to_ids(tokens)
print(token_to_numbers)
#Output
['happy', 'b', '##day', 'to', 'me', '!']
[3407, 1038, 10259, 2000, 2033, 999]
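As a side note, the tokenizer can do both steps in one call. Here is a minimal sketch reusing the same tokenizer object as above; calling it directly returns the token IDs wrapped with BERT's special [CLS] and [SEP] tokens:
# one call does tokenization and ID conversion, plus the special tokens
encoded = tokenizer("Happy Bday to me!")
print(encoded["input_ids"])                    # same IDs as above, wrapped with [CLS] (101) and [SEP] (102)
print(tokenizer.decode(encoded["input_ids"]))  # map the IDs back to text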
Let us try to tokenize, vectorize, and generate the word embedding for the word "Dog".
# generation of the word vector for the word "dog"
import torch
from transformers import BertModel
token_dog = tokenizer.convert_tokens_to_ids(["dog"])[0]
print("token for the word Dog is: ", token_dog)
model = BertModel.from_pretrained("bert-base-uncased")
# look up the row of BERT's embedding matrix that belongs to this token ID
word_vector_dog = model.embeddings.word_embeddings(torch.tensor([token_dog]))
#print("Word vector for dog is ", word_vector_dog)
print("shape of word vector is ", word_vector_dog.shape)
#Output
token for the word Dog is: 3899
shape of word vector is torch.Size([1, 768])
The word vector has 1 row and 768 columns, which matches BERT's hidden size. Why 768 columns? Each column captures one learned characteristic (feature) of the token.
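If you want to see where the 768 comes from, you can read it off the model we loaded above; this quick check also shows that every token in BERT's vocabulary gets its own 768-dimensional vector:
# 768 is BERT-base's hidden size; the embedding matrix has one row per vocabulary token
print(model.config.hidden_size)                       # 768
print(model.embeddings.word_embeddings.weight.shape)  # torch.Size([30522, 768])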
That's all good, but how can we prove that the word embeddings work? Let me generate the word embeddings for 'Dog', 'Wolf', and 'Fish', and then we can see it for ourselves.
But how do we compare these arrays of numbers for similarity? You might remember from school that cos(0°) = 1 and cos(180°) = -1. This means we can use the cosine of the angle between two vectors to compare them: if they point in the same direction, the score is close to 1; if they point in opposite directions, it is close to -1.
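To make that concrete, here is a minimal sketch on toy vectors that computes cosine similarity by hand, i.e. the dot product divided by the product of the vector lengths; this is the same formula that torch.nn.CosineSimilarity applies below:
# cosine similarity by hand: dot(u, v) / (|u| * |v|)
import torch
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])     # points in the same direction as a, just scaled
c = torch.tensor([-1.0, -2.0, -3.0])  # points in the opposite direction
def cosine(u, v):
    return torch.dot(u, v) / (u.norm() * v.norm())
print(cosine(a, b))  # close to 1: the vectors are aligned
print(cosine(a, c))  # close to -1: the vectors point opposite ways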
# checking the similarity of the words 'Dog', 'Wolf' and 'Fish'
# look up the pre-trained embedding vector for each word
token_dog_id = tokenizer.convert_tokens_to_ids(["dog"])[0]
embedding_dog = model.embeddings.word_embeddings(torch.tensor([token_dog_id]))
token_wolf_id = tokenizer.convert_tokens_to_ids(["wolf"])[0]
embedding_wolf = model.embeddings.word_embeddings(torch.tensor([token_wolf_id]))
token_fish_id = tokenizer.convert_tokens_to_ids(["fish"])[0]
embedding_fish = model.embeddings.word_embeddings(torch.tensor([token_fish_id]))
# compare the vectors pairwise with cosine similarity
cos = torch.nn.CosineSimilarity(dim=1)
dog_wolf_similarity = cos(embedding_dog, embedding_wolf)
print("Dog-Wolf Similarity score is ", dog_wolf_similarity)
dog_fish_similarity = cos(embedding_dog, embedding_fish)
print("Dog-Fish Similarity score is ", dog_fish_similarity)
#Output:
Dog-Wolf Similarity score is tensor([0.3983], grad_fn=<SumBackward1>)
Dog-Fish Similarity score is tensor([0.3537], grad_fn=<SumBackward1>)
Notice that the cosine score for Dog-Wolf is higher than for Dog-Fish, which matches our intuition that a dog is more similar to a wolf than to a fish.
If you want to build an intuition for embeddings, I would suggest playing around with https://projector.tensorflow.org/.