Original Source Here

## Parameters and loss function of Word2Vec

Recall that every word is represented by a vector of 100 numbers. Also recall that every word can serve as a center word or a context word. Thus, **every unique word in the corpus is represented by 2 vectors**: one vector for it as the center word and another for it as a context word.

The parameters theta of Word2Vec are the center and outside vectors of all the unique words in the corpus. The dimensions of theta are 2 times length of vocabulary rows and 100 columns.

In the image, I used lowercase *v* to represent the center word vector of a word, and lowercase *u* to represent the context word vector of a word. Thus, *v_zebra* is the vector for zebra when zebra is the center word. *u_zebra* is the vector for zebra when zebra is a context word of some other center word. Later on, I will use the capital *U* to refer to a matrix containing all of the outside vectors *u*.

The loss function we’ll use is the negative log likelihood (NLL) loss. NLL is used in many machine learning algorithms, like logistic regression.

The equation for NLL is:

We sum over length of unique words* T, *and at each index *t*, we sum the probability of each context word (within window *w*) given the center word at index *t*.

Thus, the loss of one context word given a center word is:

How do you compute the probability? We use the Softmax equation. Below is how you calculate the probability of a context word (o) given a center word (c).

The nominator is the exponent of the dot product of the context vector for the context word *o* with center vector for center word *c*. The denominator is the sum of the exponent of the dot product of each vocabulary word’s context vector with the center word vector. Wow that was a mouthful!

As you can probably already guess, this naive-softmax probability is expensive to compute since we have to sum over the entire vocabulary. There is another more efficient ways to do loss; I’ll describe it later.

In gradient descent, you take the derivative of the loss function with respect to the parameters to find how to update the parameters in each iteration.

For a given center word, we need to calculate the gradient for both the center word vector as well as the gradient for all the context word vectors.

I won’t do the derivations here, but to compute the gradient for center vector, you take the dot product of *U* and (y_hat minus y). *U* is, as I’ve mentioned, the matrix of all outside vectors of the entire vocabulary. To compute the gradients of outside vectors, you take the dot product of the center vector and (y_hat minus y) transposed. y_hat in this case is the Softmax of the dot product of *U* and the center vector.

Overall, the algorithm works like so:

For each iteration:

- Calculate the loss and gradients of a mini-batch (e.g. of size 50).
- Update model parameters like so:

`theta = theta — learning_rate * gradients`

In order to calculate the gradients of a mini-batch of size 50, you iterate 50 times, and during each iteration, you:

- Get a random center word and its context words.
- For each context word, calculate the gradients (by taking the derivative of the loss function).
- Aggregate all the gradients of the context words.

The aggregated gradients of a mini-batch are then used to update theta.

Here’s a Github gist with **some** of the relevant code. Note: I’ve excluded some parts like parsing the corpus and the code for Softmax equation.

The algorithm I just described is called the Skip-Gram algorithm. In Skip-Gram, you predict the probability of a context word given a center word. There’s also another algorithm called CBOW, where you predict the probability of center word given its context instead.

As I’ve mentioned, the loss function we used is expensive to compute. There’s another technique for training Word2Vec called “negative sampling.” In negative sampling, instead of the cross-entropy loss (i.e. negative log likelihood of Softmax function), you train a binary logistic regression for a true pair vs. several noise pairs. The true pair is a center word and one of its context words. A noise pair is the same center word and a random word. In negative sampling, the loss function is much more efficient since you’re not summing over the whole vocabulary.

# Part 2. How Word2Vec solves analogies

*Man is to king as woman is to what?*

When you ask a Word2Vec algorithm to solve this analogy, you’re essentially asking it to find a word that’s most SIMILAR to “king” and “woman”, but most DISSIMILAR to “man.” The similarity here refers to the **cosine similarity of two vectors**.

Thus, after training, the word vector for “queen” is most similar to the vectors for “king” and “woman”, but dissimilar to “man.”

Additionally, since similar words are clustered together in the higher dimensional space, you can think of the vector for “queen” as approximately the vector for “king” minus the vector for “man” plus the vector for “woman.”

And here it is, how we can train computers to solve analogies!

AI/ML

Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot