## Models

### CBOW (Continuous Bag of Words)

Use the context words to predict the probability of the current (center) word.

1. The context words’ vectors are $$\upsilon_{c-m},…,\upsilon_{c-1},\upsilon_{c+1},…,\upsilon_{c+m}$$ ($$m$$ is the window size)
2. Context vector $$\hat{\upsilon}=\frac{\upsilon_{c-m}+\upsilon_{c-m+1}+…+\upsilon_{c+m}}{2m}$$, the average of the $$2m$$ context vectors (excluding $$\upsilon_c$$)
3. Score vector $$z_i = u_i^T\hat{\upsilon}$$, where $$u_i$$ is the output vector representation of word $$\omega_i$$
4. Turn the scores into probabilities $$\hat{y}=softmax(z)$$
5. We want the predicted probabilities $$\hat{y}$$ to match the true probabilities $$y$$.

We use cross entropy $$H(\hat{y},y)$$ to measure the distance between these two distributions. $H(\hat{y},y)=-\sum_{j=1}^{\lvert V \rvert}{y_j\log(\hat{y}_j)}$

$$y$$ is a one-hot vector: only the entry for the true center word $$c$$ equals $$1$$, so the loss simplifies to: $H(\hat{y},y)=-y_c\log(\hat{y}_c)=-\log(\hat{y}_c)$

For a perfect prediction, $$\hat{y}_c=1$$ and $$H(\hat{y},y)=-1\cdot\log(1)=0$$.
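For an imperfect prediction, say $$\hat{y}_c=0.25$$ (an illustrative value), the loss is $$-\log(0.25)\approx 1.39$$; the penalty grows as the predicted probability of the true word shrinks.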

Accordingly, we can formulate the loss function:

\begin{aligned} minimize\ J &=-\log P(\omega_c\lvert \omega_{c-m},…,\omega_{c-1},\omega_{c+1},…,\omega_{c+m}) \\ &= -\log P(u_c \lvert \hat{\upsilon}) \\ &= -\log \frac{\exp(u_c^T\hat{\upsilon})}{\sum_{j=1}^{\lvert V \rvert}\exp (u_j^T\hat{\upsilon})} \\ &= -u_c^T\hat{\upsilon}+\log \sum_{j=1}^{\lvert V \rvert}\exp (u_j^T\hat{\upsilon}) \end{aligned}
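The steps above map directly to a few lines of numpy. This is a minimal sketch with toy sizes and randomly initialized matrices; the names `V_in` and `U_out` (the input and output vector tables) and all word indices are illustrative, not from the original.

```python
import numpy as np

# Toy sizes (assumed): a 10-word vocabulary, 5-dimensional embeddings, window m = 2.
vocab_size, dim = 10, 5
rng = np.random.default_rng(0)
V_in = rng.normal(size=(vocab_size, dim))   # input vectors v_i, one row per word
U_out = rng.normal(size=(vocab_size, dim))  # output vectors u_i, one row per word

context_ids = [1, 2, 4, 5]   # indices of the 2m context words
center_id = 3                # index of the center word w_c

# Step 2: average the context vectors to get v_hat
v_hat = V_in[context_ids].mean(axis=0)

# Step 3: scores z_i = u_i^T v_hat for every word in the vocabulary
z = U_out @ v_hat

# Step 4: softmax turns the scores into probabilities y_hat
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()

# Step 5: cross entropy against the one-hot target reduces to -log y_hat[c],
# which equals -u_c^T v_hat + log sum_j exp(u_j^T v_hat)
loss = -np.log(y_hat[center_id])
print(loss)
```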

### Skip-Gram

Use the current (center) word to predict its context words.

1. We get the input vector $$\upsilon_c$$ of the current word
2. For each word $$\omega_j$$ in the vocabulary, compute a score $$z_j = u_j^T\upsilon_c$$ against its output vector $$u_j$$
3. Turn the scores into probabilities $$\hat{y}=softmax(z)$$; the same distribution is used to predict each of the $$2m$$ context words $$\omega_{c-m},…,\omega_{c-1},\omega_{c+1},…,\omega_{c+m}$$
4. We want the predicted probabilities $$\hat{y}$$ to match the true probabilities $$y$$.

\begin{aligned} minimize\ J &=-\log P(\omega_{c-m},…,\omega_{c-1},\omega_{c+1},…,\omega_{c+m}\lvert \omega_c)\\ &=-\log \prod_{j=0,j\ne m}^{2m}P(\omega_{c-m+j}\lvert \omega_c)\\ &=-\log \prod_{j=0,j\ne m}^{2m}P(u_{c-m+j}\lvert \upsilon_c)\\ &=-\log \prod_{j=0,j\ne m}^{2m}\frac{\exp (u^T_{c-m+j}\upsilon_c)}{\sum_{k=1}^{\lvert V \rvert}{\exp (u^T_k \upsilon_c)}}\\ &=-\sum_{j=0,j\ne m}^{2m}{u^T_{c-m+j}\upsilon_c}+2m\log \sum_{k=1}^{\lvert V \rvert} \exp(u^T_k \upsilon_c) \end{aligned}
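A matching numpy sketch of the Skip-Gram loss under the same toy setup as before; since every term depends only on $$\upsilon_c$$, the log-partition term is computed once and counted $$2m$$ times.

```python
import numpy as np

vocab_size, dim = 10, 5
rng = np.random.default_rng(0)
V_in = rng.normal(size=(vocab_size, dim))   # input vectors v_i
U_out = rng.normal(size=(vocab_size, dim))  # output vectors u_i

center_id = 3
context_ids = [1, 2, 4, 5]                  # the 2m surrounding context words

v_c = V_in[center_id]
scores = U_out @ v_c                        # u_k^T v_c for every word k

# log sum_k exp(u_k^T v_c), shared by all 2m context positions (log-sum-exp trick)
log_partition = np.log(np.exp(scores - scores.max()).sum()) + scores.max()

# J = -sum_j u_{c-m+j}^T v_c + 2m * log_partition
loss = -scores[context_ids].sum() + len(context_ids) * log_partition
print(loss)
```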

## Architectures

Minimizing $$J$$ directly is expensive, because the summation runs over the whole vocabulary $$\lvert V \rvert$$. There are two ways to reduce the computation: Hierarchical Softmax and Negative Sampling.

### Hierarchical Softmax

Encode the words into a Huffman tree, so that each word has a Huffman code. The probability $$P(\omega\lvert Context(\omega))$$ then becomes the probability of choosing the right path from the root to the word's leaf node, where each internal node is a binary classification. Suppose code $$0$$ is the positive label and $$1$$ is the negative label. The probability of a positive classification is $\sigma(X^T_\omega \theta)=\frac{1}{1+e^{-X^T_\omega \theta}}$

Then the probability of a negative classification is $1-\sigma(X^T_\omega \theta)$. For example, for a word whose Huffman code is $$1001$$ (a path through four internal nodes), the decisions along the path are:

\begin{aligned} p(d_2^\omega\lvert X_\omega,\theta^\omega_1)&=1-\sigma(X^T_\omega \theta^\omega_1)\\ p(d^\omega_3\lvert X_\omega,\theta^\omega_2)&=\sigma(X^T_\omega \theta^\omega_2)\\ p(d^\omega_4\lvert X_\omega,\theta^\omega_3)&=\sigma(X^T_\omega \theta^\omega_3)\\ p(d^\omega_5\lvert X_\omega,\theta^\omega_4)&=1-\sigma(X^T_\omega \theta^\omega_4)\\ \end{aligned}

where $$\theta$$ is the parameter vector stored in each internal node.

The probability of the example word 足球 (football) is the product of these factors.

Generally,

$p(\omega\lvert Context(\omega))=\prod_{j=2}^{l^\omega}p(d^\omega_j\lvert X_\omega,\theta^\omega_{j-1})$ where $$l^\omega$$ is the number of nodes on the path from the root to the leaf of $$\omega$$.
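A small sketch of how one word's probability is assembled from the binary decisions along its Huffman path, following the general formula above. The code $$1001$$, the node parameters, and $$X_\omega$$ below are illustrative stand-ins.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim = 5
rng = np.random.default_rng(0)

x_w = rng.normal(size=dim)                   # X_w, e.g. the averaged context vector in CBOW
code = [1, 0, 0, 1]                          # Huffman code d_2 ... d_5 of the target word
thetas = rng.normal(size=(len(code), dim))   # theta_1 ... theta_4, one vector per internal node

# p(w | Context(w)) = prod_j p(d_j | X_w, theta_{j-1}):
# code 0 (positive) contributes sigma(X_w^T theta), code 1 (negative) contributes 1 - sigma(X_w^T theta)
prob = 1.0
for d_j, theta in zip(code, thetas):
    p_positive = sigmoid(x_w @ theta)
    prob *= p_positive if d_j == 0 else (1.0 - p_positive)

print(prob)
```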

### Negative Sampling

Choose some negative samples and add the probability of those negative words to the loss function: maximize the probability of the observed (positive) word pairs and minimize the probability of the negative ones.

Let $$P(D=1 \lvert \omega,c)$$ be the probability that the pair $$(\omega,c)$$ came from the corpus data, and $$P(D=0 \lvert \omega,c)$$ the probability that it did not. Then the objective function is

$\theta = \text{argmax} \prod_{(\omega,c)\in D} P(D=1\lvert \omega,c,\theta) \prod_{(\omega,c)\in \tilde{D}} P(D=0\lvert \omega,c,\theta)$

where $$\theta$$ denotes the parameters of the model (the vectors $$\upsilon$$ and $$u$$), $$D$$ is the set of observed word-context pairs, and $$\tilde{D}$$ is the set of negative samples.
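A sketch of the resulting per-pair loss, taking $$P(D=1\lvert \omega,c,\theta)=\sigma(u^T\upsilon)$$ for the pair's output and input vectors, as in the usual word2vec formulation; the word indices and the number of negative samples below are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, dim = 10, 5
rng = np.random.default_rng(0)
V_in = rng.normal(size=(vocab_size, dim))    # input vectors v
U_out = rng.normal(size=(vocab_size, dim))   # output vectors u

center_id, positive_id = 3, 5                # an observed (w, c) pair from the corpus
negative_ids = [0, 7, 8]                     # words drawn from a noise distribution

v_c = V_in[center_id]

# Maximizing the objective above is minimizing its negative log:
# -log sigma(u_o^T v_c) for the positive pair, -log sigma(-u_k^T v_c) for each negative pair,
# using 1 - sigma(x) = sigma(-x).
loss = -np.log(sigmoid(U_out[positive_id] @ v_c))
loss -= np.log(sigmoid(-U_out[negative_ids] @ v_c)).sum()
print(loss)
```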
