### Models and Architectures in Word2vec

#### Models

**CBOW (Continuous Bag of Words)**: use the context words to predict the probability of the current (center) word.

- The context words' vectors are \(\upsilon_{c-m}, \dots, \upsilon_{c-1}, \upsilon_{c+1}, \dots, \upsilon_{c+m}\), where \(m\) is the window size and \(c\) is the position of the center word.
- Context vector (the average of the \(2m\) context vectors): \(\hat{\upsilon}=\frac{\upsilon_{c-m}+\upsilon_{c-m+1}+\dots+\upsilon_{c-1}+\upsilon_{c+1}+\dots+\upsilon_{c+m}}{2m}\)
- Score vector: \(z_i = u_i^{\top}\hat{\upsilon}\), where \(u_i\) is the output vector representation of word \(\omega_i\).
- Turn the scores into probabilities: \(\hat{y}=\mathrm{softmax}(z)\)
- We want the predicted probabilities \(\hat{y}\) to match the true probabilities \(y\). We use the cross entropy \(H(\hat{y},y)\) to measure the distance between these two distributions.
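
The CBOW forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference word2vec implementation: the function names `softmax` and `cbow_forward`, the matrix names `V` (input vectors) and `U` (output vectors), the toy sizes, and the random initialization are all assumptions made for the example. It also assumes the true distribution \(y\) is one-hot at the center word, in which case the cross entropy reduces to \(-\log \hat{y}_c\).

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cbow_forward(V, U, context_ids, target_id):
    """Hypothetical helper: one CBOW forward pass.

    V: input (context) word vectors, shape (vocab_size, dim)
    U: output word vectors, shape (vocab_size, dim)
    context_ids: indices of the 2m context words
    target_id: index of the center word
    """
    v_hat = V[context_ids].mean(axis=0)   # context vector: average of 2m vectors
    z = U @ v_hat                         # scores z_i = u_i . v_hat
    y_hat = softmax(z)                    # predicted distribution over the vocabulary
    loss = -np.log(y_hat[target_id])      # cross entropy, assuming a one-hot y
    return y_hat, loss

# Toy setup (assumed values): 10-word vocabulary, 4-dimensional vectors.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
V = rng.normal(scale=0.1, size=(vocab_size, dim))
U = rng.normal(scale=0.1, size=(vocab_size, dim))

# Window size m = 2 around center word 4, so the context is 2m = 4 words.
context_ids = [2, 3, 5, 6]
y_hat, loss = cbow_forward(V, U, context_ids, target_id=4)
print(y_hat.sum(), loss)
```

Training would then adjust `V` and `U` to reduce `loss`; the gradient step is omitted here to keep the sketch focused on the quantities defined above.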