Thanks to the articles listed at the end of this post, I now understand how Transformers work. Those posts are comprehensive, but a few points confused me.
First, this is the diagram referenced by almost every post about the Transformer.
The Transformer consists of these parts: Input, Encoder*N, Output Input, Decoder*N, Output. I’ll explain them step by step.
Input
Each input word is mapped to a 512-dimensional vector.

\(s_t\) is the target hidden state and \(h_i\) are the source hidden states; each has shape (n,1). \(c_t\) is the final context vector, and \(\alpha_{t,i}\) is the alignment weight of source position \(i\).
\[\begin{aligned} c_t&=\sum_{i=1}^n \alpha_{t,i}h_i \\
\alpha_{t,i}&= \frac{\exp(score(s_t,h_i))}{\sum_{j=1}^n \exp(score(s_t,h_j))} \end{aligned} \]
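The two formulas can be traced with a minimal sketch in plain Python. The dot-product `score` and the toy vectors are assumptions made for illustration; the post does not say which score function it uses.

```python
import math

def dot_score(s_t, h_i):
    """Dot-product score; one common choice, assumed here for illustration."""
    return sum(a * b for a, b in zip(s_t, h_i))

def attention(s_t, source_states):
    """Return (weights, context) for a single target hidden state s_t."""
    scores = [dot_score(s_t, h) for h in source_states]
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the softmax result.
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the source hidden states.
    dim = len(source_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, source_states))
               for k in range(dim)]
    return weights, context

# Toy values (made up): three source states and one target state.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [1.0, 0.0]
weights, context = attention(s_t, source_states)
```

The weights always sum to 1, and source states that score higher against \(s_t\) contribute more to \(c_t\).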
Global (Soft) vs. Local (Hard)
Global attention takes all source hidden states into account, while local attention uses only a subset of them.
Content-based vs. Location-based
Content-based attention uses both source and target hidden states, while location-based attention uses only the source hidden states.

PyTorch provides a simple DQN implementation that solves the CartPole game. However, the code is flawed: training diverges (this has been discussed here).
The training result of the official code is below; its high score is about 50, and it eventually diverges.
Several things lead to the divergence.
First, the tutorial uses the difference between two frames as input. This not only loses the cart’s absolute position (which is useful, since the game terminates if the cart moves too far from the centre), but also confuses the agent when the difference is the same while the underlying states differ.

PyTorch is a really powerful framework for building machine learning models. Although some features are missing compared with TensorFlow (for example, early stopping and a History object for drawing plots), its code style is more intuitive.
Torchtext is an NLP package that is also made by the PyTorch team. It provides a way to read, process, and iterate over text.
Google Colab is a Jupyter notebook environment hosted by Google; you can use a free GPU or TPU to run your model.

LSTM
To avoid the vanishing and exploding gradient problems of the vanilla RNN, the LSTM was proposed; it can remember information for longer periods of time.
Here is the structure of the LSTM:
The computation proceeds as follows:
\[\begin{aligned} f_t&=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)\\
i_t&=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)\\
o_t&=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)\\
\tilde{C_t}&=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)\\
C_t&=f_t\ast C_{t-1}+i_t\ast \tilde{C_t}\\
h_t&=o_t \ast \tanh(C_t) \end{aligned}\]
\(f_t\), \(i_t\), and \(o_t\) are the forget gate, input gate, and output gate, respectively. \(\tilde{C_t}\) is the new candidate memory content, and \(C_t\) is the cell state.
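As a rough sketch, one time step of these equations can be written in plain Python. The parameter dictionary, the toy sizes, and the constant weights are assumptions chosen only to keep the example small.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, v, b):
    """Compute W·v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the equations above."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]
    i = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]
    o = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]
    c_tilde = [math.tanh(a) for a in affine(params["W_C"], z, params["b_C"])]
    # Element-wise: keep part of the old cell state, add part of the candidate.
    c_t = [f_k * c_k + i_k * g_k
           for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t

# Toy sizes (assumed): hidden size 2, input size 3, all weights 0.1, biases 0.
hidden, inputs = 2, 3
params = {}
for name in ("f", "i", "o", "C"):
    params["W_" + name] = [[0.1] * (hidden + inputs) for _ in range(hidden)]
    params["b_" + name] = [0.0] * hidden

h_t, c_t = lstm_step([1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0], params)
```

Because \(C_t\) is updated additively through the forget gate rather than being squashed by a full matrix multiplication at every step, information can survive across many time steps.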

Models
CBOW (Continuous Bag of Words)
Use the context to predict the probability of the current word.
Context words’ vectors are \(\upsilon_{c-m} \dots \upsilon_{c+m}\), the \(2m\) words around position \(c\) (\(m\) is the window size).
Context vector: \(\hat{\upsilon}=\frac{\upsilon_{c-m}+\upsilon_{c-m+1}+\dots+\upsilon_{c+m}}{2m}\)
Score vector: \(z_i = u_i\hat{\upsilon}\), where \(u_i\) is the output vector representation of word \(\omega_i\).
Turn the scores into probabilities: \(\hat{y}=softmax(z)\)
We want the probabilities \(\hat{y}\) to match the true probabilities \(y\). We use the cross entropy \(H(\hat{y},y)\) to measure the distance between these two distributions.
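The CBOW forward pass can be sketched in a few lines of plain Python. The matrices `V` (input vectors) and `U` (output vectors), the five-word vocabulary, and all numbers are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, V, U):
    """V holds one input vector per word, U one output vector per word."""
    dim = len(V[0])
    # Average the context word vectors to get the context vector.
    v_hat = [sum(V[w][k] for w in context_ids) / len(context_ids)
             for k in range(dim)]
    # One score per vocabulary word: z_i = u_i · v_hat.
    z = [sum(u_k * v_k for u_k, v_k in zip(u, v_hat)) for u in U]
    return softmax(z)

def cross_entropy(y_hat, target_id):
    """With a one-hot y, the cross entropy reduces to -log y_hat[target]."""
    return -math.log(y_hat[target_id])

# Toy vocabulary of 5 words with 2-dimensional vectors (values made up).
V = [[0.1, 0.2], [0.4, -0.1], [-0.3, 0.5], [0.2, 0.2], [0.0, -0.4]]
U = [[0.3, 0.1], [-0.2, 0.4], [0.5, -0.5], [0.0, 0.1], [0.2, 0.3]]

# Window size m = 2: words 0, 1, 3, 4 are the context of center word 2.
y_hat = cbow_forward(context_ids=[0, 1, 3, 4], V=V, U=U)
loss = cross_entropy(y_hat, target_id=2)
```

Training would adjust `V` and `U` to lower `loss`, which raises the predicted probability of the true center word.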

Here is a simple way to classify text without much human effort and still get impressive performance.
It can be divided into two steps:
1. Get training data by using keyword classification
2. Generate a more accurate classification model by using doc2vec and label spreading
Keyword-based Classification
Keyword-based classification is a simple but effective method. Extracting the target keywords is monotonous work. I use this method to automatically extract keyword candidates.
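To make the first step concrete, here is a minimal sketch of keyword-based labeling. The class names, keyword lists, and the `keyword_label` helper are hypothetical; the keyword extraction method mentioned above is not reproduced here.

```python
def keyword_label(text, keywords_by_class):
    """Return the class whose keywords occur most often, or None if none match."""
    tokens = text.lower().split()
    counts = {label: sum(tokens.count(k) for k in keywords)
              for label, keywords in keywords_by_class.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

# Hypothetical keyword lists, for illustration only.
keywords_by_class = {
    "sports": ["match", "goal", "team"],
    "finance": ["stock", "market", "bank"],
}

docs = [
    "The team scored a late goal",
    "The stock market fell sharply",
    "Nothing relevant here",
]
labels = [keyword_label(d, keywords_by_class) for d in docs]
```

Documents that match no keyword stay unlabeled (`None`); these are the ones the second step would label using doc2vec vectors and label spreading.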

Here are some parameters of gensim’s doc2vec class.
window
window is the maximum distance between the predicted word and the context words used for prediction within a document. It looks both behind and ahead.
In the skip-gram model, if the window size is 2, the training samples look like this (the blue word is the input word):
min_count
Words that appear fewer times than this value are skipped.
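The effect of both parameters can be sketched in plain Python. This is a simplified illustration, not gensim’s internal code; `skipgram_pairs` and `filter_rare` are hypothetical helper names.

```python
from collections import Counter

def filter_rare(tokens, min_count):
    """Drop words that occur fewer than min_count times in the token list."""
    freq = Counter(tokens)
    return [t for t in tokens if freq[t] >= min_count]

def skipgram_pairs(tokens, window=2):
    """Build (input word, context word) pairs, looking behind and ahead."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=2)
# The first input word "the" has no words behind it, so it only pairs
# with the two words ahead: ("the", "quick") and ("the", "brown").

kept = filter_rare(["a", "b", "a", "c", "a", "b"], min_count=2)
```

A larger `window` yields more pairs per input word, and a larger `min_count` shrinks the vocabulary by dropping rare words before any pairs are built.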

As I said before, I’m working on a text classification project. I use doc2vec to convert text into vectors, then I use LPA to classify the vectors.
LPA is a simple, effective semi-supervised algorithm. It can use the density of unlabeled data to find a hyperplane to split the data.
Here are the main steps of the algorithm:
Let $(x_1,y_1) \dots (x_l,y_l)$ be the labeled data, where $Y_L = \{y_1 \dots y_l\}$ are the class labels.
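For orientation, here is a minimal sketch of one common formulation of label propagation: build Gaussian edge weights, normalise them into transition probabilities, then repeatedly propagate labels while clamping the labeled points. The kernel, `sigma`, the iteration count, and the toy data are assumptions, and the steps may differ in detail from the variant this post describes.

```python
import math

def label_propagation(points, labels, sigma=1.0, iterations=100):
    """labels[i] is a class for labeled points and None for unlabeled ones."""
    n = len(points)
    classes = sorted({y for y in labels if y is not None})

    def weight(a, b):
        # Gaussian kernel on squared Euclidean distance (an assumed choice).
        d2 = sum((p - q) ** 2 for p, q in zip(a, b))
        return math.exp(-d2 / sigma ** 2)

    W = [[weight(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Row-normalise the weights into transition probabilities.
    T = [[w / sum(row) for w in row] for row in W]

    def initial(y):
        # One-hot for labeled points, uniform for unlabeled ones.
        if y is None:
            return [1.0 / len(classes)] * len(classes)
        return [1.0 if c == y else 0.0 for c in classes]

    Y = [initial(y) for y in labels]
    for _ in range(iterations):
        # Each point takes the weighted average of its neighbours' labels.
        Y = [[sum(T[i][j] * Y[j][k] for j in range(n))
              for k in range(len(classes))] for i in range(n)]
        # Clamp the labeled points back to their known labels.
        for i, y in enumerate(labels):
            if y is not None:
                Y[i] = initial(y)
    return [classes[max(range(len(classes)), key=lambda k: row[k])]
            for row in Y]

# Toy 1-D data (made up): two clusters with one labeled point each.
points = [[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]]
labels = [0, None, None, 1, None, None]
predicted = label_propagation(points, labels)
```

In this toy example each unlabeled point ends up with the label of the labeled point in its own cluster, because the edge weights across the gap between the clusters are close to zero.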