\(s_t\) is the target hidden state at step \(t\) and \(h_i\) is the \(i\)-th source hidden state; each has shape (n,1). \(c_t\) is the final context vector, and \(\alpha_{t,i}\) is the attention weight, obtained by normalizing the alignment scores over the \(m\) source positions.

\[\begin{aligned} c_t&=\sum_{i=1}^m \alpha_{t,i}h_i \\
\alpha_{t,i}&= \frac{\exp(score(s_t,h_i))}{\sum_{j=1}^m \exp(score(s_t,h_j))} \end{aligned} \]
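The two equations above can be sketched in a few lines of plain Python. This is only an illustration: the function names `softmax` and `context_vector`, the three toy source states, and the scores are all made up for the example, and the scores are taken as given rather than computed by a particular score function.

```python
import math

def softmax(scores):
    """Normalize raw alignment scores into weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, source_states):
    """Weighted sum of the source hidden states h_i."""
    dim = len(source_states[0])
    return [sum(w * h[k] for w, h in zip(weights, source_states))
            for k in range(dim)]

# Hypothetical data: three source states of dimension 2, made-up scores.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 1.0, 0.1]

weights = softmax(scores)                     # one weight per source state
c_t = context_vector(weights, source_states)  # same dimension as each h_i
```

The highest score gets the largest weight, so the context vector leans toward the first source state here.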

Global (Soft) vs. Local (Hard)

Global attention takes all source hidden states into account, whereas local attention uses only a subset of them.
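A rough sketch of the difference, assuming the local variant simply keeps a fixed window of source positions and ignores the rest. The helper `attention_weights` and its `center` / `half_width` parameters are hypothetical names chosen for this example; published local-attention variants typically also learn where to place the window, which is not shown here.

```python
import math

def attention_weights(scores, center=None, half_width=None):
    """Softmax over all positions (global) or over a window (local).

    When center and half_width are given, positions outside
    [center - half_width, center + half_width] receive weight 0.
    """
    n = len(scores)
    if center is None:
        keep = range(n)
    else:
        lo = max(0, center - half_width)
        hi = min(n, center + half_width + 1)
        keep = range(lo, hi)
    peak = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps.get(i, 0.0) / total for i in range(n)]

scores = [0.5, 2.0, 1.0, 0.2, 3.0]  # made-up scores for 5 source positions
global_w = attention_weights(scores)                         # every position
local_w = attention_weights(scores, center=1, half_width=1)  # positions 0..2
```

With the window centered on position 1, the last two positions get zero weight even though position 4 has the highest score.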

Content-based vs. Location-based

Content-based attention scores each source position by comparing the target hidden state with that source hidden state, whereas location-based attention computes the weights from the target hidden state alone.

Here are several popular attention mechanisms:

Dot-Product

\[score(s_t,h_i)=s_t^Th_i\]

Scaled Dot-Product

\[score(s_t,h_i)=\frac{s_t^Th_i}{\sqrt{n}}\] where \(n\) is the dimension of the vectors. Google's Transformer model applies a similar scaling factor when computing self-attention: \(score=\frac{QK^T}{\sqrt{d_k}}\), where \(d_k\) is the dimension of the keys.
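As a small illustration of the dot-product and scaled dot-product scores, here is a sketch on plain Python lists. The function names and the two toy vectors are assumptions made for the example.

```python
import math

def dot_score(s_t, h_i):
    """Dot-product score: s_t^T h_i."""
    return sum(a * b for a, b in zip(s_t, h_i))

def scaled_dot_score(s_t, h_i):
    """Dot-product score divided by sqrt(n), with n the vector dimension."""
    return dot_score(s_t, h_i) / math.sqrt(len(s_t))

# Hypothetical vectors of dimension n = 4.
s_t = [1.0, 2.0, 3.0, 4.0]
h_i = [0.5, 0.5, 0.5, 0.5]

print(dot_score(s_t, h_i))         # 5.0
print(scaled_dot_score(s_t, h_i))  # 2.5
```

The scaling keeps the scores from growing with the dimension, which would otherwise push the softmax toward a near one-hot distribution.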

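Location-Based

\[\alpha_t=softmax(W_as_t)\]

Here the softmax output is used directly as the vector of attention weights, with one entry per source position, so no separate score is computed for each \(h_i\).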
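A minimal sketch of the location-based variant, assuming a fixed number of source positions so that \(W_a\) has one row per position. The matrix values, the helper names, and the choice of three positions are all hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    peak = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def location_weights(W_a, s_t):
    """Attention weights computed from the target state alone."""
    return softmax(matvec(W_a, s_t))

# Hypothetical W_a: 3 source positions (rows), hidden size n = 2 (columns).
W_a = [[0.1, 0.2],
       [0.0, -0.3],
       [0.5, 0.5]]
s_t = [1.0, 2.0]

alpha_t = location_weights(W_a, s_t)  # one weight per source position
```

Note that no source hidden state appears in the computation, which is what separates this variant from the content-based scores.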

General

\[score(s_t,h_i)=s_t^TW_ah_i\]

\(W_a\) has shape (n,n).
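The general score can be sketched as a matrix-vector product followed by a dot product. The function name and the two example matrices are made up for illustration; in practice \(W_a\) would be learned.

```python
def general_score(s_t, W_a, h_i):
    """General score: s_t^T W_a h_i, with W_a of shape (n, n)."""
    Wh = [sum(w * h for w, h in zip(row, h_i)) for row in W_a]
    return sum(s * x for s, x in zip(s_t, Wh))

s_t = [1.0, 2.0]
h_i = [3.0, 4.0]

identity = [[1.0, 0.0], [0.0, 1.0]]  # reduces to the plain dot product
swap = [[0.0, 1.0], [1.0, 0.0]]      # swaps the two entries of h_i first

print(general_score(s_t, identity, h_i))  # 11.0
print(general_score(s_t, swap, h_i))      # 10.0
```

With the identity matrix the general score equals the dot-product score, so the learned \(W_a\) can be seen as a generalization of it.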

Concat

\[score(s_t,h_i)=v_a^T\tanh(W_a[s_t;h_i])\]

\(v_a\) has shape (x,1) and \(W_a\) has shape (x,2n), since it acts on the concatenation of two n-dimensional vectors; it is square only when x = 2n. This is similar to a neural network with one hidden layer.
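The concat score can be sketched as a tiny one-hidden-layer network. The example below assumes n = 2 and x = 4, so \(W_a\) is a 4×4 matrix acting on the concatenated vector; all parameter values and the function name are hypothetical.

```python
import math

def concat_score(s_t, h_i, W_a, v_a):
    """Concat score: v_a^T tanh(W_a [s_t; h_i])."""
    concat = s_t + h_i  # list concatenation, length 2n
    hidden = [math.tanh(sum(w * c for w, c in zip(row, concat)))
              for row in W_a]  # hidden layer of size x
    return sum(v * u for v, u in zip(v_a, hidden))

s_t = [1.0, -1.0]
h_i = [0.5, 2.0]

# Hypothetical parameters: W_a is 4x4 (x = 4, 2n = 4), v_a has 4 entries.
W_a = [[0.2, 0.0, 0.1, 0.0],
       [0.0, 0.3, 0.0, -0.1],
       [0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, -0.2, 0.4]]
v_a = [1.0, -1.0, 0.5, 2.0]

score = concat_score(s_t, h_i, W_a, v_a)
```

Because tanh is bounded by 1 in absolute value, the score can never exceed the sum of the absolute values of \(v_a\), unlike the dot-product scores, which grow with the vector norms.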

When I worked on a slot-filling project, I compared these mechanisms, and concat attention produced the best results.
